
JOURNAL OF COMPUTING, VOLUME 2, ISSUE 7, JULY 2010, ISSN 2151-9617

HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG 33

An Indexing Technique for Web Ontologies
¹Abad Shah, ²Amjad Farooq, ¹Syed Ahsan and ²Mohammad Imran
Abstract - In the Semantic Web, ontologies are usually stored as Resource Description Framework (RDF) documents. Various techniques, including advanced indexing techniques based on path indexes, keyword indexes, suffix arrays and the linked data approach, are used to index RDF documents. However, most existing RDF indexing techniques suffer from two major problems that result in poor performance: a large index size and a slow lookup time. In this paper, we propose a technique for indexing RDF documents with a smaller index size and a faster lookup time. We also present a lightweight implementation of the proposed indexing scheme using Java, Perl and MySQL as the database system. We use a synthetic dataset from the Lehigh University Benchmark containing 2.8 million triples to compare our technique with implemented semantic systems such as Jena2, Sesame and Redland.

Index Terms - RDF documents, RDF based indexing techniques, index size, and lookup time.
——————————  ——————————

1 - INTRODUCTION
Web ontologies are represented through logic-based technologies such as RDF, RDF-S and OWL [1]. RDF represents ontologies as triples, which form a special kind of directed acyclic graph. In essence, RDF is a set of statements describing things [2].

Indexing is an important process in an information retrieval system. Basically, indexing is performed by assigning to each document keywords or descriptive terms that represent it. The assigned terms must reflect the content of the document to allow effective keyword searching. In the past, indexing was done manually by trained persons familiar with the topics of the texts. Today, with the increasing availability of electronic texts online, manual indexing is too slow and too expensive. Automatic text indexing, which is much faster and less error-prone, has become commonplace [3].

Different techniques for indexing RDF documents are in use these days. The most common and popular are keyword based indexing techniques for RDF, path based indexing techniques for RDF [4] and suffix array based indexing techniques for RDF [5]. The problems faced by most of the existing indexing techniques concern index size and lookup time: these techniques have a large index and a slow lookup time. In this paper, we propose a technique for indexing RDF documents with a smaller index size and a faster lookup time.

The rest of the paper is organized as follows. In Section 2, existing indexing techniques for RDF documents, including keyword based, path based and suffix array based techniques, are discussed and analyzed in detail. In Section 4, the proposed indexing technique based on Lexicon & Quad is discussed in detail. The design and implementation of the proposed RDF indexing technique are presented in Section 5. The proposed technique is validated in Section 6, and the results are compared with those of existing techniques used in systems such as Sesame and Jena. In Section 7, the paper is concluded with some recommendations for future work.

2. LITERATURE REVIEW
In this section, a detailed analysis of existing indexing techniques for RDF documents is presented. A comparison of the techniques along several parameters is also given, in both descriptive and tabular form.

Many Semantic Web search engines have been developed and deployed on the web, using different indexing structures for RDF documents. Two popular Semantic Web search engines index the Semantic Web by crawling RDF documents and then offer a search interface over these documents.

SWSE (Semantic Web Search Engine) indexes not only RDF documents but also "normal" HTML Web documents and RSS feeds, and converts these to RDF [6] [7]. SWSE stores the complete RDF found in the crawling phase and offers rich queries (with expressiveness comparable to SPARQL) over this RDF data [3]. Since SWSE also stores the provenance of all statements, it can also provide the source lookup functionality that we provide, but at a cost: lookups are slower than in Sindice [18] and the index is larger.

¹ Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore
² Department of Computer Science & Engineering, UET, Lahore

© 2010 Journal of Computing


http://sites.google.com/site/journalofcomputing/
 

Similar to SWSE, Swoogle crawls and indexes the Semantic Web data found online. Again, the same differences apply: Swoogle offers richer functionality than we do, but at the cost of index size and lookup time [7].

Now, let us analyse some existing indexing techniques for RDF documents.

The query facility we offer is in the tradition of RDF stores such as Jena and Sesame. These RDF repositories store their data in a relational database and offer limited reasoning capabilities. In contrast, we focus on fast storage and retrieval only, and describe indexing techniques based on multi-dimensional access methods that are B-Tree based.

Multi-dimensional indexing methods such as R-Trees and space-filling curves are not entirely suited to our problem because we often have queries along one particular dimension.

Now, let us discuss some existing RDF indexing techniques on which different Semantic search engines are based.

2.1 Keyword based Indexing Techniques for RDF
There are many studies of keyword indexing techniques for RDF as well as for XML. The paper titled "Indexing Scheme for Keyword Search over Semantic Web Documents" describes an efficient technique for indexing RDF documents in detail.

General indexing schemes for keyword indexes use an inverted list. In that paper, the author presents an index structure for RDF/S based on the inverted list.

Figure 1 shows the structure of the keyword index. The proposed index consists of three parts. The first part is a list of keywords in RDF, where a keyword means a term in a literal value of a property. The second part is a set of direct posting files. Each direct posting file includes information about the resources and properties that directly contain the corresponding keywords of the first part. The third part is a set of indirect posting files. Each indirect posting file includes information about resources and properties that are related to the corresponding keywords indirectly, through the resources in the second part of the index.

Fig 1. Structure of Keyword Index for RDF/S
a) Keyword node structure: Keyword, Frequency, Resource Pointer
b) Structure of direct resource node: Next Resource Pointer, Property ID, Direct Resource ID, Related Resource Pointer
c) Structure of indirect resource node: Property ID, Indirect Resource ID, Related Resource Pointer
d) Entire keyword index

For managing hierarchical relationships between classes or properties in RDF Schema, the author proposes two tables titled "Class" and "Hierarchy" (Balmin et al., 2003). The Class table stores the class name and the class identifier. The Hierarchy table stores information about subclasses and the depth of the hierarchy. The author aims to support efficient keyword search over documents on the Semantic Web through this class hierarchy information and the index. Figure 2 shows the Class table and Hierarchy table for the RDF Schema in Figure 1.

Fig 2. An Example of Class table and Hierarchy table

Figure 3 displays an example of the proposed index for the RDF document in Figure 1. The RDF document describes some books, authors, and bookstores. In Figure 3, the author introduces a "Property" table that stores information about properties with keywords.

[Figure 3 panels: (a) Keyword Index, (b) Property Table]


Fig 3. Example of Keyword Index for RDF

Although this approach was efficient, its performance did not scale with index size.

2.2 Path based Indexing Techniques for RDF
A path-based indexing technique was proposed by [6]. The study by Yamamoto et al. is a basis for indexing techniques for RDF documents. In that paper, given an XML document, all possible path expressions were extracted, and suffix arrays were constructed on the path expressions and reverse path expressions, so that path expressions (and reverse path expressions) could be processed efficiently. Query processing was performed using path expressions on the compact models. That is, they achieved space-efficient indexing by giving up accuracy [9]. An indexing scheme called Index Fabric, an extension of the Patricia trie, was proposed by Cooper et al. The Patricia trie is an efficient and compact indexing scheme that can deal with large texts. Index Fabric extends it into a height-balanced indexing structure for semi-structured data.

2.3 Suffix Arrays based Indexing Techniques for RDF
Many papers have been written on suffix array based indexing techniques for RDF documents. In the paper titled "An Indexing Scheme for RDF and RDF Schema based on Suffix Arrays", the author proposes an indexing technique for RDF and RDF Schema [1]. In this technique, the author first extracts four kinds of DAGs (Directed Acyclic Graphs) from the RDF data and extracts all path expressions from the DAGs. Four kinds of suffix arrays are then generated from the path expressions. Using these indices, the author aims at efficient processing of queries on RDF data, including schematic information denoted by RDF Schema (for example, classes and/or properties).

Table 2: Time complexity comparison of different indexing techniques for RDF documents
Index/Parameter | Insert | Delete | Retrieve | Data Structure
Keyword index | Θ(n) | N/A | Θ(n log n) | Inverted list
Path based index | Θ(n1 + n2) for 2 elements | N/A | N/A |
Suffix array index | Θ(n2) | N/A | Θ(n2) |
Lexicon & Quad index | Θ(n) | Θ(n) | Θ(n) |

Now, let us take a brief look at the pros and cons of the different options for implementing an index for RDF documents, with a short comparison between them.

i) Secondary File System
A file system can be used for indexing and retrieving RDF documents, with one file per RDF resource. The benefit of a file system is scalability. About 50 to 350 bytes are required to store and index an RDF resource file, which would result in a small index; however, file systems allocate disk space in blocks, so every file occupies at least one block. For example, with an ext3 block size of 2 KB, each file occupies 2 KB even if it contains only 100 bytes. Hence a secondary file system is not a good approach for index creation.

ii) Databases
The benefit of using database tables with RDF resource Uniform Resource Identifiers (URIs) as unique keys is ease of implementation, but databases are typically more expensive in query processing; since we only use simple lookup keys, we do not use this option. Moreover, in databases all unique keys are temporarily held in main memory while the data is accessed, for efficient retrieval. For example, with 500 million unique keys and each RDF resource taking 128 to 500 bytes, we would need about 64 GB to 250 GB of main memory, which is impractical on currently available hardware.

iii) Hash table and Berkeley DB
Hash tables and advanced libraries for persistent hash tables, such as Berkeley DB, are a better solution for implementing an RDF indexing structure, but hash functions with a small probability of collisions, such as MD5 and SHA1, produce keys for OIDs (Object Identifiers) of at least 128 bits.

iv) Lexicon and Quad
The Lexicon index stores the literal mappings for the RDF graph and facilitates fast retrieval of OIDs. The Quad index is used to represent quads, find quad boundaries and store structure information. Its benefit over other schemes is a small index size and a short lookup time. Object identifiers in the Lexicon are represented by only 64 bits.
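The storage figures quoted for options i) and ii) can be checked with a few lines of arithmetic. The following is a minimal sketch only: the function names are hypothetical, gigabytes are taken as decimal (10^9 bytes), and the inputs are the example values from the text (a 100-byte resource file, a 2 KB block, 500 million keys of 128 to 500 bytes each).

```python
import math


def file_system_bytes(resource_bytes: int, block_bytes: int) -> int:
    """Disk space one file occupies when space is allocated in whole blocks."""
    return max(1, math.ceil(resource_bytes / block_bytes)) * block_bytes


def key_memory_gb(num_keys: int, bytes_per_key: int) -> float:
    """Main memory, in decimal GB, needed to hold every key at once."""
    return num_keys * bytes_per_key / 1e9


# Option i): a 100-byte resource file stored on a file system with 2 KB blocks.
occupied = file_system_bytes(100, 2048)
wasted_fraction = 1 - 100 / occupied

# Option ii): 500 million unique keys of 128 to 500 bytes each.
low_gb = key_memory_gb(500_000_000, 128)
high_gb = key_memory_gb(500_000_000, 500)

print(f"file occupies {occupied} bytes, {wasted_fraction:.0%} of the block is unused")
print(f"keys need between {low_gb:.0f} GB and {high_gb:.0f} GB of main memory")
```

Under these assumptions a 100-byte file leaves about 95% of its block unused, and the key set needs between 64 GB and 250 GB of memory, which is the order of magnitude that rules out both options in the discussion above.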


Table 5: Comparison of Existing Techniques for Indices Implementation
Technique/Parameter | Advantages | Disadvantages
File System | Scalability | Large disk space (file systems typically have a minimal block size)
Databases | Simplicity of implementation | Overhead in query processing; occupy too much disk space
Hash Table | Persistent hash tables like Berkeley DB are better solutions for index structure implementation | Large disk storage, because hash functions with a small probability of collisions, such as MD5 and SHA1, produce keys for OIDs of at least 128 bits
Lexicon & Quad | Our proposed index structure consists of two indices. Its benefit over other schemes is a small index size and a short lookup time. Object identifiers in the Lexicon are represented by only 64 bits (Papakonstantinou et al., 1995).

The World Wide Web (or the Web) is the largest portal of information in the world. Conventional web search engines like Google and Ask return many irrelevant results for user queries. Users then need to pick the desired results manually from the retrieved results because, according to one analysis, about 70% of the links retrieved for a user query are irrelevant. This situation is unacceptable, and handling it requires an understanding of information resources and their structures. This led to the concept of semantic web search engines, which aim to improve the share of relevant information retrieved. Unlike traditional search engines, which go through HTML web pages, semantic web search engines index Resource Description Framework data stored on the web using advanced indexing techniques based on path indexes [11], suffix arrays [12], the linked data approach [13], etc. The advanced indexing techniques currently used for indexing RDF documents have two major problems. The first is the index size [10] and the second is the lookup time [13]: these techniques have a large index size and a slow lookup time. We now consider two practical examples of semantic search engines that index RDF documents but have the problems mentioned above.

SWSE (Semantic Web Search Engine) and Swoogle are two popular semantic search engines deployed on the World Wide Web (SWSE; Swoogle). Both use advanced indexing techniques for indexing RDF documents. SWSE indexes Resource Description Framework resources by saving the complete RDF obtained in the indexing stage and offers rich queries over RDF documents. However, in SWSE lookups are slower and the index is larger [7]. Similarly, the index of the Swoogle semantic search engine is designed so that it indexes the Resource Description Framework resources found on the web with better functionality than SWSE, but its lookups are also slower and its index is also larger [16].

The objective of this work is to analyze the existing indexing techniques for RDF documents and to propose an improved index structure for storing and querying RDF documents that overcomes the problems of large index size and slow lookup time. To achieve this objective, a smaller index size and faster lookups are essential.

4 - PROPOSED TECHNIQUE
RDF is a W3C standard model and has emerged as a fundamental data format for semantic search engines. RDF based indexing techniques are used to index the RDF documents found by semantic search engines. Many existing indexing techniques are used to index RDF documents, but they commonly face two major problems: large index size and slow lookup time. In such a scenario, an improved, high-performance indexing technique is required to overcome these problems.

We propose a simple but efficient indexing technique to find resources in RDF documents in decentralized environments. Our technique indexes only RDF documents and the occurrences of resources, to keep the index small and lookups fast.

The following paragraphs define some terms used in this paper.

RDF Context: Context is described in different ways. The following definition is the one given by the W3C.
"If c ∈ (R ∪ B) is a context and t is an RDF triple, then the pair (c, t) is known as an RDF triple in context c".

RDF Namespaces & Documents: As mentioned earlier, RDF adopts a container package model. In the XML model, anyone can develop their own tags, which makes structural standards difficult and obstructs the generation of semantics for web documents. RDF is a good platform for different metadata schemas to follow structural standardisation [11]; hence the RDF model is an efficient platform for indexing and retrieving web documents.
A "namespace" is used in RDF to identify metadata. It is a standard prefix that identifies which tag belongs to which schema.

In this section on the new indexing technique for RDF documents, the different modules of the proposed technique are discussed, including the structure of the Quad index and the Lexicon index.

Indices Architecture: As mentioned earlier, we aim at an improved indexing technique for RDF documents with a small index size and a fast search model. In other words, we want to optimize lookup time and disk storage space. To meet this requirement, we focus on simplicity; we use an efficient but simple ranking algorithm to obtain useful results. For optimization purposes, we do not index the original documents in which resources occur but only the occurrences of resources, and our lookup route goes only from RDF resources to RDF sources and not vice versa.

Our indexing architecture consists of three components.
Index: This component parses and indexes the RDF documents at any given URL.
Lookup: This component looks up a specific URI and returns a ranked list of URLs.
Refresh: This component refreshes the index by updating the known resources.

Indices Design and Implementation: There is always a solid reason behind every specific item selected from a list of items, concluded after many experiences and experiments. Similarly, our indexing algorithm for RDF documents has specific characteristics that distinguish it from others implemented to date for storing and indexing RDF documents. We use the following algorithms for index design and lookup.

The following simple algorithm describes the basics of index design for RDF documents. As shown in Table 3, when a new source is indexed, all related URIs are extracted and added to the "Found" index to record that each URI was found in the indexed source.

Table 3: Algorithm for RDF Index design
Begin
  Define RDF_Index(RDF_source)
    RDF_resources = RDF_subject + RDF_object in RDF_source
    foreach RDF_resource in RDF_resources
      FoundEntry[RDF_resource] += RDF_source
    End // For each loop
End

The time complexity of the above algorithm is O(1) for each extracted resource. For lookup, a separate algorithm is defined, as shown in Table 4. Since the RDF resource URI is used as the key and the sources in which it was found are the values, the time complexity of the lookup algorithm is also O(1).

Table 4: Algorithm for RDF lookup time design
Begin
  Define RDF_Lookup(RDF_resource)
    RDF_sources = FoundEntry[RDF_resource]
    Return foundByRank(RDF_sources)
End

We trade space for retrieval time. We want to avoid expensive disk seeks in order to keep lookups fast while keeping the index small, which has a considerable influence on the design of our index organization. In particular, we store information redundantly in different sort orders, which allows us to answer any access pattern with a single index lookup.

We use B+-trees as the indexing structure for disk storage. B+-trees are a well understood data structure and have good properties regarding inserts and deletions (Comer, 1979). Conceptually, we have (key, value) pairs where retrieval based on the key yields the value using few disk operations.

We distinguish between core indices, which have to be present to retain all information, and optional indices, which are redundant and can be built from the core indices.

In the following, we first describe the lexicon, which maps nodes to OIDs and back; next, we define the notion of a perfect index and describe how we use such a perfect index to store quads. Then, we present an analysis of the disk space requirements of our proposed indexing structure. Finally, we describe how we combine the text index and the structure index to answer single quad queries.

4.2.3 Lexicon and Full Text Index
The lexicon maps OIDs to node values, and node values to OIDs. The oidnode index keeps (OID, node) pairs, and the nodeoid index keeps (node,
OID) pairs. OIDs are represented as 64 bit values, and node values (blank nodes, resources, or literals) are stored in their byte representation. Additionally, it is possible to keep an inverted index on string literals, which allows for the full-text searches known from traditional search engines.

One of oidnode or nodeoid is a core index, since with only OIDs and no mappings the original data would be lost. The lexicon features an inverted index on string literals called textii, which is an optional index used for keyword searches. Each literal is tokenized into words. Each word is stored as a key, with a sorted list of the OIDs in which it occurs as the value. Figure 4 gives an overview of the (key, value) pairs kept by the Lexicon. Keys are denoted using a grey background, and values are transparent.
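The three lexicon structures can be illustrated with a small in-memory sketch. This is not the authors' implementation: Python dictionaries stand in for the on-disk B+-trees, the class and method names are hypothetical, and the OID ranges (resources from 1000, literals from 2000) simply follow the numbering of the paper's running example. Blank nodes and name clashes between a resource and a literal are ignored for brevity.

```python
from collections import defaultdict


class Lexicon:
    """In-memory stand-in for the oidnode, nodeoid and textii indices."""

    def __init__(self, resource_start=1000, literal_start=2000):
        self.oidnode = {}                # OID -> node value
        self.nodeoid = {}                # node value -> OID
        self.textii = defaultdict(list)  # word -> sorted list of literal OIDs
        self._next = {"resource": resource_start, "literal": literal_start}

    def oid_for(self, node, kind="resource"):
        """Return the OID of a node, assigning the next free one if it is new."""
        if node in self.nodeoid:
            return self.nodeoid[node]
        oid = self._next[kind]
        self._next[kind] += 1
        self.nodeoid[node] = oid
        self.oidnode[oid] = node
        if kind == "literal":
            # Tokenize the literal; OIDs only grow, so each list stays sorted.
            for word in set(node.lower().split()):
                self.textii[word].append(oid)
        return oid

    def search(self, word):
        """Keyword search: literals that contain the given word."""
        return [self.oidnode[oid] for oid in self.textii.get(word.lower(), [])]


lexicon = Lexicon()
aslam = lexicon.oid_for("#aslam")
knows = lexicon.oid_for("foaf:knows")
name = lexicon.oid_for("Aslam Iqbal", kind="literal")
print(aslam, knows, name, lexicon.search("iqbal"))
```

Keeping both directions of the mapping costs extra space, but it lets a query translate its constants to OIDs and translate result OIDs back to node values with one lookup each.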
OIDNode: OID (key), Node Value (value)
NodeOID: Node Value (key), OID (value)
textii: Word (key), OID OID OID ... (value)
Figure 4: Entries for the Lexicon Data Structures

RDF Example
Figure 5 shows a small RDF graph that we use as an example throughout the paper. Example 1 shows statements in N3 format from three different files (contexts): uet.rdf, foaf.rdf, and pubdb.rdf.

uet.rdf:
<#aslam> <foaf:knows> <#akram>.
<#aslam> <foaf:knows> <#ajmal>.
<#aslam> <foaf:name> "Aslam Iqbal".
<#aslam> <rdf:type> <foaf:Person>.
<#akram> <rdf:type> <foaf:Person>.
foaf.rdf:
<#ajmal> <foaf:knows> <#aslam>.
<#ajmal> <foaf:name> "Ajmal Kamal".
<#ajmal> <rdf:type> <foaf:Person>.
pubdb.rdf:
<_:paper> <dc:description> "Advanced Database".
<_:paper> <dc:title> "Advanced DB".
<_:paper> <pub:author> "Dr. Rizwan Pasha".
<_:paper> <pub:author> "Adnan Saikho".
Example 1: Contents of the example files

Fig 5. Graph Representation for RDF Example

Index construction works as follows: given our example, the Lexicon assigns #aslam the OID 1000, foaf:knows the OID 1001, and so on, given that we start the OID range for resources at 1000. Literals such as "Aslam Iqbal" start at OID 2000. OIDs are assigned consecutively, depending on the type of the node. The text index textii can be built periodically, or maintained continuously as new data is added and old data is removed.

4.2.4 Structure Index
In the structure index, we store OIDs in a way that allows for fast retrieval of quads. In our example, the quad (<_:paper>, <dc:title>, "Advanced DB", <pubdb.rdf>) would be represented using OIDs as (1, 1011, 2003, 1010). Don't cares are represented with a '?' or using OID 0.

4.2.5 Perfect Index
Given our goal of retrieving the results for a given query with a minimal number of disk seeks, we need an index that allows us to look up any combination of s, p, o, c directly.
Definition of Perfect index: A perfect index is an index that covers all possible access patterns [15].
In our case of storing quads, there exist 2^4 = 16 possible access patterns. One access pattern, for example a query for all quads with a given predicate p, is represented as (?, p, ?, ?). If we indexed every query combination separately, we would need 16 indices. However, since some access patterns overlap, such as (s, ?, ?, ?) and (s, p, ?, ?), we can reuse some indices for different access patterns. Table 6 shows the indices needed for building a perfect quad index, and the access patterns each index covers.
Table 6: A perfect index for quads
Index | Query patterns
SPOC | (?, ?, ?, ?), (s, ?, ?, ?), (s, p, ?, ?), (s, p, o, ?), (s, p, o, c)
CP | (?, ?, ?, c), (?, p, ?, c)
OCS | (?, ?, o, ?), (?, ?, o, c), (s, ?, o, c)
POC | (?, p, ?, ?), (?, p, o, ?), (?, p, o, c)
CSP | (s, ?, ?, c), (s, p, ?, c)
OS | (s, ?, o, ?)
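That six indices are enough can be verified mechanically: an index can answer an access pattern when the bound positions of the pattern form a prefix of the index's key order. The sketch below is illustrative only. It uses the full key orders spoc, cpos, ocsp, pocs, cspo and oscp that the paper names for the structure indices, while the helper names are hypothetical. Where two indices qualify for the same pattern, the sketch returns the first match, which may differ from the assignment chosen in Table 6.

```python
from itertools import product

# Full key orders of the six structure indices.
INDICES = ["spoc", "cpos", "ocsp", "pocs", "cspo", "oscp"]


def covering_index(bound):
    """Return the first index whose key starts with exactly the bound positions."""
    for name in INDICES:
        if set(name[:len(bound)]) == set(bound):
            return name
    return None


def pattern_str(bound):
    """Render a set of bound positions in the (s, p, o, c) notation of Table 6."""
    return "(" + ", ".join(pos if pos in bound else "?" for pos in "spoc") + ")"


def all_patterns():
    """Enumerate the 2^4 = 16 access patterns as sets of bound positions."""
    for mask in product([False, True], repeat=4):
        yield frozenset(pos for pos, is_bound in zip("spoc", mask) if is_bound)


coverage = {pattern_str(p): covering_index(p) for p in all_patterns()}
for pattern, index in coverage.items():
    print(pattern, "->", index)
```

Every one of the 16 patterns is mapped to an index, so no pattern needs a full scan or a second lookup.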


In the following, we describe how to implement a perfect index efficiently using standard B+-trees.

4.2.6 Key Concatenation
We use key concatenation to create the structure index on disk. Concatenated keys consist of a sequence of OIDs. To save space, we do not keep a separate list of statements and statement IDs, but encode the whole quad in the key. For example, instead of CSP we append the information required to retain the whole quad and generate a key CSPO.

Each quad is sorted according to the different access patterns. That means we store each quad (s, p, o, c) as a key spoc in the spoc index, as a key pocs in the pocs index, and so on. Figure 6 illustrates the idea. Our example quad would be stored as (1:1011:2003:1010) in spoc, as (1011:2003:1010:1) in pocs, and so on.

SPOC: OID_S OID_P OID_O OID_C
POCS: OID_P OID_O OID_C OID_S
......
OSCP: OID_O OID_S OID_C OID_P
Fig 6. Entries for the quad data structures

We store only keys in the corresponding B+-tree and leave the value part empty. At quad insertion time, the indexer creates a key for every index and reorders the quad into the sequence the index requires. When inserting the key into the B+-tree, we maintain the order on the first part, then on the second part, and so on. In that way, we end up with a B+-tree that stores the keys in sorted order.

With the sorted index, we can now perform a range query over the partial tree that constitutes the query result. For example, suppose we want to query for all statements with "Advanced DB-paper" as the subject (remember that the blank node "Advanced DB-paper" has OID 1). The corresponding key is (1:0:0:0). To retrieve the partial tree with the query results, we can run a range query over the spoc index with lower bound (1:0:0:0) and upper bound (1:MAX:MAX:MAX).

4.2.7 Statistics
A large number of applications, such as query optimization, data mining, or ranking of query results, require statistical information about the data set. To allow these applications to quickly access basic statistical information, we can store occurrence counts directly in our index. For each key that is inserted into the structure index, we generate four additional keys: one that contains only 0 values, one that contains the first entry together with 0 values, one that contains the first and second entry, and one that contains the first three entries. In the value part of each B+-tree we store a count that is incremented once a particular key is inserted.

Using our example, consider that we have the quad (1:1011:2003:1010). We generate four other keys, (0:0:0:0), (1:0:0:0), (1:1011:0:0), and (1:1011:2003:0), and insert them into the index as well. The key (0:0:0:0) is a special key that holds the total count of keys in a particular index. Since counting and storing the occurrence counts is quite an expensive operation, we perform it in batch once the quad indices have been constructed. Determining the result size of a query is as simple as looking up the corresponding query key in the index. The value associated with that key is the number of resulting quads for the given query.

4.2.8 Index Size Analysis
In the following, we present an analysis of the storage requirements of our indexing structures. The size of the lexicon depends heavily on the characteristics of the input data. The size of the structure indices grows linearly with the number of statements stored. Since we use B+-trees for all our indices, we assume a storage utilization of 69 percent [12].

Let n be the number of distinct nodes, m the total number of statements, α the length of OIDs in bytes, and ℓ = 1/(storage utilization of the index). Then, the size of the oidnode and nodeoid index is
$(\alpha \cdot n + \sum_{i=0}^{n} \text{size of literal}_i) \cdot \ell$
For the text index textii, the inverted list of occurrences can become quite large, depending on the characteristics of the string literals.
The size of one of the structure indices spoc, cpos, ocsp, pocs, cspo, oscp is
$(\alpha \cdot 4 \cdot m) \cdot \ell$
Adding statistical information to a quad index can grow the index size considerably. If parts of the key are repeated often, the additional storage required is less. For example, in the cpos index, many statements share the same context, and therefore less additional space is required than in the spoc index, because there are usually many distinct subjects. In our experience, the statistical information adds between a factor of 1.2 and 2.8 to the index size, depending on the distribution of the data.

5 DESIGN AND IMPLEMENTATION
In this section, we describe the detailed design and implementation of our proposed system. The different components used in the design and implementation of this system are explained in detail. The architecture of our proposed system consists of the following components.

5.1. Components of Proposed System
i. HTTP server as an access interface
A web server is a computer program that delivers content, such as web pages, using the Hypertext Transfer Protocol. Different servers are used as web servers. HTTP Server is also a popular web server. The HTTP Server API enables applications
to communicate over HTTP without using Microsoft


Internet Information Server (IIS). Applications can
register to receive HTTP requests for particular
URLs, receive HTTP requests, and send HTTP
responses. The HTTP Server API includes SSL
support so that applications can exchange data
over secure HTTP connections without IIS. It is
also designed to work with I/O completion ports.
ii. Index construction module
This module generates the index using the other components.
iii. Quad Index
In this section, we show that only a restricted set of indexes is needed to cover all access patterns for the data in RDF documents.
The term Quad index is based on the notion of an access pattern. An access pattern is a quad in which each of s, p, o and c is either specified or a variable. For example, an access pattern could be a quad where only s is specified, and p, o, and c are variables. The access pattern (s:?:?:?) denotes all quads whose subject equals s, whereas the other nodes are unspecified. To compute the total number of access patterns, note that each of the four elements of the quad has two possibilities (it is either specified or a variable). Therefore the total number of access patterns is 2*2*2*2 = 16. A naive implementation of a complete index on quads would need 16 indexes, one for each access pattern. Implementing a complete index in this naive way is prohibitively expensive in terms of index construction time and storage utilization. We can cover all possible access patterns with just six indexes, using the fact that B+-trees support range and prefix queries.
iv. Lexicon
The basic architecture of our system is depicted in Figure 7. The input data format is N3, and results are returned in N-Triples format. N3, or Notation-3, is a compact and readable alternative to RDF's XML syntax that is also extended to allow greater expressiveness. It has subsets, one of which is equivalent to RDF 1.0, and one of which is RDF plus a form of RDF rules. There is also a Java API that mirrors the functionality of the HTTP interface. We defined an HTTP access interface with the following operations.
i. Insertion (HTTP PUT)
ii. Querying (HTTP POST)
iii. Retrieval (HTTP GET)
iv. Result size (HTTP HEAD)
v. Deletion (HTTP DELETE)
Fig 7. Architecture of proposed system
Both insert and delete operations are transactions. We used JDBM, a lightweight open-source library that offers B-trees and hash tables, for storing data to disk. JDBM consists of a record manager that offers caching and transactions. We chose to use an existing B-tree implementation over developing our own B-tree for similar reasons.
We now turn to the detailed implementation of our API, using Java and Perl programs. The program also uses a database as a repository. Using standard library routines, we open a connection to the repository. We then use the API to send our N3QL code to the database and process the results that are returned. When all database work is finished, we close the connection.
The first step is to install Java on the working machine. As stated earlier, before a repository can be accessed, a connection must be opened between our program (the client) and the database (the server).
To make the connection, we create an instance of a Connection object using the following statement.
    Connection con = DriverManager.getConnection("http://localhost:80/system/");
The parameter is the URL of the repository, including the protocol (http), the server (localhost), and the port number (80). Alternatively, we can open a repository on the local file system using the following statement.
    Connection con = DriverManager.getConnection("file:///tmp/system");
The connection returned in the last step, con, is an open connection, which we use below to pass N3QL statements to the database.
An active connection is required to create a Statement object. The following code snippet uses our Connection object con to create a Statement object.
    Statement stmt = con.createStatement("context");
At this point, a Statement object exists, but it does not yet have an N3 statement to pass on to the repository.
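The Quad Index argument earlier in this section (16 access patterns, six indexes, prefix queries on a sorted structure) can be illustrated with a minimal Java sketch. Several things here are assumptions and not part of the described system: the six key orders (spoc, poc, ocs, csp, cp, os) follow the design in [5], while this section names only the spoc index; a java.util.TreeSet stands in for the B+-tree; OIDs are modelled as non-negative long values, with Long.MAX_VALUE playing the role of MAX; and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

final class QuadIndexSketch {
    // Positions of a quad in bit order: s = bit 0, p = bit 1, o = bit 2, c = bit 3.
    static final String POSITIONS = "spoc";

    // Assumed key orders for the six indexes; the text names only "spoc".
    static final String[] INDEXES = {"spoc", "poc", "ocs", "csp", "cp", "os"};

    // Lexicographic order on OID arrays, standing in for the key order of a B+-tree.
    static final Comparator<long[]> KEY_ORDER = Arrays::compare;

    // Returns the first index whose leading key fields are exactly the
    // specified positions of the access pattern, or null if none qualifies.
    static String coveringIndex(int specifiedMask) {
        int specified = Integer.bitCount(specifiedMask);
        for (String index : INDEXES) {
            if (index.length() < specified) {
                continue;
            }
            int prefixMask = 0;
            for (int i = 0; i < specified; i++) {
                prefixMask |= 1 << POSITIONS.indexOf(index.charAt(i));
            }
            if (prefixMask == specifiedMask) {
                return index;
            }
        }
        return null;
    }

    // Renders a mask as an access pattern such as (s:?:?:?).
    static String describe(int specifiedMask) {
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < POSITIONS.length(); i++) {
            if (i > 0) {
                sb.append(':');
            }
            boolean specified = (specifiedMask & (1 << i)) != 0;
            sb.append(specified ? POSITIONS.charAt(i) : '?');
        }
        return sb.append(')').toString();
    }

    // Creates an empty sorted set of four-element keys.
    static TreeSet<long[]> newIndex() {
        return new TreeSet<>(KEY_ORDER);
    }

    // Answers (s:?:?:?) with one range scan from (s:0:0:0) to (s:MAX:MAX:MAX).
    static List<long[]> scanBySubject(TreeSet<long[]> spoc, long s) {
        long[] lower = {s, 0L, 0L, 0L};
        long[] upper = {s, Long.MAX_VALUE, Long.MAX_VALUE, Long.MAX_VALUE};
        return new ArrayList<>(spoc.subSet(lower, true, upper, true));
    }

    public static void main(String[] args) {
        // List every access pattern together with an index that can serve it.
        for (int mask = 0; mask < 16; mask++) {
            System.out.println(describe(mask) + " -> " + coveringIndex(mask));
        }
        TreeSet<long[]> spoc = newIndex();
        spoc.add(new long[] {1, 2, 3, 9});
        spoc.add(new long[] {1, 4, 5, 9});
        spoc.add(new long[] {2, 2, 3, 9});
        System.out.println(scanBySubject(spoc, 1).size() + " quads with subject OID 1");
    }
}
```

Under these assumptions every one of the 16 patterns finds an index whose key prefix matches its specified positions, which is why a range scan is enough to answer it. In the sample data, the scan for subject OID 1 returns the two quads that share that subject.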
5.2. Insertion and Deletion of RDF data from
RDF Documents
Executing N3 statements through the API varies depending on the "intention" of the N3 statement. Insert operations are executed using the method executeInsert, and delete operations using the method executeDelete. Notice that these commands change the state of the repository. Insert operations take as a parameter a string containing N3 (Turtle) statements, or alternatively a java.io.Reader object for streaming inserts. Delete operations take as a parameter an N3QL query that returns triples, which are then removed from the repository.
Since the N3 statements will not quite fit on one line on the page, we have split them into multiple strings concatenated by a plus sign (+) so that the code will compile. Note that we are reusing the same Statement object rather than creating a new one.
The executeInsert method accepts both String and Reader objects as parameters. Pass a Reader if you have a large dataset that should be transferred in a streaming fashion.
5.3. Executing Queries
In contrast to the statements of the previous section, a query is expected to return a set of tuples as its result and not to change the state of the repository. There is a corresponding method called executeQuery, which returns its results as a ResultSet object:
    Triple triple;
    ResultSet rs = stmt.executeQuery("@prefix ql: <http://www.w3.org/2004/12/ql#> . \n" +
        "@prefix systems: <http://sw.deri.org/2004/06/systems#> . \n <> ql:select { ?s ?p ?o . \n" +
        "}; ql:where { { ?s ?p ?o . } systems:context ?c . } .");
    while (rs.next()) {
        triple = rs.getTriple();
        System.out.println("Triple: " + triple);
    }
The RDF triples (or RDF lists, depending on the ql:select clause) resulting from the query are contained in the variable rs, which is an instance of ResultSet. A result set is of little use unless we can access each row and the attributes in each row. The ResultSet provides a cursor that can be used to access each row in turn. The cursor is initially positioned just before the first row. Each invocation of the method next moves it to the next row and returns true if one exists, or returns false if there is no remaining row.

6 RESULTS, ANALYSIS AND DISCUSSION
We considered the following implemented RDF index structures for the evaluation of our proposed system.
i. Sesame
ii. Redland
iii. Jena2
[8] shows that Sesame generally outperforms Jena, therefore we did not include it in the benchmarking. We tried to install Kowari, but failed to get a running version. In one of our installations, inserting a 1 MB N-Triples file via the Jena interface resulted in 30 minutes of processing time before the process threw an exception because of a full disk (there was 200 MB of disk space available before starting the Kowari server). On another installation, we got a core dump of the JVM when running Kowari. Therefore we concentrated our efforts on Sesame 1.1RC2 and Redland 0.9.18.
The comparison is somewhat difficult since Sesame doesn't support context. We refrained from using reification in Sesame to keep context information and just stored triples. Redland was run with its context mechanism enabled. One thing to mention is that the random generation of the benchmark files leads to duplicate triples in different contexts. In a system that doesn't store contexts, multiple occurrences of a triple in different contexts are stored as only one triple. In systems that store context information, the same triple can be stored multiple times, and therefore the number of retrieved results can be much higher there.
6.1 Test Cases of Index Size
To test index size and construction, we loaded the files in N-Triples format into the repository. The index creation time for our system covers the quad index excluding statistical information, and the Lexicon without the inverted index, to be comparable to the other repositories, which do not construct these indices either. The following charts show the results of our test cases for Insert and Delete operations for Sesame, Jena2, Redland and our proposed system.
Fig 8. Insert Operations Graph for different Systems
To be able to discuss the results, let's first briefly introduce the indexing methods of the various data stores. Figure 8 shows the performance measurements of index construction time for the Insert operation. Redland stores three indices, based on hash tables. The po2s index maps a key on p and
o to value s, the so2p index maps s and o to p, and the sp2o index maps s and p to o. The native store of Sesame has recently been included in the Sesame distribution. Internally, resources and literals are mapped to OIDs, and an index on SPO only is kept. In the MySQL store, one large table is kept containing all triples. Our system keeps the perfect index on quads without statistics, as well as the nodeoid and oidnode indices.
Although we keep full indices and have a sophisticated index structure, index construction times for our system are comparable with those of the other systems. In a triple store, triples that occur in multiple files/contexts are stored just once. In a quad store, those triples are stored as many times as they occur in different contexts. Please note that we keep 8-byte OIDs instead of Sesame's 4-byte OIDs, which roughly doubles our index size. The following table compares the index size of the different indexing structures and our proposed system.
Table 7: Comparison of Index Size
Database System        Index size (Bytes)
Redland71              2,164,019,200
Sesame/Native65        39,997,992
Sesame/MySQL59         340,381,636
New Proposed System    1,090,002,944

6.2 Test Cases of Lookup Time
Since the queries associated with the Lehigh benchmark take reasoning into account, we created four basic queries that test different access patterns and have different characteristics.
Table 8: Quad queries for our System
Query | Description
(?, <rdf:type>, <univ:UndergraduateStudent>, ?) | "Get all URIs of type univ:UndergraduateStudent". The query returns a large number of results. We chose the query to test how fast a repository can stream results.
(?, ?, "UndergraduateStudent0", ?) | "Get the quads with the object 'UndergraduateStudent0'". We expect this to be a very common query that resembles a keyword search in today's search engines.
(<http://www.University965.edu>, ?, ?, ?) | "Get all quads with subject http://www.University965.edu". A query that displays all information of a given subject.
(?, <univ:worksFor>, ?, ?) | "Get all quads with predicate univ:worksFor". A query that displays all quads where a given predicate occurs.

The following graph shows the performance measurements for retrieve operations for the different systems, including Sesame, Redland and our proposed system.

Each query was executed against the repository after a random 300 MB file was copied to hard disk to flush buffers. We issued each query ten times, but only included the first result here since we wanted to test index lookup time and not the cache manager of the persistence layer.

Table 9: Performance results comparison for Quad queries (times in minutes:seconds)
Query   Redland   Sesame/MySQL   Sesame/Native   Proposed System
1       0:10.48   0:18.87        1:05.16         0:18.41
2       0:44.14   0:00.73        0:00.55         0:00.49
3       0:44.15   0:00.46        0:00.47         0:00.32
4       3:04.21   0:03.42        0:01.95         0:00.47

The results obtained in the query tests reflect the internal index structures of the various repositories. In query 1, Redland is very fast since the query is only an index lookup in the po2s index. Sesame/MySQL performs quite well here, probably due to extensive optimizations in MySQL. Sesame/Native needs to perform an index scan over all subjects and is therefore slow in returning results. Our system does just an index lookup and streams back the results. The result size for query 1 is around 160,000 triples. Queries 2 to 4 return smaller result sets, usually only a few triples/quads. Here as well, the performance results reflect the index organization of the store. Since our system keeps perfect indices, all quad queries can be mapped to simple index lookup operations.
Our system has some overhead for resolving the dependencies and order in the different indices, as shown by the first query. However, as soon as multiple indices are involved in the other stores, our system performs better than the other systems. For query 4, our system was roughly 400x faster than Redland, and still 4 to 7x faster than the available Sesame implementations.

7 CONCLUSIONS AND FUTURE WORK
We have presented a simple indexing technique for RDF documents. In comparison with many other indexing techniques that break a query down into pieces and then join the results, our technique has the advantage of querying quads as a whole. Using quads, we are able to track the provenance of information in a scalable way. We made first steps in integrating text and structural indices and queries by using OIDs and performing keyword search over the lexicon. In our indexing scheme, there's room
for considerable improvement in index size. Our index is small and scalable and allows for fast lookups. In contrast to other approaches, our scope is purposely limited to keep the index simple and small. Future work and discussion will be focused on improving performance and concurrency and on evaluating and discussing the indexing ranking approach.

REFERENCES
1. Brickley, D., and Guha, R. V., (2002), "RDF Vocabulary Description Language 1.0: RDF Schema". Retrieved from http://www.w3.org/tr/2002/wd-rdf-schema-20020430
2. Prudhommeaux, E., and Seaborne, A., (2005), "SPARQL Query Language for RDF". Retrieved from http://www.w3.org/TR/rdf-sparql-query
3. Agrawal, R., and Jagadish, H. V., (1994), "Algorithms for Searching Massive Graphs," IEEE TKDE 6 (1994) 225–238.
4. Akiyoshi, M., Toshiyuki, A., Masatoshi, Y., and Shunsuke, U., (2003), "An Indexing Scheme for RDF and RDF Schema based on Suffix Arrays," SWDB 2003: 151-168.
5. Harth, A., and Decker, S., (2005), "Optimized Index Structures for Querying RDF from the Web," Third Latin American Web Congress (LA-Web 2005), Buenos Aires, Argentina. IEEE Computer Society 2005, ISBN 0-7695-2471-0, pp. 71-80.
6. Balmin, A., Hristidis, V., Koudas, N., Papakonstantinou, Y., Srivastava, D., and Wang, T., (2003), "A System for Keyword Proximity Search on XML Databases," Proceedings of 29th VLDB Conference. (2003) 1069–1072.
7. Oren, E., and Tummarello, G., (2007), "A lookup index for semantic web resources," Proceedings of the ESWC Workshop on Scripting for the Semantic Web, pp. 71-78, Jun. 2007.
8. Guan, T., and Wong, K., (1999), "KPS: a Web Information Mining Algorithm," Computer Networks 31(11-16): 1495-1507.
9. Hayes, J., and Gutierrez, C., (November 2004), "Bipartite Graphs as Intermediate Model for RDF," Proceedings of the 3rd ISWC Conference. Springer-Verlag 47-61.
10. Horrocks, I., Sattler, U., and Tobies, S., (2000), "Reasoning with individuals for the description logic SHIQ," In David MacAllester, editor, Proceedings of the 17th International Conference on Automated Deduction (CADE-17), Lecture Notes in Computer Science, Germany.
11. McHugh, J., Abiteboul, S., Goldman, R., Quass, D., and Widom, J., (1997), "Lore: A Database Management System for Semistructured Data," Vol. 26, No. 3, pp. 54-66. Retrieved from http://www-db.stanford.edu/~melnik/rdf/db.html
12. Yamamoto, N., Tatebe, O., and Sekiguchi, S., (2004), "Parallel and Distributed Astronomical Data Analysis on Grid Datafarm," Proceedings of 5th IEEE/ACM International Workshop on Grid Computing, pp. 461-466, 2004.
13. Papakonstantinou, Y., Garcia-Molina, H., and Widom, J., (1995), "Object Exchange across Heterogeneous Information Source," Proceedings of the 11th ICDE, Taipei, Taiwan, IEEE (1995) 251–260.
14. Melnik, S., Raghavan, S., Yang, B., and Garcia-Molina, H., (2001), "Building a Distributed Full-Text Index for the Web," Proceedings of the tenth
15. Bray, T., Paoli, J., and Sperberg-McQueen, C. M., (February 10, 1998), "Extensible Markup Language (XML) 1.0," W3C Recommendation. Retrieved from http://www.w3.org/TR/1998/REC-xml-19980210
16. Rajasekar, A., Wan, M., and Moore, R., (July, 2002), "MySRB and SRB - Components of a Data Grid," The 11th International Symposium on High Performance Distributed Computing (HPDC-11), Edinburgh, Scotland.
