Index Terms: RDF documents, RDF-based indexing techniques, index size, lookup time.

1 Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore
2 Department of Computer Science & Engineering, UET, Lahore

1. INTRODUCTION

Web ontologies are represented through logic-based technologies such as RDF, RDF-S and OWL [1]. RDF represents ontologies as triples, which form a special kind of directed acyclic graph; in essence, an RDF document is a set of statements describing things [2]. Indexing is an important process in an information retrieval system. Basically, indexing is performed by assigning each document keywords or descriptive terms that represent it. The assigned terms must reflect the content of the document to allow effective keyword searching. In the past, indexing was done manually by trained persons familiar with the topics of the texts. Today, with the increasing availability of electronic texts online, manual indexing is obviously too slow and, needless to say, too expensive. Automatic text indexing, which is much faster and less error-prone, has become commonplace [3].

Different techniques are in use today for indexing RDF documents. The most common and popular are keyword-based indexing, path-based indexing [4], and suffix-array-based indexing [5]. The problems faced by most existing indexing techniques are index size and lookup time: these techniques have large indices and slow lookups. In this paper, we propose a technique for indexing RDF documents with smaller index size and faster lookup time.

The rest of the paper is organized as follows. In Section 2, existing indexing techniques for RDF documents, including keyword-based, path-based and suffix-array-based techniques, are discussed and analyzed in detail. In Section 4, the new proposed indexing technique based on Lexicon & Quad is discussed in detail. The design and implementation of the proposed RDF indexing technique are presented in Section 5. The proposed technique is validated in Section 6, and its results are compared with those of existing techniques used in systems such as Sesame and Jena. In Section 7, the paper is concluded with some future recommendations.

2. LITERATURE REVIEW

In this section, a detailed analysis of existing indexing techniques for RDF documents is given, together with a comparison of the techniques against a set of comparison parameters, in both descriptive and tabular form.

Nowadays, many Semantic Web search engines have been developed and deployed over the web, using different indexing structures for RDF documents. Two very popular Semantic Web search engines index the Semantic Web by crawling RDF documents and then offer a search interface over these documents.

SWSE (Semantic Web Search Engine) indexes not only RDF documents but also "normal" HTML web documents and RSS feeds, and converts these to RDF [6] [7]. SWSE stores the complete RDF found in the crawling phase and offers rich queries (expressiveness comparable to SPARQL) over this RDF data [3]. Since SWSE also stores the provenance of all statements, it can also provide the source lookup functionality that we provide, but at a cost: lookups are slower than in Sindice [18] and the index is larger.
The World Wide Web is the largest portal of information in the world. Conventional web search engines such as Google and Ask return many irrelevant results for user queries; according to one analysis, about 70% of the links retrieved for a user query are irrelevant, so users must sift out the desired results manually. This situation is unacceptable, and handling it required an understanding of information resources and their structures. It introduced the concept of semantic web search engines, intended to improve the retrieval percentage of relevant information. Unlike traditional search engines, which go through HTML web pages, semantic web search engines index Resource Description Framework (RDF) data stored on the web using advanced indexing techniques based on path indices [11], suffix arrays [12], the linked-data approach [13], and so on. These advanced indexing techniques suffer from two major problems: the first is index size [10] and the second is lookup time [13]; they have large indices and slow lookups. We now consider, as practical examples, two semantic search engines that index RDF documents but exhibit the problems mentioned above.

SWSE (Semantic Web Search Engine) and Swoogle are two popular semantic search engines deployed on the World Wide Web; both use advanced indexing techniques for RDF documents. SWSE indexes RDF resources by saving the complete RDF obtained in the indexing stage and offers rich queries over RDF documents, but its lookups are slower and its index is larger [7]. Similarly, the Swoogle semantic search engine's index design indexes the RDF resources found on the web with better functionality than SWSE, but its lookup time is also slow and its index size also large [16].

The objective of this work is to analyze the existing indexing techniques for RDF documents and to propose an improved index structure for storing and querying RDF documents that overcomes the problems of large index size and slow lookup time.

Table 5: Comparison of Existing Techniques for Indices Implementation

Technique / Parameter | Advantages | Disadvantages
File System | Scalability | Large disk space (file systems typically have a minimal block size)
Databases | Simplicity of implementation | Overhead in query processing; occupies too much disk space
Hash Table | Persistent hash tables such as Berkeley DB are better solutions for index structure implementation | Large disk storage, because hash functions with a small probability of collisions, such as MD5 and SHA1, produce keys for OIDs of at least 128 bits
Lexicon & Quad | Our proposed index structure consists of two indices; its benefit over other schemes is small index size and low lookup time. Object identifiers in the Lexicon are represented by only 64 bits (Papakonstantinou et al., 1995) | -

4. PROPOSED TECHNIQUE

RDF is a W3C standard model and has emerged as a radical data format for semantic search engines. RDF-based indexing techniques are used to index the RDF documents found by semantic search engines. Many existing indexing techniques are used to index RDF documents, but they commonly face the two major problems of large index size and slow lookup time. In this scenario, a high-performance, improved indexing technique is required to overcome the problems mentioned above.

We propose a simple but efficient indexing technique for finding resources in RDF documents in decentralized surroundings. Our technique indexes only RDF documents and occurrences of resources, to keep the index size small and the lookup time fast.

In the following, definitions of some terms used in this paper are given.

RDF Context: Context is described in different ways. The following definition follows the W3C: "If context c ∈ (R ∪ B) and t is an RDF triple, then the pair (c, t) is known as an RDF triple in context c."

RDF Namespaces & Documents: As mentioned earlier, RDF adopts a container package model. In the XML model, anyone can develop their own tags, which hinders structural standards and obstructs the generation of web document semantics. RDF is a good platform for different metadata schemas to follow structural standardisation [11]; hence the RDF model is an efficient platform for indexing and retrieving web documents. A "namespace" is used in RDF for the recognition of metadata; it is a standard prefix identifying which tag belongs to which schema.
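The context definition above pairs a context c with a triple t, and the Lexicon & Quad scheme assigns each node a compact object identifier. A minimal sketch of these two ideas, with small integers standing in for the 64-bit OIDs and all names ours rather than the paper's implementation:

```python
# Sketch: a lexicon assigning integer OIDs to RDF nodes, and a quad
# built as the pair (c, t) from the context definition above.
# Names and data are illustrative, not from the paper's implementation.

class Lexicon:
    """Maps node values to object identifiers (OIDs) and back."""
    def __init__(self):
        self.node_to_oid = {}
        self.oid_to_node = {}

    def oid(self, node):
        # Assign the next OID on first sight; OIDs fit easily in 64 bits.
        if node not in self.node_to_oid:
            new_oid = len(self.node_to_oid) + 1
            self.node_to_oid[node] = new_oid
            self.oid_to_node[new_oid] = node
        return self.node_to_oid[node]

    def node(self, oid):
        return self.oid_to_node[oid]

lex = Lexicon()
triple = ("ex:paper", "dc:title", "Advanced DB-paper")  # t = (s, p, o)
context = "http://example.org/source.rdf"               # c
# The pair (c, t) is an RDF triple in context c, i.e. a quad of OIDs:
quad = tuple(lex.oid(term) for term in triple) + (lex.oid(context),)
```

Storing OIDs instead of full node values is what keeps each index entry at a fixed, small width.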
© 2010 Journal of Computing
http://sites.google.com/site/journalofcomputing/
JOURNAL OF COMPUTING, VOLUME 2, ISSUE 7, JULY 2010, ISSN 2151-9617
In this section, the different modules of the new proposed indexing technique for RDF documents are discussed, including the structure of the Quad index and the Lexicon index.

Indices Architecture: As mentioned earlier, we aim at an improved indexing technique for RDF documents with a small index size and a fast searching model; in other words, we want to optimize lookup time and disk storage space. For this requirement we focus on simplicity, and we use an efficient but simple ranking algorithm to get useful results. For optimization purposes, we do not index the original documents in which resources appear, but only the occurrences of resources; our lookup route runs only from RDF resources to RDF sources, and not vice versa.

Our indexing architecture consists of three components.
Index: parses and indexes the RDF documents at any given URL.
Lookup: finds a specific URI and returns a ranked list of URLs.
Refresh: revives the index by updating the known resources.

Indices Design and Implementation: There is always a solid reason behind every specific item selected from a list of items, a reason reached after much experience and experimentation. Likewise, our indexing algorithm for RDF documents has specific characteristics that distinguish it from others implemented to date for storing and indexing RDF documents. We use the following algorithms for RDF index design and lookup.

The following simple algorithm describes the basics of index design for RDF documents. As shown in Table 3, when a new source is indexed, all related URIs are extracted and added to the "Found" index to record that each URI was found in the indexed source.

Table 3: Algorithm for RDF index design

Begin
  Define RDF_Index(RDF_source)
    RDF_resources = RDF_subjects + RDF_objects in RDF_source
    foreach RDF_resource in RDF_resources
      FoundEntry[RDF_resource] += RDF_source
    End // foreach loop
End

Each insertion in the above algorithm takes O(1) time, so indexing a source is linear in the number of resources it contains. For lookup, a separate algorithm is defined, as shown in Table 4. Since the RDF resource URI is used as the key and the found sources are stored as values, the time complexity of the lookup algorithm is O(1).

Table 4: Algorithm for RDF lookup

Begin
  Define RDF_Lookup(RDF_resource)
    RDF_sources = FoundEntry[RDF_resource]
    Return foundbyRank(RDF_sources)
End

We trade space for retrieval time. We want to avoid expensive disk seeks while keeping the index small and lookups fast, and this has a considerable influence on the design of our index organization. In particular, we store information redundantly in different sort orders, which allows us to answer any access pattern with a single index lookup.

We use B+-trees as the indexing structure for disk storage. B+-trees are a well-understood data structure with good properties regarding inserts and deletions (Comer, 1979). Conceptually, we have (key, value) pairs, where retrieval based on the key yields the value using few disk operations.

We distinguish between core indices, which have to be present in order to retain all information, and optional indices, which are redundant and can be built from the core indices.

In the following, we first describe the lexicon, which maps nodes to OIDs and back; next, we define the notion of a perfect index and describe how we use such a perfect index to store quads. Then, we present an analysis of the disk space requirements of our proposed indexing structure. Finally, we describe how we combine the text index and the structure index to answer single quad queries.

4.2.3 Lexicon and Full Text Index

The lexicon maps OIDs to node values, and node values to OIDs. The oidnode index keeps (OID, node) pairs, and the nodeoid index keeps (node, OID) pairs.
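The "Found" index of Tables 3 and 4 above can be sketched as a hash map from resource URIs to the sources they occur in. This is our own minimal rendering of the pseudocode; the ranking step (foundbyRank) is simplified to a plain sort:

```python
# Sketch of the "Found" index from Tables 3 and 4: each RDF resource
# (subject or object) maps to the list of sources it was found in.
# Names mirror the pseudocode; ranking is simplified to sorting.

found_entry = {}  # resource URI -> list of source URLs

def rdf_index(source, subjects, objects):
    # Table 3: record every subject and object URI found in the source.
    for resource in subjects + objects:
        found_entry.setdefault(resource, []).append(source)

def rdf_lookup(resource):
    # Table 4: the resource URI is the key, found sources are the values.
    # A real implementation would rank them (foundbyRank); we just sort.
    return sorted(found_entry.get(resource, []))

rdf_index("http://a.example/doc1.rdf", ["ex:alice"], ["ex:bob"])
rdf_index("http://b.example/doc2.rdf", ["ex:bob"], ["ex:carol"])
```

Because both operations are single hash-table accesses per resource, indexing cost is linear in the resources of a source and lookup is constant time, matching the complexity discussion above.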
In the following, we describe how to implement a perfect index efficiently using standard B+-trees.

4.2.6 Key Concatenation

We use key concatenation to create the structure index on disk. Concatenated keys consist of a sequence of OIDs. To save space, we do not keep a separate list of statements and statement IDs, but encode the whole quad into the key. For example, instead of CSP we append the required information to retain the whole quad and generate a key CSPO.

Each quad is sorted according to the different access patterns. That is, we store each quad (s, p, o, c) as a key spoc in the spoc index, as a key pocs in the pocs index, and so on. Figure 6 illustrates the idea. Our example quad would be stored as (1:1011:2003:1010) in spoc, as (1011:2003:1010:1) in pocs, and so on.

Fig 6. Entries for the quad data structures: each index (spoc, pocs, ..., oscp) stores the quad's four OIDs concatenated in its own order.

We just store keys in the corresponding B+-tree and leave the value part empty. At quad insertion time, the indexer creates a key for every index and reorders the quad into the sequence the index requires. When inserting the key into the B-tree, we maintain the order on the first part of the key, then on the second part, and so on. In that way, we end up with a B-tree that stores the keys in sorted order.

With the sorted index, we can now perform a range query over the partial tree that constitutes the query result. For example, suppose we want to query for all statements with "Advanced DB-paper" as subject (recall that the blank node "Advanced DB-paper" has OID 1). The corresponding key is (1:0:0:0). To retrieve the partial tree with the query results, we can range-query over the spoc index with lower bound (1:0:0:0) and upper bound (1:MAX:MAX:MAX).

4.2.7 Statistics

A large number of applications, such as query optimization, data mining, or ranking of query results, require statistical information about the data set. To allow these applications to quickly access basic statistics, we can store occurrence counts directly in our index. For each key that is inserted into the structure index, we generate four additional keys: one that contains only 0 values, one that contains the first entry together with 0 values, one that contains the first and second entries, and one that contains the first three entries. In the value part of each B-tree we store a count that is incremented each time a particular key is inserted.

Using our example, consider the quad (1:1011:2003:1010). We generate four other keys, (0:0:0:0), (1:0:0:0), (1:1011:0:0), and (1:1011:2003:0), and insert them into the index as well. The key (0:0:0:0) is a special key that holds the total count of keys in a particular index. Since counting and storing the occurrence counts is quite an expensive operation, we perform it in a batch manner once the quad indices have been constructed. Determining the result size of a query is then as simple as looking up the corresponding query key in the index: the value associated with that key is the number of quads the query returns.

4.2.8 Index Size Analysis

In the following, we present an analysis of the storage requirements of our indexing structures. The size of the lexicon depends highly on the characteristics of the input data, while the size of the structure indices grows linearly with the number of statements stored. Since we use B+-trees for all our indices, we assume a storage utilization of 69 percent [12].

Let n be the number of distinct nodes, m the total number of statements, ℓ the length of OIDs in bytes, and α the reciprocal of the storage utilization of the index. Then the combined size of the oidnode and nodeoid indices is

  (α · n + Σ_{i=1..n} size_of_literal_i) · ℓ

For the text index, the inverted list of occurrences can become quite large, depending on the characteristics of the string literals.

The size of one of the structure indices spoc, cpos, ocsp, pocs, cspo, oscp is

  α · 4 · m · ℓ

Adding statistical information to a quad index can grow the index size considerably. When parts of the key are repeated often, the additional storage required is smaller. For example, in the cpos index many statements share the same context, so less additional space is required than in the spoc index, where there are usually many distinct subjects. In our experience, the statistical information adds a factor of between 1.2 and 2.8 to the index size, depending on the distribution of the data.

5. DESIGN AND IMPLEMENTATION

In this section, we describe the detailed design and implementation of our proposed system; the different components used in its design and implementation are explained in detail. The architecture of our proposed system consists of the following components.

5.1. Components of the Proposed System

i. HTTP server as an access interface
A web server is a computer program that delivers content, such as web pages, using the Hypertext Transfer Protocol. Various servers are used as web servers; HTTP Server is one popular web server. The HTTP Server API enables applications
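The key-concatenation scheme of Section 4.2.6 can be sketched with sorted in-memory lists standing in for the on-disk B+-trees. This is our own simplification, not the paper's implementation; it shows how each permuted index turns one access pattern into a prefix range scan:

```python
import bisect

# Sketch of permuted quad indices: each index stores the quad's OIDs
# concatenated in a different order and kept sorted, so every access
# pattern becomes a range (prefix) scan, as in Section 4.2.6.

ORDERS = {"spoc": (0, 1, 2, 3), "pocs": (1, 2, 3, 0), "oscp": (2, 0, 3, 1)}
indices = {name: [] for name in ORDERS}

def insert_quad(s, p, o, c):
    quad = (s, p, o, c)
    for name, order in ORDERS.items():
        key = tuple(quad[i] for i in order)
        bisect.insort(indices[name], key)  # keep keys in sorted order

def range_lookup(name, prefix):
    # Range query between (prefix, 0, ..., 0) and (prefix, MAX, ..., MAX):
    # all keys starting with `prefix`, e.g. (1,) in spoc returns every
    # quad whose subject has OID 1.
    keys = indices[name]
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_right(keys, prefix + (float("inf"),) * (4 - len(prefix)))
    return keys[lo:hi]

insert_quad(1, 1011, 2003, 1010)   # the example quad from the text
insert_quad(1, 1012, 2004, 1010)
insert_quad(2, 1011, 2003, 1010)
```

A lookup such as range_lookup("spoc", (1,)) corresponds to the bounded scan from (1:0:0:0) to (1:MAX:MAX:MAX) described above; choosing the index whose order starts with the bound components is what makes a single lookup suffice for any pattern.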
Executing N3 statements through the API varies depending on the "intention" of the N3 statement. Insert operations are executed using the method executeInsert, and delete operations using the method executeDelete. Notice that these commands change the state of the repository. Insert operations take as parameter a string containing N3 (Turtle) statements, or alternatively a java.io.Reader object for streaming inserts. Delete operations take as parameter an N3QL query that returns triples, which are then removed from the repository.

Since the N3 statements will not quite fit on one line on the page, we have split them into multiple strings concatenated by a plus sign (+) so that the example will compile. Note that we are reusing the same Statement object rather than creating a new one.

The executeInsert method accepts both String and Reader objects as parameters. Pass a Reader as the parameter if you have a large dataset that should be transferred in a streaming fashion.

5.3. Executing Queries

As opposed to the statements of the previous section, a query is expected to return a set of tuples as its result and not to change the state of the repository. Not surprisingly, there is a corresponding method called executeQuery, which returns its results as a ResultSet object:

  Triple triple;
  ResultSet rs = stmt.executeQuery(
      "@prefix ql: <http://www.w3.org/2004/12/ql#> . \n" +
      "@prefix systems: <http://sw.deri.org/2004/06/systems#> . \n" +
      "<> ql:select { ?s ?p ?o . \n" +
      "}; ql:where { { ?s ?p ?o . } systems:context ?c . } .");
  while (rs.next()) {
      triple = rs.getTriple();
      System.out.println("Triple: " + triple);
  }

The RDF triples (or RDF lists, depending on the ql:select clause) resulting from the query are contained in the variable rs, an instance of ResultSet. A set is of little use to us unless we can access each row and the attributes in each row. The ResultSet provides a cursor, which can be used to access each row in turn. The cursor is initially set just before the first row; each invocation of the method next moves it to the next row if one exists and returns true, or returns false if no row remains.

… it in the benchmarking. We tried to install Kowari but failed to get a running version. In one of our installations, inserting a 1 MB N-Triples file via the Jena interface resulted in 30 minutes of processing time before the process threw an exception because of a full disk (there was 200 MB of disk space available before starting the Kowari server). On another installation, we got a core dump of the JVM when running Kowari. Therefore we concentrated our efforts on Sesame 1.1RC2 and Redland 0.9.18.

The comparison is somewhat difficult, since Sesame doesn't support context. We refrained from using reification in Sesame to keep context information and just stored triples; Redland was run with its context mechanism enabled. One thing to mention is that the random generation of the benchmark files leads to duplicate triples in different contexts. In a system that doesn't store contexts, multiple occurrences of a triple in different contexts are stored as only one triple. In systems that store context information, the same triple can be stored multiple times, and therefore the number of retrieved results can be much higher.

6.1 Test Cases of Index Size

For the testing of index size and its construction, we loaded the files from N-Triples format into the repository. The index creation time for our system covers the quad index excluding statistical information, and the Lexicon without the inverted index, so as to be comparable to the other repositories, which don't construct these indices either. The following chart-based results show our test cases for insert and delete operations for Sesame, Jena2, Redland and our proposed system.

Fig 8. Insert Operations Graph for different Systems
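The point made above about duplicate triples in different contexts can be illustrated directly. A toy example (ours, not the benchmark data): the same triple asserted in two contexts collapses to one entry in a triple store but stays as two entries in a context-aware (quad) store, which is why result counts differ between the systems compared:

```python
# Illustration: the same triple asserted in two different contexts.
# A triple store collapses the duplicates; a quad (context-aware)
# store keeps both, so it can return more results for the same data.

data = [
    ("ex:s", "ex:p", "ex:o", "http://ctx.example/a"),
    ("ex:s", "ex:p", "ex:o", "http://ctx.example/b"),  # same triple, new context
    ("ex:s2", "ex:p", "ex:o", "http://ctx.example/a"),
]

triples = {(s, p, o) for (s, p, o, c) in data}   # context dropped
quads = {(s, p, o, c) for (s, p, o, c) in data}  # context kept

print(len(triples))  # 2: duplicates across contexts merge
print(len(quads))    # 3: each (triple, context) pair survives
```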
… for considerable improvement in index size. Our index is small and scalable and allows for fast lookups. In contrast to other approaches, our scope is purposely limited to keep the index simple and small. Future work will focus on improving performance and concurrency, and on evaluating and discussing the index ranking approach.

REFERENCES

1. Brickley, D., and Guha, R. V. (2002), "RDF Vocabulary Description Language 1.0: RDF Schema". Retrieved from http://www.w3.org/tr/2002/wd-rdf-schema-20020430
2. Prudhommeaux, E., and Seaborne, A. (2005), "SPARQL Query Language for RDF". Retrieved from http://www.w3.org/TR/rdf-sparql-query
3. Agrawal, R., and Jagadish, H. V. (1994), "Algorithms for Searching Massive Graphs," IEEE TKDE 6 (1994) 225-238.
4. Akiyoshi, M., Toshiyuki, A., Masatoshi, Y., and Shunsuke, U., "An Indexing Scheme for RDF and RDF Schema based on Suffix Arrays," SWDB 2003: 151-168.
5. Harth, A., and Decker, S., "Optimized Index Structures for Querying RDF from the Web," Third Latin American Web Congress (LA-Web 2005), Buenos Aires, Argentina. IEEE Computer Society 2005, ISBN 0-7695-2471-0, pp. 71-80.
6. Balmin, A., Hristidis, V., Koudas, N., Papakonstantinou, Y., Srivastava, D., and Wang, T. (2003), "A System for Keyword Proximity Search on XML Databases," Proceedings of the 29th VLDB Conference. (2003) 1069-1072.
7. Oren, E., and Tummarello, G., "A Lookup Index for Semantic Web Resources," Proceedings of the ESWC Workshop on Scripting for the Semantic Web, pp. 71-78, June 2007.
8. Guan, T., and Wong, K., "KPS: a Web Information Mining Algorithm," Computer Networks 31(11-16): 1495-1507 (1999).
9. Hayes, J., and Gutierrez, C. (2004), "Bipartite Graphs as Intermediate Model for RDF," Proceedings of the 3rd ISWC Conference, Springer-Verlag, 47-61.
10. Horrocks, I., Sattler, U., and Tobies, S. (2000), "Reasoning with Individuals for the Description Logic SHIQ," in David MacAllester, editor, Proceedings of the 17th International Conference on Automated Deduction (CADE-17), Lecture Notes in Computer Science, Germany.
11. McHugh, J., Abiteboul, S., Goldman, R., Quass, D., and Widom, J. (1997), "Lore: A Database Management System for Semistructured Data," Vol. 26, No. 3, pp. 54-66. Retrieved from http://www-db.stanford.edu/~melnik/rdf/db.html
12. Yamamoto, N., Tatebe, O., and Sekiguchi, S. (2004), "Parallel and Distributed Astronomical Data Analysis on Grid Datafarm," Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing, pp. 461-466, 2004.
13. Papakonstantinou, Y., Garcia-Molina, H., and Widom, J. (1995), "Object Exchange across Heterogeneous Information Sources," Proceedings of the 11th ICDE, Taipei, Taiwan, IEEE (1995) 251-260.
14. Melnik, S., Raghavan, S., Yang, B., and Garcia-Molina, H. (2001), "Building a Distributed Full-Text Index for the Web," Proceedings of the tenth
15. Bray, T., Paoli, J., and Sperberg-McQueen, C. M. (February 10, 1998), "Extensible Markup Language (XML) 1.0," W3C Recommendation. Retrieved from http://www.w3.org/TR/1998/REC-xml-19980210
16. Rajasekar, A., Wan, M., and Moore, R. (July 2002), "MySRB and SRB - Components of a Data Grid," The 11th International Symposium on High Performance Distributed Computing (HPDC-11), Edinburgh, Scotland.