Xiaoqi Zhang
x.zhang4@pgrad.unimelb.edu.au
8/7/2008
Supervisor: Dr. Egemen Tanin
Distance Join Processing in a P2P World
Abstract
P2P networks have expanded their use to the area of distributed database
systems. The P2P paradigm is famous for its various advantages over the conventional
client-server paradigm in that it provides excellent scalability both in computation and
bandwidth as well as no single point of failure due to decentralization. Spatial data is
widely used today in P2P applications. By exploiting the features of the P2P paradigm,
efficient spatial data retrieval becomes possible. A large body of work exists on
spatial data retrieval over P2P networks, most of it focused on the classic range
query and nearest neighbor query operations. However, to the best of my
knowledge, no work has been done on spatial distance join operations in the
context of the P2P paradigm. This report gives a detailed review of the first distance join
algorithm for P2P networks, along with its implementation. A comprehensive
experiment is carried out at the end to examine different aspects of the algorithm.
1. Introduction
Spatial data has become a critical ingredient in various applications and
databases including location-based services [1], public transportation services,
scientific data management [2,3,4] and digital government [5]. Spatial data is not only
widely used in scientific and government organizations but also by the general
public, for example in in-car GPS systems and by real-estate agencies.
2D worlds and their representations are the most frequently used spatial data in the spatial data processing domain. A 2D representation of a virtual or a real world in an application contains many spatial objects which have positional values. One solution to the bottleneck that the conventional client-server architecture may introduce is to distribute such spatial objects among the machines in a P2P network, so that operations on the spatial data are carried out in a P2P paradigm rather than a client-server paradigm. New P2P applications, e.g., job-employee seeker networks, buyer-seller networks and event/location finders for a city, follow this solution. For example, in a buyer-seller P2P network, information about sellers and products is distributed over the network. A potential buyer may supply his/her location, an area on the map where sellers may be located and some information about the product to a search system, and the system returns a list of the sellers who sell the related products. This type of operation can be done by simply clicking on a 2D map to choose the location and area. Another, similar type of query yields the distance join result, which contains pairs of spatial objects ordered by the distance between the two objects of each pair. Finding the closest bar-restaurant pair is one example of such an application. One straightforward approach to this type of operation is to simply forward messages among the available nodes in the network to locate the desired data. Such an approach is not feasible, because it forces a large number of peers that do not hold the desired data to participate in the operation. In the unpublished paper [6], Tanin et al. proposed an elegant way that exploits the features of P2P networks. They used a data structure called a quadtree [7] to partition the underlying spatial data in the 2D worlds on which distance join queries are carried out. The content of this report is based on [6, 8]. It gives a detailed explanation of the proposed distance join algorithm, and the results of a comprehensive experiment are presented at the end.
The rest of this report is organized as follows. Section 2 gives a brief review of
related work, focusing on sequential distance join algorithms and the distributed quadtree
index; section 3 discusses two other types of query on the distributed quadtree index;
section 4 explains the distance join algorithm and my implementation of it;
section 5 gives the details of the experiments and their results; section 6 gives the
conclusion and future work.
2. Related Work
2.1. Base Sequential Algorithm
Several works have addressed distance join algorithms. Hjaltason and Samet examined various similarity search algorithms in metric spaces in [9], with the main contribution being the use of a priority-queue-based ranking algorithm for spatial data. This algorithm can find the results of a ranking query in an incremental fashion. In [10], they proposed a distance join algorithm that works on hierarchical spatial data structures. In that paper, the authors use a data structure called an R-tree to store the spatial data. A priority-queue-based approach is adopted to drive the ranking algorithm. Pairs of spatial objects and R-tree blocks are inserted into the priority queue, and the distance between the items of each pair is used as the criterion for ordering the queue. At each step of the algorithm, the pair at the head of the priority queue, i.e., the pair with the smallest distance, is retrieved and processed. If the pair is formed by two data objects, then the pair is reported as the next closest pair. If one of the items in the dequeued pair is a node of the R-tree, then that node is substituted by its descendants, i.e., objects or sub-nodes, to form new pairs. This method works in an incremental fashion. Their algorithm has a drawback: pairs in the priority queue are processed sequentially. Thus, in a P2P network, the algorithm works inefficiently due to the accumulated communication delay. The algorithm examined in this report employs a similar priority-queue-based approach, but it is carefully designed to work efficiently in P2P networks by exploiting the parallelism in the network.
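The sequential scheme described above can be illustrated with a minimal Python sketch. It is not the implementation from [10]: the `Node` class, the rectangle encoding `(x1, y1, x2, y2)` and the function names are assumptions made for illustration, and a plain bounding-box hierarchy stands in for the R-tree.

```python
import heapq
import itertools
import math
from dataclasses import dataclass, field


def min_dist(a, b):
    """Smallest distance between two axis-aligned rectangles (x1, y1, x2, y2)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


@dataclass
class Node:
    """Hypothetical hierarchy node; box must cover everything stored below it."""
    box: tuple
    children: list = field(default_factory=list)   # sub-nodes
    objects: list = field(default_factory=list)    # rectangles stored at this node


def box_of(item):
    return item.box if isinstance(item, Node) else item


def incremental_distance_join(root_a, root_b):
    """Yield (distance, object_a, object_b) in non-decreasing distance order."""
    tie = itertools.count()  # tie-breaker so the heap never compares Node objects
    heap = [(min_dist(root_a.box, root_b.box), next(tie), root_a, root_b)]
    while heap:
        dist, _, a, b = heapq.heappop(heap)
        if not isinstance(a, Node) and not isinstance(b, Node):
            yield dist, a, b  # two data objects: report the next closest pair
            continue
        # Substitute one node of the pair by its descendants to form new pairs.
        if isinstance(a, Node):
            new_pairs = [(x, b) for x in a.children + a.objects]
        else:
            new_pairs = [(a, y) for y in b.children + b.objects]
        for x, y in new_pairs:
            heapq.heappush(heap, (min_dist(box_of(x), box_of(y)), next(tie), x, y))
```

Because the function is a generator, a caller can stop after the first few pairs, which mirrors the incremental behaviour described in the text. Every pair is popped and expanded one at a time, which is exactly the sequential step that becomes costly when each expansion requires a network round trip.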
2.2.1. Partition Spatial Data Using Quad-CIF Tree
The distance join algorithm examined in this report is based on the distributed quadtree index proposed in [11]. In [11], a data structure called the quad-CIF tree [12] is used for partitioning spatial data. A quad-CIF tree is a variation of the quadtree [13] and was originally used for speeding up algorithms in the computer-aided design of integrated circuits [12]. A quadtree is a tree data structure in which each node can have at most 4 sub nodes. The quadtree can represent a 2D space in the following way. At the beginning, a root node in the quadtree represents the entire 2D space. The space is then divided into 4 identical sub regions, which corresponds to the root node splitting itself into 4 sub nodes, each one of them corresponding to a sub region. For each one of the sub regions, the same process then proceeds recursively until a certain criterion is met. Figure 1 shows this process. The quad-CIF tree extends the quadtree definition in that it specifies the criteria of when to start the subdivision and when to stop the
[...]
splits itself into 4 identical sub regions; and any one of the 4 sub regions that completely contains the spatial object splits itself again, until no sub region can contain the spatial object in its entirety. The spatial object is then inserted into the node which corresponds to the smallest region that contains the spatial object in its entirety. The process is depicted in figure 2.
[Figures 1 and 2: a 2D region with root node O subdivided into level 1 nodes A, B, C and D, and the corresponding quadtree]
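The insertion rule just described (place the object in the smallest region that contains it entirely) can be sketched as follows. This is an illustrative sketch only: the `QuadNode` class, the quadrant numbering and the `max_depth` cut-off are assumptions and are not taken from [11] or [12].

```python
from dataclasses import dataclass, field


@dataclass
class QuadNode:
    """Square region with lower-left corner (x, y) and side length size."""
    x: float
    y: float
    size: float
    depth: int = 0
    objects: list = field(default_factory=list)
    children: dict = field(default_factory=dict)  # quadrant index 0..3 -> QuadNode


def contains(node, rect):
    """True if the rectangle (x1, y1, x2, y2) lies entirely inside the node's region."""
    x1, y1, x2, y2 = rect
    return (node.x <= x1 and node.y <= y1
            and x2 <= node.x + node.size and y2 <= node.y + node.size)


def child_region(node, i):
    """Region of quadrant i (assumed numbering: 0 = lower left, 3 = upper right)."""
    half = node.size / 2
    return QuadNode(node.x + (i % 2) * half, node.y + (i // 2) * half,
                    half, node.depth + 1)


def insert(root, rect, max_depth=9):
    """Descend while some sub region still contains rect, then store it there."""
    node = root
    while node.depth < max_depth:      # max_depth is a hypothetical safety bound
        for i in range(4):
            child = node.children.get(i)
            if child is None:
                child = child_region(node, i)
            if contains(child, rect):
                node.children[i] = child   # materialise the child only when used
                node = child
                break
        else:
            break  # no sub region contains the rectangle in its entirety
    node.objects.append(rect)
    return node
```

With a root region of side 16, `insert(QuadNode(0, 0, 16), (1, 1, 2, 2))` descends four levels to the unit square at (1, 1), whereas a rectangle that straddles the centre of the region, such as `(7, 7, 9, 9)`, stays at the root.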
In [11], the authors introduce the concept of a “control point” for each region and sub region, which is simply the centroid of the region. As shown in figure 2, each node in the quadtree maintains information about its corresponding control point, denoted as $u$, which can be represented by the following formula:
$D(u) = (\{d_1, d_2, d_3, d_4\}, P(x, y), list)$
This comprises 3 pieces of information: first, the information about the 4 children of the node, denoted as $d_1, d_2, d_3, d_4$, which are integers indicating how many spatial objects the corresponding child has; $P(x, y)$ is the 2D Cartesian position of the point $u$ in the 2D region; and $list$ contains all the spatial objects which are inserted into this quadtree node. This information is crucial for the search algorithms (range query, nearest neighbor query, distance join query). It makes it possible to decide whether to forward a query further down the quadtree. Details are given in section 3.
Chord specification. Figure 3 shows one possible result of hashing control point of
each quadtree node to the Chord virtual circle space. As depicted in the figure, peer1
[Figure 3: control points of the quadtree nodes (O, A, B, C, D, CA, CB, CC, CD) hashed onto the Chord ring and assigned to peer 1, peer 345, peer 1567 and peer m]
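The mapping of control points onto the Chord identifier circle can be sketched as below. This is a hedged illustration: the way the coordinates are serialised before hashing, the function names and the `bits` parameter are assumptions. Only the use of a hash function and the successor rule follow the general Chord scheme of [14].

```python
import hashlib
from bisect import bisect_left


def chord_id(x, y, bits=160):
    """Hash a control point position to an identifier on a 2**bits ring.

    The "x,y" text encoding is an assumption made for this sketch.
    """
    key = f"{x:.6f},{y:.6f}".encode()
    return int.from_bytes(hashlib.sha1(key).digest(), "big") % (2 ** bits)


def successor(sorted_peer_ids, key_id):
    """First peer identifier >= key_id, wrapping around the ring."""
    i = bisect_left(sorted_peer_ids, key_id)
    return sorted_peer_ids[i % len(sorted_peer_ids)]


# Usage on a tiny 8-bit ring with three hypothetical peers.
peers = [10, 50, 200]
owner = successor(peers, chord_id(8.0, 8.0, bits=8))
```

Since every peer can compute the position of a control point from the region geometry alone, any peer can compute the same identifier and route to the owner without a central directory.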
InitiateRangeQuery(query Q)
{
    control point list G = {}
    Subdivide(Q, root, G)
    for each u in G do
        Delegate(u)->DoRangeQuery(Q, u)
}
DoRangeQuery(query Q, control point u)
{
    intersect objects in D(u).list with Q
    send results
    for i = 1 to 4 do
        if (Ints(R(C(u, i)), Q) is not empty) and (D(u).di > 0) then
            Delegate(C(u, i))->DoRangeQuery(Q, C(u, i))
}
Figure 4. Algorithm for range query
4). Upon arrival, peers that receive the forwarded range query return any spatial objects that intersect the query range Q and then, for each child of the queried control point, forward the query Q to those children that have spatial objects and whose controlled range intersects the query. The range query process is shown in figure 5 with Fmin=0. Peer1567 initiates the range query. The translucent rectangle (denoted as “query Q” in the figure) is the query rectangle. In a distributed quadtree index P2P network, every query starts processing from the Fmin level of the quadtree. In this case, Fmin=0, so the query starts from the root node and is passed down the quadtree. Initially, the result of Subdivide contains only control point O, which controls the entire region. Peer1567 then passes the query to the peer in the network which holds the data about control point O. This process is depicted as the curve marked 1 in figure 5. With the help of Chord, the query is routed to peer1, which holds the information about control point O. When the query arrives at peer1, peer1 first examines whether it has any spatial data (in this simplified example, rectangles) that intersects the query rectangle; then it checks whether there are any children of node O whose controlled range intersects the query rectangle Q and which have spatial data. After examining them, peer1 finds that the child C of O meets these requirements. Peer1 then forwards the query to the peer which holds the information about control point C. With Chord, we find that this peer is again peer1. This process is depicted by the curve marked 2. Peer1 repeats the same examination and finds that sub region CD intersects the query Q and has spatial data in it. Peer1 then forwards the query to the peer which holds the information about control point CD, namely peer345. This routing step is depicted by the curve marked 3. When the query arrives at peer345, it finds that it has the spatial object rectangle 1 and that no sub regions have spatial objects. After the result is sent back, the range query stops. As described, the query starts at the root node and is passed down the quadtree in the order O->C->CD.
[Figure 5: range query example with Fmin=0; query Q is initiated by peer 1567 and the curves marked 1, 2 and 3 show the routing steps]
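The walk-through above can be reproduced with a small single-process simulation. This is a sketch under stated assumptions: the coordinates, the `ControlPoint` class and the `visited` trace are invented for illustration, the recursive call stands in for `Delegate(...)` over Chord, and the child counts are recomputed on the fly instead of being stored as the integers d_1 to d_4.

```python
from dataclasses import dataclass, field


def intersects(a, b):
    """True if rectangles a and b, given as (x1, y1, x2, y2), overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class ControlPoint:
    name: str
    region: tuple                                   # range controlled by this point
    objects: list = field(default_factory=list)     # plays the role of D(u).list
    children: list = field(default_factory=list)    # up to 4 child control points

    def count(self):
        """Objects stored at or below this control point (the role of d_i)."""
        return len(self.objects) + sum(c.count() for c in self.children)


def do_range_query(query, u, results, visited):
    visited.append(u.name)
    results.extend(o for o in u.objects if intersects(o, query))
    for child in u.children:
        if intersects(child.region, query) and child.count() > 0:
            do_range_query(query, child, results, visited)  # stands in for Delegate


# Hypothetical layout: only sub region CD holds an object ("rectangle 1").
cd = ControlPoint("CD", (4, 0, 8, 4), objects=[(5, 1, 6, 2)])
c = ControlPoint("C", (0, 0, 8, 8), children=[
    ControlPoint("CA", (0, 4, 4, 8)), ControlPoint("CB", (4, 4, 8, 8)),
    ControlPoint("CC", (0, 0, 4, 4)), cd])
root = ControlPoint("O", (0, 0, 16, 16), children=[
    ControlPoint("A", (0, 8, 8, 16)), ControlPoint("B", (8, 8, 16, 16)),
    c, ControlPoint("D", (8, 0, 16, 8))])

results, visited = [], []
do_range_query((4.5, 0.5, 7, 3), root, results, visited)
print(visited, results)  # ['O', 'C', 'CD'] [(5, 1, 6, 2)]
```

The two-part test in the loop is what keeps the query from visiting empty or non-overlapping sub regions, so only three control points are touched in this layout.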
3.1.2. Implementation
For the implementation part, I use tables to show the features that I implemented;
in the “Extra” column, I add special cases and key points that require
attention.
Table 1 shows the implementation details.
Item Implemented Extra
Routing (Chord) Basic data structures This project does not deal with the
Delegate(u)
DoRangeQuery(Q, u)
Table 1. Implementation details for range query
top will be deleted as soon as they are found). Thus WCDist = Min(d, D). Then, those control points in the priority queue whose controlled ranges lie at a distance less than or equal to WCDist from the query point are contacted in parallel. The entire process is depicted in figure 7 with Fmin=1. Peer345 initiates the nearest neighbor query by calling InitiateNNQuery. GetSortedControlPoints returns a priority queue which contains the level 1 control points, namely A, B, C and D. The status of the priority queue is denoted as “priority queue status 1” in the figure. The first WCDist and the range it covers are denoted by the quadrant marked as “Wcdis1” in the figure. Therefore, SendMessagesWithin forwards the query in parallel to the peers which hold control points C, A and D respectively. As shown in the figure, peer345, peer1 and peer m get this message. The DoNNQuery procedure is then called at each one of them. Each of them creates a reply message, puts into it any spatial objects it has along with any control points which contain spatial objects, and sends it back to the query initiating peer, in this case peer345. Assume the reply message corresponding to control point C arrives at peer345 first (the arrival order may vary due to message delay; however, this does not affect the correctness of the algorithm). ReceiveNNMessage is then called at peer345. After all the control points and spatial objects are inserted into the priority queue, the status of the priority queue is as denoted by “priority queue status 2” in figure 7. Control point C is deleted from the priority queue after the reply message corresponding to it is handled. Then UpdateWCDist is called to update WCDist. The updated WCDist is shown as the smaller quadrant in figure 7, denoted as “Wcdis2”, within which the SendMessagesWithin procedure sends the query only to the peer that holds control point CD (because control points A and D have been contacted previously). This time peer345 is contacted. Assume that, before peer345 returns a reply message, the reply message about control point A arrives at the query initiating peer, which is peer345. According to the algorithm, the spatial objects and control points are inserted into the priority queue. “Priority queue status 3” in the figure shows the status of the priority queue after the insertion. Note that control points D, B and CD are closer to the query point than rectangle 0; thus rectangle 0 is at the end of the priority queue. Now peer345 sends the reply message, containing the spatial object rectangle 1, back to the query initiating peer. After this iteration, the status of the priority queue is shown as “priority queue status 4”. Now a spatial object is at the head of the queue, so it is the nearest spatial object with respect to query point q. The algorithm can now stop or proceed as needed. Because neither control points B and D nor their children possess any spatial objects, B and D are simply deleted when the reply messages corresponding to them are returned. The nearest neighbor query stops automatically when the priority queue is empty.
[Figure 7: nearest neighbor query example with Fmin=1; query point q, the quadrants Wcdis1 and Wcdis2, and priority queue statuses 1 to 4]
3.2.2. Implementation
Table 2 shows the implementation details of algorithm for nearest neighbor
query.
children of a control point. The former is implicitly known by every peer in the P2P network, thus no communication is required. The latter is automatically obtained after distributing the quad-CIF tree among the machines in the P2P network (as mentioned in section 2.2.1, each control point contains information in the form $D(u) = (\{d_1, d_2, d_3, d_4\}, P(x, y), list)$). Therefore, it is very easy for a query initiating peer to forward the distance join query from the root node down the quadtree. Figure 8 is the pseudo code for the P2P distance join algorithm. Initially, there is only one pair in the priority queue, namely, the pair of root control points of the two quadtrees. As the algorithm proceeds, pairs of control points and spatial objects are inserted into the priority queue. Thus, four types of queue element exist: (spatial object, spatial object), (spatial object, control point), (control point, spatial object) and (control point, control point). The processing of a pair in the query initiating peer must be strictly synchronized, in the sense that the replies to messages that are sent as a pair must be processed together. In the P2P distance join algorithm, the elements in the priority queue are pairs of control points and objects. As the algorithm proceeds, messages are sent in pairs, and the reply messages corresponding to previously sent paired messages must be handled together. However, due to the uncertainty in communication delay, reply messages may arrive at the query initiating peer at arbitrary times. Therefore, extra work has to be done to handle reply messages pairwise. My solution is to give the messages that are sent as a pair a unique ID and to cache any single reply message whose paired reply message has not yet been received. Whenever a reply message with the same ID as the cached one is received, the two reply messages belong to one pair and can thus be handled together. This strict synchronization property of pairwise message processing guarantees that the new pairs generated by doCombine will not contain redundant pairs. As shown in the algorithm, pairs in the priority queue are contacted in parallel rather than sequentially. The newly defined variable WCDist is used as the criterion to determine which pairs are contacted. The procedure UpdateWCDist updates WCDist in the following way: let D be the maximum distance between the items of the pair that is at the head of the priority queue and is not an object-object pair, and let d be the maximum distance between the spatial objects of the first object-object pair (if any) found in the priority queue (this pair cannot be at the head, because as soon as an object-object pair is found at the head, it is retrieved as the next closest pair). Then WCDist=Min(D,d). All pairs in the priority queue whose distance between the two items is less than or equal to WCDist are then contacted in parallel, which distinguishes this algorithm from the traditional sequential algorithm.
[Figure 8: Algorithm for distance join query]
[Figure 9: distance join example with data sets X and Y; priority queue Status 1 to Status 4, each listing (SETX, SETY) pairs from head to tail]
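The WCDist rule can be sketched in a few lines. This is an illustration rather than the code of Figure 8: the `Entry` record, the helper names and the choice to skip object-object pairs when contacting peers (they have nothing left to expand) are assumptions. The queue is assumed to be ordered by minimum distance, with any object-object pair at the head already reported and removed.

```python
import math
from typing import NamedTuple


class Entry(NamedTuple):
    min_dist: float        # ordering key of the priority queue
    max_dist: float        # largest possible distance between the two items
    is_object_pair: bool   # True for (spatial object, spatial object)
    label: str             # hypothetical identifier of the pair


def max_dist(a, b):
    """Largest distance between points of rectangles a and b, (x1, y1, x2, y2)."""
    dx = max(a[2] - b[0], b[2] - a[0])
    dy = max(a[3] - b[1], b[3] - a[1])
    return math.hypot(dx, dy)


def update_wcdist(queue):
    """WCDist = Min(D, d) for a non-empty queue sorted by min_dist."""
    big_d = queue[0].max_dist  # head pair, assumed not to be an object-object pair
    small_d = next((e.max_dist for e in queue if e.is_object_pair), math.inf)
    return min(big_d, small_d)


def pairs_to_contact(queue, already_contacted=()):
    """Labels of the pairs that may be contacted in parallel in this iteration."""
    wcdist = update_wcdist(queue)
    return [e.label for e in queue
            if not e.is_object_pair
            and e.min_dist <= wcdist
            and e.label not in already_contacted]


queue = [Entry(1.0, 5.0, False, "C-B"), Entry(1.0, 6.0, False, "A-B"),
         Entry(4.0, 4.5, True, "X0-Y0"), Entry(7.0, 9.0, False, "D-B")]
print(update_wcdist(queue), pairs_to_contact(queue))  # 4.5 ['C-B', 'A-B']
```

In this toy queue the object-object pair tightens the bound from 5.0 to 4.5, and the pair "D-B" is left alone because even its closest points are farther apart than WCDist.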
Figure 9 shows a simple case to demonstrate the distance join algorithm. There are 2 sets of data, depicted using two different colors. Rectangles X0 and X1 belong to data set X. Rectangles Y0 and Y1 belong to data set Y. At the beginning, procedure JoinInit is called at the query initiating peer. As shown in the pseudo code, the peers that own the root control point of each data set are contacted first; in this case, these are the two control points O of the two data sets. Two distance join initialization messages are sent with the same unique ID (for processing messages in pairs). Whenever a peer receives a distance join related message, procedure ProcessReply is called; it puts any spatial objects, along with any child control points which contain spatial objects, into a reply message and sends it back to the query initiating peer. Procedure RecvMessage is called at the query initiating peer upon receiving a reply message. Because reply messages corresponding to pairwise sent messages can be delayed arbitrarily, a message cache is used to temporarily store the reply message that arrives early, so that the messages can be processed in pairs (the unique ID is used to pair messages). Assume the reply message from the peer that owns control point O of data set X arrives first, and that of data set Y arrives second. The algorithm then finds the paired reply messages and calls procedure doCombine to generate new pairs from the reply messages. After processing the messages, it deletes the processed element from the queue. One of the possible statuses of the priority queue is now denoted as “Status 1” in figure 9 (the order can also be (A,B),(C,B), because the distance between control blocks A and B is equal to that between C and B). Then the worst case distance WCDist is calculated; the result is denoted in the figure as WCDist1, which is the maximum distance between control blocks C and B. Then the pairs in the priority queue whose distance between the two items is less than or equal to WCDist1 are contacted. Thus the peer that holds control point C in data set X and the peer that holds control point B in data set Y are contacted. The same applies to pair (A,B). At this point, the first iteration of the algorithm finishes. Note that the same control point of one data set may appear in more than one pair in the priority queue and thus could potentially be contacted multiple times, which causes communication overhead. To overcome this problem, the results of previously contacted control points are stored locally at the query initiating peer to eliminate unnecessary communication. In the next iteration, assume the paired reply messages for (C,B) arrive first (the algorithm also works correctly if the paired reply messages for (A,B) arrive first). “Status 2” in figure 9 shows the content of the priority queue after receiving the reply messages for (C,B). “Status 3” shows the content after receiving the reply messages for (A,B). Note that a new iteration may begin when the queue is in “Status 2”, in which case the previously contacted pair (A,B) will not be contacted again. Assume the new iteration begins after “Status 3”. The corresponding updated WCDist is denoted as “WCDist2” in the figure, which is the maximum distance between rectangle X0 and control block BA. Again, pairs in the priority queue that satisfy the worst case criterion are contacted. In this case, all 4 pairs are contacted. For clarity and simplicity, we only look at pair (rectX0, BA). When the reply message for control point BA is received and procedure doCombine has been called, the content of the queue is as denoted in the figure by “Status 4”. As shown in the figure, an object-object pair appears at the head of the queue; this is the closest pair across the two data sets. Once such a pair is found, it is retrieved immediately, and the algorithm should allow the user to decide whether to proceed with or stop the distance join algorithm.
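The message cache used to pair reply messages can be sketched as a small class. This is a minimal illustration, not the report's implementation: the class name, the `"X"`/`"Y"` side tags and the use of a counter for the unique ID are assumptions.

```python
import itertools


class PairCache:
    """Holds the early reply of a pair until its partner with the same ID arrives."""

    def __init__(self):
        self._pending = {}              # pair_id -> {side: reply}
        self._ids = itertools.count(1)  # hypothetical source of unique pair IDs

    def new_pair_id(self):
        return next(self._ids)

    def receive(self, pair_id, side, reply):
        """Store one reply; return (reply_x, reply_y) once both sides are present."""
        waiting = self._pending.setdefault(pair_id, {})
        waiting[side] = reply           # side is assumed to be "X" or "Y"
        if len(waiting) == 2:
            del self._pending[pair_id]
            return waiting["X"], waiting["Y"]
        return None                     # partner reply has not arrived yet

    def pending(self):
        return len(self._pending)


cache = PairCache()
pid = cache.new_pair_id()
first = cache.receive(pid, "Y", "reply for control point B")   # None: cached
both = cache.receive(pid, "X", "reply for control point C")    # complete pair
```

Returning the replies in a fixed (X, Y) order regardless of arrival order means that the combining step always sees the two data sets in the same positions.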
The simple example described above starts the query from the root control point of each data set. The distributed quadtree index allows spatial data to be inserted from the Fmin level of the quadtree rather than from the root level (the two coincide when Fmin=0). Therefore, a slight modification of the algorithm is needed to allow a query to start from the Fmin level rather than the root level, which avoids the communication overhead of forwarding the query from level 0 to the Fmin level.
4.2. Implementation
The table below shows the implementation details of the P2P distance join algorithm.
Item | Implemented | Extra
Routing (Chord) | Basic data structures, find_predecessor | This project does not deal with the issues that arise when node join or leave the Chord network, only
Indexing | Basic data structures | Quadtree, control point. Quadtree node, rectangle, Fmin, Fmax, etc.
Algorithm | Basic data structures, UpdateWCDist() | Implementation strictly follows the protocol defined
Table 3. Implementation details of P2P distance join algorithm
5. Experiments
5.1. Experimental Environment
[Figure: transit-stub network topology, showing a transit domain with transit nodes and stub domains with stub nodes]
Parameter Value Unit
Network delay in local area network 10 ms
Network delay between stub nodes 40 ms
Network delay between stub node and transit node 200 ms
Network delay between transit nodes 200 ms
Bandwidth in local area network 54 Mbps
Bandwidth between stub nodes 100 Mbps
Bandwidth between stub node and transit node 100 Mbps
Bandwidth between transit nodes 1000 Mbps
Table 4. Physical parameters for underlying network
For test data sets, obtaining real life data can be tricky. Thus a solution must be found to generate near real life test data sets, for example, the distribution of all the restaurants in urban Melbourne and of all the seven-eleven convenience stores in urban Melbourne. Merely adopting the random functions provided by an API can only yield uniformly distributed data, which cannot reflect the genuine performance of this algorithm in the real world. According to Zipf's law [18], many types of data studied in the physical and social sciences can be approximated with a Zipfian distribution. My test data sets are generated roughly following the Zipfian distribution. A 2D region is divided into 8 square rings which share a centroid (the innermost one becomes a square). A fixed number of spatial objects are distributed in the following manner: the number of spatial objects in an inner square ring is roughly twice that in its immediate outer square ring; and within a given square ring, a random function API is used to generate the spatial data. By doing this, spatial objects are densely distributed in the central area of the 2D region and sparsely distributed in the outer region, which simulates the real life data distribution. Figure 11 shows one example of a distribution of 400 spatial objects that follows the Zipfian distribution.
Figure 11. Sample test data with 400 spatial objects
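The ring-based generation scheme can be sketched as follows. This is a hedged illustration, not the generator used for the experiments: it produces object centre points rather than rectangles, and the region side length, the seed, the rounding rule and all function names are assumptions.

```python
import random


def ring_counts(total, rings=8):
    """Split total so that ring k (0 = innermost) gets about twice ring k + 1."""
    weights = [2 ** (rings - 1 - k) for k in range(rings)]
    scale = total / sum(weights)
    counts = [int(w * scale) for w in weights]
    counts[0] += total - sum(counts)  # assumed rule: remainder goes to the centre
    return counts


def random_point_in_ring(rng, center, inner_half, outer_half):
    """Uniform point with inner_half <= max(|dx|, |dy|) <= outer_half (rejection)."""
    while True:
        dx = rng.uniform(-outer_half, outer_half)
        dy = rng.uniform(-outer_half, outer_half)
        if max(abs(dx), abs(dy)) >= inner_half:
            return center[0] + dx, center[1] + dy


def generate_points(total, side=1024.0, rings=8, seed=0):
    """Generate total points in a side x side region, denser towards the centre."""
    rng = random.Random(seed)
    center = (side / 2, side / 2)
    step = side / 2 / rings           # half-width added by each ring
    points = []
    for k, n in enumerate(ring_counts(total, rings)):
        for _ in range(n):
            points.append(random_point_in_ring(rng, center, k * step, (k + 1) * step))
    return points


print(ring_counts(400))  # [203, 100, 50, 25, 12, 6, 3, 1]
```

For 400 objects this places roughly half of them in the innermost square, which matches the qualitative description of a dense centre and a sparse outer region.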
Generally speaking, the experiments are conducted by changing the following parameters: Fmin; the number of peers in the P2P network; the number of queries simultaneously initiated; and the number of spatial objects in each data set. A query is said to be finished when the top 10 closest pairs are found. In addition, peers are almost equally allocated to stub nodes, and the number of queries from each stub domain is roughly the same.
5.2. Results
5.2.1. Different Fmin:
The first experiment examines how Fmin affects the algorithm. There are 200 peers in the network, which are uniformly distributed among the stub domains. Each data set contains 200 spatial objects. The number of simultaneous client requests is set to 10 and Fmax is set to 9. The philosophy behind the variable Fmin is to avoid a single point of failure. With Fmin, the spatial objects are forced to be inserted at the Fmin level or deeper in the quadtree. Therefore, queries are no longer processed from the root node; multiple peers in the network are contacted as soon as the queries start. One effect of increasing Fmin is that bigger spatial objects are split into smaller pieces, and these pieces fall deeper down the quadtree, which increases the height of the quadtree and in turn the complexity of the algorithm. Another effect is that more messages have to be sent before actual spatial data is retrieved, which causes communication overhead.
[Plot: Changing Fmin; x-axis: Fmin (0 to 8); y-axis: Average Response Time]
Figure 12. Average query response time as Fmin increases
As can be observed in figure 12, different values of Fmin do not affect the average processing time much: as Fmin increases, the average response time curve remains roughly steady. However, as Fmin reaches its maximum, a slight increase is observed. This is due to the longer query message propagation delay introduced when queries are forwarded from the root level to the Fmin level in the distributed quadtree, where the spatial objects are actually located. For the first few values of Fmin, there is no significant difference in average response time, for three reasons: 1. finding the first 10 closest pairs is quite different from finding all the pairs; 2. Fmin does not affect the test data set significantly before it reaches a certain value, because the test data set contains many small spatial objects; 3. even if spatial objects are split into smaller pieces, which causes communication overhead (shown in figure 13), the parallel communication property of the algorithm compensates for this overhead with regard to average response time.
[Plot: x-axis: Fmin (0 to 8); y-axis: Messages Per Request]
Figure 13. Average number of messages for finishing one query as Fmin increases
Figure 13 shows the variation in the number of messages per query (each query finds the first 10 closest pairs) as Fmin increases. As expected, the number of messages increases when Fmin increases. For the first few cases, Fmin does not affect the number of messages much. However, once it reaches 5, there is a relatively steep increase, because the underlying 2D space is divided into many tiny squares and the height of the distributed quadtree increases accordingly.
[Figure 14: standard deviation of load as Fmin increases; x-axis: Fmin (0 to 8)]
For different values of Fmin, figure 14 shows the load distribution in terms of the standard deviation. As can be observed, as Fmin increases, the standard deviation drops gradually, which means the load among peers in the network tends to be more balanced.
Figure 15 shows the actual load for peers in the network. There are 15 slots on the x-axis, each of them representing a range of the number of messages that a certain number of peers have received for finishing 10 queries. Each slot potentially has 9 bars indicating the load for the different values of Fmin. For example, to see the load distribution for Fmin=0, one needs to look at the first bar in every slot. As shown in the figure, there are around 80 peers in the network which receive at most 10 messages, around 7 peers which receive more than 10 but at most 20 messages, etc. A general trend can be seen: as Fmin increases, more and more peers in the network handle more messages. When Fmin=0, 81 out of 200 peers handle less than 10 messages, and no peer handles more than 5120 messages. When Fmin reaches 8, only 14 peers in the network handle less than 10 messages, while 47 out of 200 peers handle more than 5120 messages in total. The load increases along with Fmin; however, it is roughly uniformly distributed across the network.
[Bar chart: number of peers per message-count slot, one bar per Fmin value]
Figure 15. Load distribution for finishing 10 queries with different Fmins
[Plot: x-axis: Fmin (0 to 8); data labels in the extracted figure: 277.483, 269.224, 196.662, 171.358]
Figure 16. Average response time per query for P2P distance join algorithm in
comparison to centralized sequential algorithm
[Plot: x-axis: number of peers (200 to 1000); data labels in the extracted figure: 27.498, 25.014, 25.538, 22.796, 21.007]
Figure 17. Average response time per query as number of peers increases
5.2.4. Different Number of Simultaneous Queries:
The second scalability experiment examines how well the algorithm scales as the number of simultaneous queries increases. Again, Fmin is set to 2; Fmax is set to 9; there are 200 spatial objects in the 2D space; the number of peers in the network is set to 200; and a query counts as finished once the first 10 closest pairs are found. The result is shown in figure 18. In the figure, there is a drop at the beginning. One possible reason for this drop in average response time is that most of the queries are forwarded to the same peers that previously forwarded the same messages. The rest of the curve remains steady.
[Plot: x-axis: number of simultaneous queries (5, 10, 20, 40, 80); data labels in the extracted figure: 26.153, 24.966, 24.928, 24.703, 24.365]
Figure 18. Average response time per query as number of queries increases
[Plot: x-axis: number of objects (200 to 1000); y-axis in seconds; data labels in the extracted figure: 23.555, 22.717, 21.813, 21.404]
Figure 19. Average response time per query as number of objects increases
Figure 19 shows the result. As expected, as the number of spatial objects increases, the general trend in average response time is decreasing, apart from a sudden increase when the number of objects is set to 600, which is possibly caused by the randomness in the distribution of spatial objects among the machines in the P2P network.
Although the average response time decreases as more and more spatial objects are inserted into the network, the number of messages generated for finishing one query increases (shown in figure 20). The reason is intuitive. As more spatial objects are inserted, more quadtree blocks (control points) need to be inserted into the network, including both the quadtree blocks that contain spatial objects and those whose children contain spatial objects. Therefore, either the distributed quadtree becomes fuller or the height of the quadtree increases. In either case, more messages are needed to finish one query.
[Figure 20: Changing Number of Spatial Objects (messages/request); x-axis: number of objects (200 to 1000); y-axis: Average Number of Messages Per Request; data labels in the extracted figure: 45,652, 32,512, 27,648, 16,471, 7,480]
References
[1]. Front Page of Business Link. Business Link Web Site. [Online]
http://www.businesslink.gov.uk.
[2]. Wilson, Jim. Front Page of National Aeronautics and Space Administration.
NASA Official Web Site. [Online] http://www.nasa.gov.
[3]. Front Page of National Institutes of Health. Official Web Site of National
Institutes of Health. [Online] http://www.nih.gov.
[4]. Front Page of National Geospatial Intelligence Agency. Official Web Site of
National Geospatial Intelligence Agency. [Online] http://www.nga.mil.
[5]. Front Page of National Institute of Justice. Official Web Site of National
Institute of Justice. [Online] http://www.ojp.usdoj.gov/nij.
[6]. Egemen Tanin and Deepa Nayar. An Efficient Distributed Distance Join
Algorithm for Peer-to-Peer Networks.
[7]. Raphael Finkel and J.L. Bentley. Quad Trees: A Data Structure for Retrieval on
Composite Keys. Acta Informatica 4 (1): 1-9.
[8]. E. Tanin, A. Harwood, H. Samet, D. Nayar, and S. Nutanong. Building and
querying a P2P virtual world, Geoinformatica, 2006, 10(1):91-116.
[9]. G.R. Hjaltason and H. Samet. Index-Driven Similarity Search in Metric Spaces,
ACM Tran. On Database Systems, Dec 2003, Vol.28, No. 4, pp. 517-580.
[10]. G.R. Hjaltason and H. Samet. Incremental Distance Join Algorithms for Spatial
Databases, Proc. of the ACM SIGMOD Conference, Seattle, WA, 1998, pp.
237-248.
[11]. E. Tanin, A. Harwood and H. Samet. A distributed quadtree index for
peer-to-peer settings, in Proceedings of the IEEE International Conference on
Data Engineering, Tokyo, Japan, April 2005, pp. 254-255.
[12]. Gershon Kedem. The Quad-CIF Tree: A Data Structure for Hierarchical On-Line
Algorithms, University of Rochester, Rochester, New York 14627.
[13]. Raphael Finkel and J.L. Bentley. Quad Trees: A Data Structure for Retrieval on
Composite Keys, Acta Informatica 4(1): 1-9.
[14]. Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek and Hari
Balakrishnan. A scalable peer-to-peer lookup service for Internet applications,
in Proceedings of the ACM SIGCOMM 01, San Diego, CA, August 2001, pp.
149-160.
[15]. Secure Hash Standard, FIPS PUB 180, by US government standards agency
NIST (National Institute of Standards and Technology).
[16]. Zegura EW, Calvert KL and Donahoo MJ. A quantitative comparison of
graph-based models for Internet topology. IEEE/ACM Trans. on Networking,
1997, 5(6):770-783.
[17]. Looking Glass and Network Information. Rogers Communications Inc. [Online]
https://supernoc.rogerstelecom.net/ops/.
[18]. G.K. Zipf. Human Behavior and the Principle of Least Effort,
Addison-Wesley, MA, 1965.