
Decentralized Bootstrapping in Clouds

Péter Szilágyi


PhD School of Informatics, Eötvös Loránd University, Budapest, Hungary
Department of Mathematics and Computer Science, Babeş-Bolyai University, Cluj-Napoca, Romania

Abstract: One of the challenges of deploying distributed systems is the difficulty for nodes and services to find one another (commonly called bootstrapping). There has been extensive research on the topic, resulting in solutions based on broadcasting/multicasting, rendezvous servers and domain-name-based service discovery, among others. Still, currently deployed solutions are generally centralized or have some failure-sensitive points. This paper presents a completely decentralized solution based on the random address probing algorithm, with extensions that both circumvent some of the shortcomings of the baseline algorithm and take advantage of the environmental specifics of cloud architectures.

I. INTRODUCTION

The concept of distributed systems is not new, but with the advent of cloud computing it has been elevated to a whole new level. Small and medium sized companies - and even individuals - can have access to previously unimaginable computing capacities at a relatively affordable price. This widespread access among mainstream customers has led to a demand for easy-to-use distribution libraries. Although many different distribution patterns exist, their discussion is out of the scope of this paper. The only important thing worth mentioning is that most of the deployed solutions are either fully centralized or have some critical central components. Beside the possible failure scenarios, the main disadvantage of such approaches lies in the associated high maintenance costs. One possible direction worth exploring is the peer-to-peer (P2P) model, which has already proven itself a worthy contender (e.g. BitTorrent and Skype), but is yet to become widely used. This paper is the first in a series (discussing P2P models in cloud environments), where the challenges and existing solutions for bootstrapping distributed systems are presented, with suggested modifications and extensions to cater for - and take advantage of - the features provided by cloud operators. Emphasis is placed on ease of use, meaning that robustness and usability are sometimes preferred over pure performance. In the following sections the existing solutions and their shortcomings are presented, followed by a few of the assumptions made when dealing with cloud environments, which influenced the chosen algorithms and their extensions. Afterwards, the chosen baseline algorithm is detailed, analyzing its performance and highlighting its weaknesses. Finally, two extensions which eliminate the mentioned shortcomings are presented. Concluding the paper is a summary of the results and possible future directions.

II. EXISTING SOLUTIONS

One of the most basic solutions is the rendezvous server model, used in the initial versions of the Napster [1] [2], BitTorrent [3] [4] and Skype [5] protocols. This method relies on a centralized bootstrapping server which each peer reports to and retrieves the list of participating nodes from. The main benefits of the model are its simplicity and support for network address translation (irrelevant here). The drawback is the single point of failure. Although the single point of failure can be circumvented with replication, the complexity and maintenance costs rise dramatically. Additionally, in cloud environments it can be hard to define fixed points of contact. For the sake of completeness, the above method can be extended with peer caches, where previously seen nodes are saved and queried first during start-up. The extension is however useless in the cloud churn model: if a node restarts and gets a new IP, chances are high that others suffered the same; if a node starts up after a long downtime, again the IPs have probably changed; finally, if a brand new node starts, the cache will be empty. Thus, although the extension might prove useful in certain cases, the bother outweighs the potential benefits. Broadcasting and/or multicasting solutions [6] [7] might work in certain cases, but their portability will suffer since they require appropriate routing hardware and configurations, which may or may not be available at certain cloud operators. The challenges become even harder if support for inter-cloud federation is desired. Domain name based service discovery [8] [9] has also been ruled out due to its potentially high configuration complexity and overhead. Suggestions were also made to create a large global bootstrapping service, which would then in turn bootstrap smaller peer-to-peer networks [10]. The main issue here is privacy: such a solution would potentially leak too much information about a private or internal P2P network, and in enterprise environments public knowledge about internal systems is not an option. The only solutions that seem to scale well and remain completely decentralized are all based on the concept of address probing: nodes use some heuristic to generate potential IP addresses and use a trial and error approach to find their peers. This paper introduces a couple of these and their combinations, as well as extensions, in Section IV.

III. MODELS AND METRICS

Peer-to-peer algorithms make few assumptions about the distribution of the allocated IPs, but in cloud environments an important phenomenon is observable: address allocations are far from uniform. Usually compute units are rented not one by one but in batches, resulting in the IP addresses within a single batch clustering near each other. Eventually, long running applications will likely have addresses more uniformly distributed, but some degree of clustering would still remain. To model this effect, the algorithms were tested and analyzed on three different IP distributions for comparison:

- A single Gaussian cluster (with its mean at the center of the IP space and a variance of 0.3 nodes) to model short lived applications.
- The classical uniform distribution to model infinitely long running applications.
- Multiple clusters modeled with a Gaussian mixture (uniform random means with the same variances as above) to model average applications (Fig. 1).
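To make the three test distributions concrete, the sketch below shows one way the simulated address samples could be generated. It is only an illustration of the setup described above, not the author's simulator; the space, n_clusters and rel_std parameters are assumed values.

    import numpy as np

    def sample_addresses(kind, n_nodes, space=2**20, n_clusters=7, rel_std=0.01, seed=None):
        """Draw n_nodes host addresses from one of the three test distributions."""
        rng = np.random.default_rng(seed)
        if kind == "uniform":                      # infinitely long running applications
            addrs = rng.integers(0, space, n_nodes)
        elif kind == "normal":                     # single Gaussian cluster, short lived applications
            addrs = rng.normal(space / 2, rel_std * space, n_nodes)
        elif kind == "clustered":                  # Gaussian mixture, "average" applications
            means = rng.uniform(0, space, n_clusters)
            picks = rng.integers(0, n_clusters, n_nodes)
            addrs = rng.normal(means[picks], rel_std * space, n_nodes)
        else:
            raise ValueError(kind)
        # Clip to the address space and de-duplicate, as two peers cannot share an IP.
        return np.unique(np.clip(addrs, 0, space - 1).astype(int))

    # Example: a clustered sample of 2^10 peers in a 20-bit host space.
    peers = sample_addresses("clustered", n_nodes=2**10)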
Fig. 1. Samples of the multi-clustered IP distribution, achieved with mixtures of 7 Gaussians over a 20-bit host address space. The three plots represent different runs of the same simulation.

One of the investigated aspects of the bootstrapping algorithms was the convergence time: the number of cycles it takes the system to reach a stable state, defined by the following criteria. If the system is modeled with a directed graph, where the nodes are the peers participating in the system and the edges are the peers' knowledge about each other, then:

- Cluster convergence is obtained when the graph reaches strong connectivity, meaning that given two random nodes in the graph, there is at least one path between them.
- Stability convergence is obtained when the graph reaches k-connectedness, meaning that the removal of any k-1 nodes still does not violate the strong connectivity.

Stability convergence results have not been presented in this paper due to space limitations, but also because in a cloud environment - where churn does not occur as dynamically and as much as in classical peer-to-peer systems - the cluster convergence was deemed more important (i.e. fast start-up). The other analyzed aspect of the algorithms was their ability to react to churn: the number of cycles it takes for the system to converge if some percentage of the nodes is removed or new ones are inserted. The tested percentages ranged from -50% to +50% in increments of 10%. The algorithms were tested on networks ranging from 2^2 to 2^14 peers, but because of the high costs of renting nodes from real cloud operators, the results were obtained through simulations, each scenario run 100 times and averaged out. Due to space limitations only the 20-bit subnet charts are presented, but similar results have been obtained on both smaller and larger networks.
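The cluster convergence criterion can be checked directly on the knowledge graph. The following sketch is a minimal illustration (not part of the paper) assuming the graph is given as an adjacency map in which every peer appears as a key: the graph is strongly connected exactly when some peer reaches every node along the edges and is also reached by every node along the reversed edges.

    from collections import deque

    def reaches_all(adjacency, start):
        """BFS over the adjacency map; True if every node is reachable from start."""
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for peer in adjacency.get(node, ()):
                if peer not in seen:
                    seen.add(peer)
                    queue.append(peer)
        return len(seen) == len(adjacency)

    def cluster_converged(adjacency):
        """Strong connectivity check; assumes every peer appears as a key in adjacency."""
        if not adjacency:
            return True
        start = next(iter(adjacency))
        reverse = {node: set() for node in adjacency}
        for node, peers in adjacency.items():
            for peer in peers:
                reverse[peer].add(node)
        return reaches_all(adjacency, start) and reaches_all(reverse, start)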

IV. ALGORITHM AND EXTENSIONS

This section presents the chosen baseline algorithm, random address probing. After highlighting two of its major issues, namely the high network load and (given the studied cloud environment) the potentially slower convergence time, two extensions are suggested: partial views for the first and local scanning for the second.

A. Random Address Probing

The baseline probing algorithm is fairly simple (Fig. 2). Each node that takes part in the system generates uniform random IP addresses and tries to connect to them. If a connection succeeds, both the connecting node and the one connected to start monitoring each other through a heartbeat mechanism. The algorithm is run indefinitely, ensuring that joining nodes are discovered quickly, whilst resilience against crashing or leaving nodes is achieved with the heartbeats.

Fig. 2. The random address probing baseline algorithm. The 3x3 blocks represent the IP address space, where light nodes are unallocated IPs and dark ones are actual peers. At t = 0 both nodes missed, but at t = 1 one of the nodes found the other. At t = 2 both keep on probing for new nodes. After a few cycles (t = 11) the heartbeat kicks in, and the neighbours ping each other.
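A minimal sketch of the probing loop described above, under assumed details the paper leaves open: a fixed cycle length, a hypothetical TCP service port used as the connection test, and an in-memory neighbour set (the heartbeat machinery is omitted).

    import random
    import socket
    import time

    PORT = 30303          # hypothetical service port, not specified in the paper
    CYCLE_SECONDS = 1.0   # assumed cycle length

    def random_probe_cycle(subnet_base, subnet_size, neighbours):
        """One cycle of the baseline algorithm: probe one uniform random address."""
        host = subnet_base + random.randrange(subnet_size)
        addr = socket.inet_ntoa(host.to_bytes(4, "big"))
        try:
            with socket.create_connection((addr, PORT), timeout=0.2):
                neighbours.add(addr)   # hit: start monitoring this peer
        except OSError:
            pass                       # miss: unallocated address or unreachable peer
        return neighbours

    def run(subnet_base, subnet_size):
        neighbours = set()
        while True:                    # run indefinitely; heartbeats (not shown) handle failures
            random_probe_cycle(subnet_base, subnet_size, neighbours)
            time.sleep(CYCLE_SECONDS)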

Multiple variations of the algorithm exist, some optimizing for geographical proximity [11], others for IPv6 address spaces [12], but in cloud environments these issues do not appear (only a relatively constrained IPv4 space is needed [13] and proximity is non-trivial to define). Intuition would suggest that the above algorithm - based solely on uniform sampling - would perform worst (converge slower) on clustered IPs. The results however show that this is not so obviously the case; and considering the spread (not presented here), the differences between run-times are statistically insignificant (Fig. 3). Nonetheless, with a few extensions this will change.
Fig. 3. The performance of the baseline algorithm on different IP address distributions. The blue plot (uniform sampling) is considered the base of the comparison and the slowdown ratio of the others compared to the base has been plotted (e.g. a value of 0.1 would suggest a 10% slower convergence compared to the base).

As a reference for later, the convergence times of the baseline probing algorithm after churn (both joining and leaving nodes) are plotted in figure 4.
Fig. 4. Time it takes the system to fully converge after a churn event. Since the heartbeat mechanism fires every few cycles, leaving nodes usually get detected in constant time, independent of network size. The exceptions are very small networks, where a large churn can lead to partitioning, which requires more time to heal. Convergence after nodes joining is network-size dependent: with more nodes, more probing is executed per cycle.

Fig. 5. Time it takes the system to fully converge after a churn event for various partial view sizes: (a) view size 6, (b) view size 8, (c) view size 10. A partial view of 6 is sensitive to both nodes leaving (the network may partition) and nodes joining (some peers may be referenced multiple times, others not at all). Maintaining 8 neighbours is already more stable, whilst 10 approximates the stability of the system with full world knowledge.

Since the baseline algorithm keeps account of all nodes it has ever connected to (i.e. full world knowledge), the associated maintenance costs quickly rise: both the local storage used and the network traffic generated. This might not be a problem with smaller networks, but it is definitely an undesirable side effect that can be eliminated.

B. Partial World Knowledge

In the baseline model, the primary reason for the high network load is the combination of the heartbeat mechanism and the maintenance of full knowledge about the world. As the nodes find each other, the number of heartbeat packets in the network grows quadratically (i.e. each node pings everyone it knows). A solution to prevent such network encumbrance is to allow each node in the network to maintain contact with only a handful of other peers at any one time. This reduces the communication requirements from a quadratic to a linear load. Of course, since the nodes are now limited to a partial view of the network, the time required to converge into a single cluster increases accordingly. Interestingly, even with a relatively tiny partial view, the cluster formation time is close to that of full world knowledge, approaching it asymptotically. The reactability of the modified algorithm was plotted (Fig. 5) for various view sizes for a network of 2^12 nodes. As expected, smaller views (Fig. 5a, 5b) behave more chaotically, but a slight increase in view size results in quick stabilization, as seen on figure 5c.

Compared to the baseline algorithm, the communication costs of the system as a whole become close to insignificant noise (e.g. 1.2 KB per cycle per node with a view of 10). Another possibility to further reduce network traffic is splitting the partial view into two parts, defining a start-up phase and a maintenance one. As long as the number of peers found does not cross a threshold value, the probing is executed as previously. After this limit is reached, the algorithm switches into maintenance mode with a reduced rate of sampling. Looking at the obtained results, this too increases the convergence time of the previous algorithm (Fig. 6a), but only slightly. The network load too, although smaller, is not significantly so (Fig. 6b).
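The two ideas above can be sketched as follows; the view size, threshold and probing rates are assumed values for illustration (the simulations above use views around 6-10 and thresholds of 4-8, but the exact bookkeeping is not specified in the paper).

    import random

    VIEW_SIZE = 10          # maximum number of neighbours kept (assumed)
    THRESHOLD = 6           # start-up/maintenance switch point (assumed)
    STARTUP_RATE = 8        # probes per cycle while bootstrapping (assumed)
    MAINTENANCE_RATE = 1    # probes per cycle once the threshold is crossed (assumed)

    class PartialView:
        def __init__(self):
            self.peers = set()

        def add(self, addr):
            """Keep at most VIEW_SIZE neighbours; evict a random one when full."""
            if addr in self.peers:
                return
            if len(self.peers) >= VIEW_SIZE:
                self.peers.discard(random.choice(tuple(self.peers)))
            self.peers.add(addr)

        def probe_rate(self):
            """Split view: probe aggressively until THRESHOLD peers are known."""
            return STARTUP_RATE if len(self.peers) < THRESHOLD else MAINTENANCE_RATE

        def heartbeat_targets(self):
            """Heartbeats go only to the partial view, so traffic grows linearly."""
            return tuple(self.peers)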
Fig. 6. The impact of splitting the view on the cluster convergence time and the maintenance costs: (a) impact of different view splits on the convergence time, (b) impact of different view splits on the network load. The algorithm with the whole partial view (i.e. non-split) is taken as the baseline. The lower the start-up threshold is, the larger the convergence time becomes, but inversely, the load on the network becomes lower.

The results presented above on figure 6 would suggest that the impact of split views is insignificant enough not to be worth the extra added complexity. However, it does introduce a potentially useful advantage: controlling the start-up and maintenance speeds separately. A very aggressive probing speed could prove invaluable to jumpstart a system (or to quickly integrate a joining node), but the maintenance mechanism is then needed to back down from the high network load.

C. Local Scanning

An issue already mentioned with the baseline algorithm is its complete randomness. This can either cause small anomalies if the IP allocations are clustered together, or in the best case perform the same as with uniformly distributed IP addresses. Still, it might prove valuable not to ignore the possibility of clusters.

A proposed extension of the baseline probing algorithm is the introduction of a bit of determinism, whilst also taking advantage of the potential IP proximities: beside executing the random probing, each node will also scan nearby IPs for possible neighbors (Fig. 7). After the two closest IPs (one in each direction) are found, the scanning stops. This will quickly chain together clustered peers, allowing the random probing to find large blocks at once.

Fig. 7. The local scanning extension for the baseline algorithm. Each row represents the whole IP address space at a different time instance, where light nodes are unallocated IPs and dark ones are actual peers. At t = 0 the nodes scan to the left, with all but the last finding a neighbor (linking the two). At t = 1 most nodes scan right, but since the third already found its right neighbor, it looks leftwards (with success). The algorithm continues until either all nodes connect, or they reach the end of the address space.
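A sketch of the scanning step, assuming a generic try_connect(addr) predicate in place of whatever transport the real system uses. For brevity the walk runs to completion in one call, whereas the algorithm described above advances the scan incrementally over the probing cycles.

    def local_scan(own_addr, space_size, try_connect):
        """Find the closest allocated peer on each side of own_addr, if any."""
        found = {}
        for direction in (-1, +1):                 # scan left, then right
            addr = own_addr + direction
            while 0 <= addr < space_size:
                if try_connect(addr):              # peer found: link and stop this direction
                    found[direction] = addr
                    break
                addr += direction                  # unallocated: keep walking outward
        return found                               # e.g. {-1: 1034, +1: 1041}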

Two important results have been reached with the scanning extension. First, since the algorithm takes advantage of nearby peers, it does matter what the IP distribution is: the more clustered the nodes are, the faster the system converges (Fig. 8). Secondly, the cluster convergence time decreases dramatically in every case (Fig. 9).

Fig. 8. The performance of the scanning extension on top of the baseline algorithm for different IP address distributions. The blue plot (uniform sampling) is considered the base of the comparison and the slowdown ratio of the others compared to the base has been plotted.

It is important to note that, as the address space starts to get saturated, the extension has less and less effect. It should also be noted that on the above mentioned charts (Fig. 8, 9) the extended algorithm probes twice as many peers as the baseline until it finds its neighbors. This minor presentation flaw however has no consequence, because the speedup is on the order of magnitudes. Besides, the scanner on most of the nodes terminates after a couple of cycles. Finally, analysing the effect of the scanning extension on the reactability to churn: when nodes leave, the algorithm performs as previously (relying only on the heartbeat mechanism). On the other hand, when new nodes join the system, the cluster converges much quicker due to the joining nodes themselves scanning locally (Fig. 10).

Fig. 9. A study of the different algorithm extensions and their combinations on clustered address distributions. The Ex label marks the scanning extension.
Fig. 10. Time it takes the system to fully converge after a churn event for the final combination of extensions: split-view random probing with local scanning. Due to space limitations the churn reactability of the other combinations has not been presented, but they match the above results.

V. CONCLUSION

In this work the baseline random address probing protocol was presented, emphasizing its drawbacks with regard to the convergence time in cloud systems and the generated network traffic. Afterwards, solutions were presented both for the reduction of the network load, by maintaining only partial knowledge about the world, and for the reduction of the convergence time, by taking advantage of the clustering aspect of clouds. Very promising results were obtained, highlighting that it is worth optimizing for the special circumstances that may appear in long running cloud applications.

A possible future extension would be the removal of the hard coded partial view size and threshold values, and instead having ones with dynamically changing dimensions based on the observed/estimated network size. The heartbeat mechanism could also be removed in favor of a self-healing method in a similar fashion to the Newscast protocol [14], but a thorough analysis would be needed to quantify the gains vs. drawbacks. Further analysis could be done on the local scanner extension too, checking whether it is worth finding multiple neighbors, whether running the algorithm periodically would have any noticeable reactability gains, and finally whether the introduction of start-up and maintenance modes would have any benefits.

ACKNOWLEDGMENT

This research is supported by the project POSDRU/88/1.5/S/60185 "Innovative doctoral studies in a Knowledge Based Society" PhD scholarship, co-financed by the SECTORAL OPERATIONAL PROGRAM FOR HUMAN RESOURCES DEVELOPMENT 2007-2013, Babeş-Bolyai University, Cluj-Napoca, Romania.

REFERENCES

[1] The Napster Protocol Specifications, 2000. [Online]. Available: http://opennap.sourceforge.net/napster.txt
[2] The Napster Protocol Specifications - Revised, 2001. [Online]. Available: http://cleannap.sourceforge.net/napster.txt
[3] B. Cohen, "Incentives Build Robustness in BitTorrent," 2003. [Online]. Available: http://www2.sims.berkeley.edu/research/conferences/p2pecon/papers/s4-cohen.pdf
[4] B. Cohen, "The BitTorrent Protocol Specification," 2008. [Online]. Available: http://www.bittorrent.org/beps/bep_0003.html
[5] S. A. Baset and H. G. Schulzrinne, "An Analysis of the Skype Peer-to-Peer Internet Telephony Protocol," in 25th IEEE International Conference on Computer Communications. Barcelona, Spain: IEEE, 2006, pp. 1-11. [Online]. Available: http://arxiv.org/abs/cs/0412017
[6] UPnP Forum. [Online]. Available: http://www.upnp.org
[7] Zero Configuration Networking. [Online]. Available: http://www.zeroconf.org
[8] S. Cheshire and M. Krochmal, "DNS-Based Service Discovery," 2011. [Online]. Available: http://files.dns-sd.org/draft-cheshire-dnsext-dns-sd.txt
[9] S. Cheshire and M. Krochmal, "Multicast DNS," 2011. [Online]. Available: http://tools.ietf.org/html/draft-cheshire-dnsext-multicastdns-15
[10] M. Conrad and H.-J. Hof, "A generic, self-organizing, and distributed bootstrap service for peer-to-peer networks," Self-Organizing Systems, vol. 4725, pp. 59-72, 2007. [Online]. Available: http://www.springerlink.com/index/w5777632635647w3.pdf
[11] C. G. Dickey and C. Grothoff, "Bootstrapping of peer-to-peer networks," in The 2008 International Symposium on Applications and the Internet (SAINT 2008). Turku, Finland: IEEE, 2008, pp. 205-208. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4604572
[12] R. Bless, O. P. Waldhorst, C. Mayer, and H. Wippel, "Decentralized and Autonomous Bootstrapping for IPv6-based Peer-to-Peer Networks," in 2nd German IPv6 Summit, 2009, pp. 6-8. [Online]. Available: http://doc.tm.uka.de/2009/ipv6-contest-p2p-bootstrap.pdf
[13] Amazon, Virtual Private Cloud. [Online]. Available: http://aws.amazon.com/vpc
[14] M. Jelasity, W. Kowalczyk, and M. van Steen, "Newscast Computing," Vrije Universiteit Amsterdam, Amsterdam, The Netherlands, Tech. Rep. IR-CS-006, Nov. 2003. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.62.3761&rep=rep1&type=pdf
