
Competitive Coevolutionary Learning of Fuzzy Systems for Job Exchange in Computational Grids

Alexander Fölling, Christian Grimme, Joachim Lepping, Alexander Papaspyrou, Uwe Schwiegelshohn
alexander.foelling@udo.edu, christian.grimme@udo.edu, joachim.lepping@udo.edu, alexander.papaspyrou@udo.edu, uwe.schwiegelshohn@udo.edu

Robotics Research Institute, Section Information Technology, Technische Universität Dortmund, D-44227 Dortmund, Germany

Abstract In our work, we address the problem of workload distribution within a computational grid. In this scenario, users submit jobs to local high performance computing (HPC) systems which are, in turn, interconnected such that the exchange of jobs to other sites becomes possible. Providers are able to avoid local execution of jobs by offering them to other HPC sites. In our implementation, this distribution decision is made by a fuzzy system controller whose parameters can be adjusted to establish different exchange behaviors. In such a system, it is essential that HPC sites can only benefit if the workload is equitably (not necessarily equally) portioned among all participants. However, each site egoistically strives only for the minimization of its own jobs' response times, regularly at the expense of other sites. This scenario is particularly suited for the application of a competitive coevolutionary algorithm: the fuzzy systems of the participating HPC sites are modeled as species that evolve in different populations while having to compete within the commonly shared ecosystem. Using real workload traces and grid setups, we show that opportunistic cooperation leads to significant improvements for each HPC site as well as for the overall system.

Keywords Coevolutionary algorithms, genetic fuzzy systems, grid scheduling.

1 Introduction

About 10 years ago, users typically executed their computing jobs at the systems of their local computing center. Unfortunately, resource demand within a small institution is rarely equally distributed but shows diurnal, weekly, and also aperiodic patterns. This led to frequent imbalances between demand and availability of resources as the number of resources only changes after large time intervals when an old machine is retired or a new system is installed. With the availability of powerful networks, it has

© 2009 by the Massachusetts Institute of Technology

Evolutionary Computation 17(4): 545–560

A. Fölling, C. Grimme, J. Lepping, A. Papaspyrou, and U. Schwiegelshohn

become possible to also execute computer jobs at remote sites. Ideally, this process is transparent for the user, that is, an automatic scheduling system handles the allocation of jobs to resources. This concept is called grid computing. A very simple first form of grid computing is a collaboration of different high performance computer (HPC) centers that provide resources to each other in case of high demand at one center. As the user still submits jobs to the local center, he or she is not aware of a remote execution. Only the scheduler of the local computing system will try to push some jobs to the systems of collaborating centers while in turn it will accept jobs from these centers in the case of locally available resources. Thus, the migration and acceptance policies realized by each collaborating HPC center are crucial parts for the beneficial operation of grids.

Unfortunately, due to the online characteristics of job submission and frequent unavailability of system state information, the realization of job exchange is rather difficult. Remember that an HPC center will only join a grid if its participation is beneficial for its own local user community that directly or indirectly provides the funds for the center. However, previous studies (see Grimme et al., 2008) have shown that if all participants behave cooperatively, a significant improvement of common scheduling objectives for all participating sites can be expected in contrast to mere local execution.

As our grid problem is a competitive give-and-take market scenario, it is not easy to obtain suitable parameters for the participating local schedulers: Each site has the egoistic motivation to execute locally submitted jobs as fast as possible in order to deliver good service to its local customers. But a fully egoistic distribution policy (frequently migrating local jobs, but never accepting remote jobs) leads to overloading other sites. In the long run, this behavior deteriorates the opportunity to utilize other sites for the delegation of jobs from other participants. Eventually, this will result in congestion and poor response times for the egoistic site as well. As already mentioned, it is more rewarding for a participant to accept remote jobs for execution, as this improves the overall grid performance and consequently its own response time, too. Hence, although all sites egoistically strive only for short job response times, they must learn that a relaxation of egoism leads to a higher benefit for all users, including their own.

This problem shows many analogies to the concept of competing species within a shared ecosystem. Thus, this problem suggests itself for the application of coevolutionary algorithms in the design of job exchange policies in such decentralized grid environments. These algorithms assume multiple populations, each driven by cooperation or competition, that pursue different goals within a common ecosystem. The concept is based upon the biological paradigm of evolution induced by species interplay. Akin to the biological scheme, the exchange policies of computing centers in a grid can be modeled as populations or species that compete in a common (grid) environment.

In our grid scenario, we need two scheduler layers. The first layer decides whether to locally execute a local or a remote job while the second layer produces the actual local schedule. In this work, we only address the first layer as the second layer has already been the subject of various previous research. This first layer, which is also called the grid scheduler, is in this work realized by a fuzzy based approach where the controller actions depend on the current system state. These states are modeled by fuzzy sets that are represented by simple membership functions. Such fuzzy system based scheduling techniques have been successfully applied to online scheduling problems before (see, e.g., Franke et al., 2008). They outperform most static scheduling heuristics due to their ability to flexibly adapt decisions to changing environments. As they have proven to be


a reliable concept to tackle challenging online scheduling problems, we also use fuzzy controlled scheduling to design our job exchange policies. Furthermore, the easy parameterization of fuzzy membership functions allows a simple and efficient encoding of the whole controller. This makes the approach perfectly suitable for optimization by an evolutionary algorithm. Such a combination of fuzzy systems and evolutionary algorithms is commonly denoted as Genetic Fuzzy Systems (see Cordón et al., 2001).

The remainder of this paper is structured as follows: Section 2 gives an overview on coevolutionary algorithms and fuzzy systems. Job scheduling notation and the problem description are presented in Section 3, and Section 4 describes the applied competitive coevolutionary mechanisms. Next, Section 5 and Section 6 explain our fuzzy controller and encoding concept. Finally, Section 7 shows the experimental setup and Section 8 presents the results from the experiments.

2 Background

In this section, we discuss the main distinct facets of coevolutionary algorithms and provide a basic description of fuzzy systems.

2.1 Coevolutionary Algorithms

Coevolutionary algorithms (CAs) rely on the observation that, in nature, the dynamic interplay of different species results in mutual pressure for each other's development. Species strategically interact and undergo mutual adaptation. This process is based upon a reward structure that guides evolution toward the development of increasingly adaptive behaviors. Obviously, CAs are particularly suited to search for optimal species behaviors in domains of strategic interaction: Due to its interactive learning capabilities, the CA is able to change the species' behaviors to find good solution approximations. CAs differ from conventional evolutionary algorithms mainly in their evaluation process: In CAs, an individual can only be evaluated by having it interact with other evolving individuals and the interaction partners are usually members of different populations. Therefore, CAs commonly involve more than one evolving population.¹ Among the various possible interaction models, the most conventional patterns feature either the interaction of each member from each population with all other individuals, or random pairings.

Technically, the modularization approach in CAs is motivated from the finding that many real-world problems are too difficult to optimize when being represented as a monolithic instance, but can be decomposed into a collection of subproblems with reduced complexity. The achieved subproblem solutions can then be combined to a solution for the original problem (see, e.g., Jansen and Wiegand, 2003).

In the context of CAs, two types of interaction can be distinguished (see Paredis, 2000). Cooperative coevolution, introduced by Potter and Jong (2000), solves a complex problem by coevolving a set of solutions to decomposed subproblems. Each subproblem is represented by a certain population of species that coevolve all together to cooperatively solve the original problem. Such interaction (sometimes called symbiosis) describes the situation where two or more species coexist, and each one benefits from the evolutionary progress of the others. Contrary to the symbiotic concept, in competitive coevolutionary

¹ Multiple populations can also be used in ordinary evolution; these are then called island models or species niching.


algorithms (CCAs), individual fitness values are determined through competition with other individuals in the population. The evolutionary progress of one species increases the selection pressure on the other species. Thus, an absolute fitness measure is not known to the different evolving species. In other words, the algorithm is parallel and distributed and can be seen as an interaction process where each species strives for its own local goal, and the global system behavior is the result of all interactions. Due to the competitive nature of the algorithm, an increased fitness in one solution often leads to a decreased fitness for another. Ideally, competing solutions will continually displace one another, leading to increasingly better solutions in the long run. In the literature, this concept is often described as an arms race where populations push each other to improved fitness values (see Rosin and Belew, 1997). Further, the idea of competing solutions assigned to parallel evolving populations of species shows a strong similarity to predator-prey relationships (see, e.g., Grimme et al., 2007). In their work, complex (even multi-objective) optimization tasks are solved by emerging solutions from egoistically pursued individual goals.

However, the use of competitive coevolution to facilitate emergent behavior, particularly emerging cooperation among competing partners, remains relatively unexplored in current research. Hillis (1990) presents a model application of sorting network evolution. This coevolutionary approach consists of hosts and parasites integrated in a genetic algorithm where the hosts represent sorting strategies and the parasites represent test cases in the form of sequences of numbers to be sorted. Further, Olsson (2001) investigates models based on competitive species that turned out to be very effective in preserving genetic diversity. This concept usually produces better final solutions than non-coevolutionary approaches.

Similar to the problem tackled in this paper, Delgado et al. (2004) use a coevolutionary genetic approach to design Takagi-Sugeno-Kang (TSK) fuzzy systems. They support hierarchical, collaborative relations between individuals representing different parameters of the fuzzy systems. The CA defines species to represent partial solutions of fuzzy modeling problems organized into several hierarchical levels. The individuals have to compete within a shared environment to measure the performance of each single individual. Constraints are observed and particular targets are defined throughout the hierarchical levels. They favor rule compactness, rule base consistency, and partition set visibility.

2.2 Fuzzy Systems

The concept of fuzzy systems allows researchers to represent and process information such that uncertainty and imprecision can be efficiently modeled. To this end, a level of confidence instead of declaring decisions as simply true or false is considered, which accounts for ambiguities in situations. In fuzzy systems, this ambiguity is realized by the use of fuzzy logic, where data are processed by allowing partial set membership rather than crisp set membership or nonmembership. The advantage of representing a knowledge base by fuzzy logic lies in the interpolative nature of fuzzy systems: They have the ability to express partial and concurrent activations of behaviors and gradual transitions between them by a set of if-then rules that encode the expert knowledge. Due to its approximate reasoning capabilities, fuzzy logic produces controllers that are robust to uncertainty and imprecision.

In this paper, we take advantage of a specialized type of fuzzy systems, namely genetic fuzzy systems. Such systems employ an evolutionary algorithm to tune different parameters of the fuzzy system. Evolutionary algorithms are ideally suited for this


purpose, as they are capable of exploring a huge or even unknown search space requiring only a minimum amount of both computational effort and external knowledge. In addition, the generic code structure and independent performance features of evolutionary algorithms are well suited to incorporate a priori knowledge in the form of linguistic variables, fuzzy membership function parameters, or the actual number of rules. A complete survey of genetic fuzzy systems is given by Cordón et al. (2001). The common use case for genetic fuzzy systems applies when neither expert knowledge nor training data are available, or the data cannot be transformed directly into corresponding rules.

3 Problem Description

In this section, we review the problem space of workload allocation in high performance computing (HPC) systems. First, we introduce the notation for the basic job scheduling problem on traditional HPC systems and then discuss the decentralized grid scenario and the corresponding job exchange problem.

3.1 Job Scheduling on HPC Systems

We model each HPC system, further addressed as a site, by a set of $m_k$ identical nodes. Therefore, a job can be allocated on any subset of these nodes if the size of the subset matches the parallelism of the job. Multi-site computation, that is, splitting jobs over different HPC systems, is not supported, due to technical issues such as the protection of systems by firewalls. Moreover, we assume that all sites only differ in the number of available nodes as the differences in execution speeds can be neglected.

HPC scheduling is an online problem as jobs are submitted over time and information about future jobs is not available in advance. Moreover, the precise processing time of a job is usually unknown in advance. We assume rigid parallel batch jobs for our analysis, where every job $j$ requires concurrent and exclusive access to $m_j \le m_k$ nodes with $m_k$ being the total number of nodes on site $k$. The number of required nodes $m_j$ is fixed at the release date $r_j$ of each job $j$ and does not change during the execution. Further, our jobs must run to completion as most current real HPC systems do not allow preemption (i.e., suspending and restarting of jobs during their runtime). The completion time of job $j$ within the schedule $S$ is then denoted by $C_j(S)$.

Furthermore, precedence constraints between jobs are rare in real systems and seldom described by the user, and it is almost impossible to automatically detect them if they exist. Although there exist several infrastructural approaches to handle complex dependencies between jobs, the problem domain of planning algorithms is still largely unexplored, usually resorting to traditional heuristics such as HEFT (see Topcuoglu et al., 2002), which is not designed for grid environments. Hence, we assume that all jobs are independent.

3.2 Job Exchange in Decentral Grid Environments

In this work, the HPC sites establish a grid environment. Usually, an institute or company uses its HPC installation in order to conduct its own computationally demanding experiments, simulations, or calculations. The necessity to join a grid is only given if the locally available resources are insufficient to satisfy a required quality of service for the users. In this case, the grid participant hopes for the provision of additional resources through the grid. In return, he or she may be able to provide resources to other grid participants during phases of low utilization.


Figure 1: Schematic depiction of the decentralized grid environment. Jobs are offered to each site's grid scheduler either by local users or remote grid schedulers.

Each participant usually has a strong interest in the autonomy of its HPC installation and in discreetness regarding information. Thus, a central entity will usually not be allowed to take over management processes in the participant's domain (that is, interfering with the local resource management) or to monitor internal states. The scheduling strategy for local resources as well as the information policy therefore remains under the site administrator's control. As a consequence, we do not consider a centralized grid architecture with a single entity, but assume a fully decentralized environment under restricted information policies. In such environments, each site autonomously decides on acceptance and refusal of jobs and exchanges them in a peer-to-peer fashion.

Figure 1 schematically depicts the assumed architecture: each site provides a decision making interface which serves as a submission component for jobs from local users and remote sites. The grid scheduling layer decides on the acceptance or rejection of jobs. If a job has been accepted, it is inserted into the local job queue and executed using a local scheduling strategy. Note that scheduling on the local resource layer is not within the scope of this paper. Therefore, we utilize the simple first-come-first-serve (FCFS) heuristic (see also Section 7.1) which, due to its good average performance, is implemented in many current installations.

Apart from accepting local and remote job submissions, the grid scheduling layer is also able to offer locally submitted jobs to remote HPC sites. In case a remote site accepts the job, it is migrated and executed there. However, if the job is rejected by all remote sites, it must be executed on its originating site in order to prevent jobs from bouncing and waiting forever. Once a job has been delegated to a site's local resource management system, it cannot be migrated again. This realizes the strict separation of local and grid scheduling layer.

Although HPC sites predominantly join a grid due to egoistic interests, they are forced to cooperate with other participants in order to maximize their own benefit: a fully egoistic behavior (that is, migrating jobs but never accepting any jobs from other sites) will overload other participating sites and finally lead to a similar behavior by the other participants, that is, there is no grid anymore.
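The exchange rules of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Site` class, the `accepts` callback (a stand-in for the grid scheduler's decision), and the order in which remote sites are asked are hypothetical and not specified in the text. The sketch assumes that the originating site first decides on its own job and only then offers it to the remote sites.

```python
from typing import Callable, List


class Site:
    """Hypothetical grid site with a local queue and an acceptance policy."""

    def __init__(self, name: str, accepts: Callable[[dict], bool]):
        self.name = name
        self.accepts = accepts              # stand-in for the grid scheduler decision
        self.local_queue: List[dict] = []

    def enqueue(self, job: dict) -> None:
        # Once a job is in the local queue, it is never migrated again.
        self.local_queue.append(job)


def submit_local_job(origin: Site, remotes: List[Site], job: dict) -> Site:
    """Place a locally submitted job and return the site that will execute it."""
    if origin.accepts(job):                 # assumption: the origin decides first
        origin.enqueue(job)
        return origin
    for remote in remotes:                  # offer the job to every remote site once
        if remote.accepts(job):
            remote.enqueue(job)
            return remote
    origin.enqueue(job)                     # rejected everywhere: execute at the origin
    return origin
```

The final fallback mirrors the rule that a job rejected by all remote sites must run on its originating site, so no job can bounce between sites indefinitely.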

4 Customization of the Competitive Coevolutionary Algorithm

Striving for one's own goals on the one hand and cooperating for workload balancing on the other hand leads to the idea of coevolutionary adaptation of the decision making


process. This can be done by representing the decision makers' strategies as competing species whose interaction may lead to an overall emergent and beneficial behavior in the whole grid environment. In this section, we describe the competitive coevolutionary algorithm (CCA) architecture we used in a top-down fashion, starting from the general architecture and environment and ending with the encoding scheme for individuals.

4.1 Architecture of the Algorithm

In order to optimize the addressed controller design problem with a CCA, an appropriate decomposition into a set of subcomponents is required. Since the assumed grid environment naturally consists of distributed, independent entities, we represent each participating HPC site (initially regardless of the internal encoding of individuals) by a unique species within the CCA, and let them evolve as competing entities. This corresponds to the competitive nature of grid sites that strive to achieve optimal fitness values for their own users, see Section 3.2. Further, each species belongs to its own population such that mating and reproduction among different species is impossible. Different species populations then inhabit a common ecosystem: the grid. Within this shared environment, the actual competition takes place if job exchange between grid participants is established. Thus, each individual's fitness is determined by its interaction with representatives of other species in the ecosystem.

4.2 Competitive Coevolutionary Learning Approach

We use equally sized populations for each species consisting of $\mu$ parental individuals. As in conventional evolution strategies, variation operators are used to generate $\lambda$ children from those parents. We follow the plus strategy concept where good parents are always conserved for the next generation. The fitness assignment and corresponding selection process is the crucial part of the CCA. The fitness ranking is performed for each species separately, and, as usual, the $\mu$ best individuals are selected as the parents that are allowed to reproduce the children.

To determine each individual's fitness, the different species compete within the shared environment. To this end, grid environments with different fuzzy rule systems for job exchange are formed and simulated on given input data. In order to produce a high competition between the different populations, we deterministically choose pairings with respect to their fitness ranks. In detail, the best individuals of each species compete in one grid setup, while the second best are evaluated in another setup, and so forth, as shown in Figure 2. With this procedure it is assured that the fittest individuals of each species have to compete with the best of all other species, which results in fair competitions.

In a second evaluation step, each grid setup is ranked with the aggregated fitness of all participating individuals. As a result, the best performing combination of individuals in the current generation is stored. This conserves the best interacting rule bases until a better set is found. The fitness evaluations on both levels consider a specific waiting time as a performance indicator that is detailed in Section 7.4.

A key issue in our approach lies in the handling of selfishness: High ranked individuals of each species might achieve good performance at the expense of other species. However, when such egoistic individuals are then combined during the next evaluation


Figure 2: Fitness evaluation concept with pairing strategy for different competing species.

process, their insistence on egoistic behavior will worsen their fitness drastically. As a consequence, we assume that, in the long run, only cooperative individuals will survive in each species. Eventually, we get the best interacting rule bases that evolved in our competitive environment.

In the next section, we explain the fuzzy set based controller concept and detail the choice of fuzzy membership functions. This is closely connected to the representation of the controllers as individuals. This leads to an encoding scheme of the controller for an evolutionary algorithm.
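The rank-based pairing and the plus selection described in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, fitness is assumed to be a value to minimize (such as a response-time-like measure), and the ranks are assumed to come from each individual's most recent evaluation.

```python
from typing import Callable, List, Sequence, Tuple

Individual = List[float]


def pair_by_rank(populations: Sequence[Sequence[Individual]],
                 fitness: Sequence[Sequence[float]]) -> List[Tuple[Individual, ...]]:
    """Put the k-th best individual of every species into the k-th grid setup.

    populations[s][i] is individual i of species s and fitness[s][i] its
    fitness; lower values are assumed to be better.
    """
    ranked = []
    for pop, fit in zip(populations, fitness):
        order = sorted(range(len(pop)), key=lambda i: fit[i])
        ranked.append([pop[i] for i in order])
    # zip(*ranked) groups equal ranks across all species.
    return list(zip(*ranked))


def plus_selection(parents: Sequence[Individual],
                   children: Sequence[Individual],
                   fitness_of: Callable[[Individual], float],
                   mu: int) -> List[Individual]:
    """Plus strategy: the mu best of parents and children survive."""
    pool = list(parents) + list(children)
    return sorted(pool, key=fitness_of)[:mu]
```

Each tuple returned by `pair_by_rank` corresponds to one simulated grid setup, so an egoistic top-ranked individual is always evaluated against the top-ranked individuals of all other species.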

5 Fuzzy Controller Concept

For the representation of a controller at each site, we apply the method of fuzzy inference proposed by Takagi and Sugeno (1985), which is known as the Takagi-Sugeno-Kang (TSK) model in fuzzy systems literature. In our approach, each individual represents one fuzzy controller, which is based on a set of rules. Each specific rule determines a system state in which decisions about the acceptance or refusal of jobs must be made. These system states are described by certain features, for example, the current system utilization or the submitted job's degree of parallelism. Within a fuzzy rule, the features form the conditional part, that is, they describe the state in which a certain rule may or may not apply, while the consequence part holds the decision whether to consent to or decline the migration of a job.


5.1 Controller Architecture

The general TSK model consists of $N_r$ IF-THEN rules $R_i$ such that

$$R_i := \text{IF } x_1 \text{ is } g_i^{(1)} \text{ and } x_2 \text{ is } g_i^{(2)} \text{ and } \ldots \text{ and } x_{N_f} \text{ is } g_i^{(N_f)} \text{ THEN } y_i = b_{i0} + b_{i1} x_1 + \ldots + b_{iN_f} x_{N_f} \tag{1}$$

where $x_1, x_2, \ldots, x_{N_f}$ are input variables and elements of a vector $x$, and $y_i$ are local output variables. Further, $g_i^{(h)}$ is the $h$-th input fuzzy set that describes the membership for a feature $h$. The degree of membership draws a functional relation between each point in the feature space and the interval $[0, 1]$. It is realized as a function that maps each feature value into this interval, where the value 1 represents the highest membership and the value 0 represents nonmembership, respectively. For these functions, trapezoidal, triangular, or Gaussian-like shapes are conceivable.

In the above described general model, $b_{ih}$ are real-valued parameters that specify the local output variable $y_i$ as a linear combination of the input variables $x$. Each rule's recommendation is weighted by its degree of membership $\phi_i(x)$ with respect to the input vector $x$. The corresponding output value $y_D(x)$ of the TSK system is then computed by the weighted average output recommendation over all rules, see Equation (2).

$$y_D(x) = \frac{\sum_{i=1}^{N_r} \phi_i(x)\, y_i}{\sum_{i=1}^{N_r} \phi_i(x)} = \frac{\sum_{i=1}^{N_r} \phi_i(x) \left( b_{i0} + b_{i1} x_1 + \ldots + b_{iN_f} x_{N_f} \right)}{\sum_{i=1}^{N_r} \phi_i(x)} \tag{2}$$

Therefore, the aforementioned degree of membership is computed for a single rule $R_i$ as the superposition of all function values $g_i^{(h)}(x_h)$ for all $h$ for given $x_h$. According to the general model, the multiplicative superposition of all these values as an AND operation leads to an overall degree of membership $\phi_i(x)$ for rule $R_i$, see Equation (3).

$$\phi_i(x) = g_i^{(1)}(x_1) \cdot g_i^{(2)}(x_2) \cdot \ldots \cdot g_i^{(N_f)}(x_{N_f}) = \prod_{h=1}^{N_f} g_i^{(h)}(x_h) \tag{3}$$

For our scenario, however, we can simplify this general model as follows: Depending on the current system state, the grid scheduler only deals with binary decisions, that is, to accept or to decline an offered job. Consequently, the corresponding output part need not be represented as a linear combination of the input variables. Thus, we describe the consent to migration by an output value of $1$ and the decline to migration by $-1$. Then, all weights except $b_{i0}$ in Equation (2) are set to zero and the TSK model output becomes $y_i = b_{i0}$. Therefore, we define the output of rule $R_i$ as

$$y_i = \begin{cases} 1, & \text{if the job is accepted} \\ -1, & \text{otherwise} \end{cases} \tag{4}$$

With this definition, the final controller output $Y_D$ can be computed by considering the leading sign only:

$$Y_D = \operatorname{sgn}(y_D(x)) \tag{5}$$


Here, a positive number again represents the acceptance of the job and a negative value the decline. Note that, by definition, the value zero corresponds to a decline as well.
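The simplified controller can be illustrated with a short Python sketch. It is written for arbitrary membership functions, so it does not depend on a particular shape; the names `rule_activation` and `controller_output` are hypothetical. One detail is an assumption beyond the text: if no rule is active at all (the sum of activations is zero), the weighted average is undefined, and the sketch then declines the job.

```python
from math import prod
from typing import Callable, Sequence, Tuple

Membership = Callable[[float], float]
FuzzyRule = Tuple[Sequence[Membership], int]   # (membership functions, y_i in {-1, +1})


def rule_activation(memberships: Sequence[Membership], x: Sequence[float]) -> float:
    """Equation (3): product of the per-feature membership degrees."""
    return prod(g(x_h) for g, x_h in zip(memberships, x))


def controller_output(rules: Sequence[FuzzyRule], x: Sequence[float]) -> int:
    """Equations (2), (4), (5): sign of the activation-weighted mean of y_i.

    Returns +1 (accept) or -1 (decline); a value of zero counts as decline.
    """
    weights = [rule_activation(memberships, x) for memberships, _ in rules]
    total = sum(weights)
    if total == 0.0:
        return -1                               # assumption: no active rule means decline
    y_d = sum(w * y for w, (_, y) in zip(weights, rules)) / total
    return 1 if y_d > 0 else -1
```

For instance, with one rule that accepts at low load and one that declines at high load, the controller accepts while the first rule dominates and declines once the activations are equal or the second rule dominates.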

6 Encoding Scheme Based on Gaussian Membership Functions

Although the TSK model allows describing fuzzy sets by any sort of membership function, we chose to use Gaussian membership functions (GMFs) to encode schemes for rules and entire rule bases. Because of their smoothness and concise notation, the GMFs have become very popular for specifying fuzzy sets. They have the advantage of requiring only two parameters (mean and variance) to specify their shape, which makes them particularly suited to represent a rule base with a minimum number of parameters.

In detail, every feature $h$ of all $N_f$ features is modeled by a $(\gamma_i^{(h)}, \sigma_i^{(h)})$ GMF² with no normalization:

$$g_i^{(h)}(x) = \exp\left( -\frac{\left(x - \gamma_i^{(h)}\right)^2}{\sigma_i^{(h)\,2}} \right) \tag{6}$$

This function is completely described by defining the $\gamma_i^{(h)}$ and $\sigma_i^{(h)}$ values. The $\gamma_i^{(h)}$ value adjusts the center of the feature value, while $\sigma_i^{(h)}$ models the region of influence for this rule in the feature domain. In other words, for increasing $\sigma_i^{(h)}$ values, the GMF becomes wider, while the peak value remains constant at one. Using this property of a GMF, we are able to steer the influence of a rule for a certain feature by increasing or decreasing $\sigma_i^{(h)}$.

Using this GMF as a membership function, a feature can be coded as a pair of real values $\gamma_i^{(h)}$ and $\sigma_i^{(h)}$ following the approach of Juang et al. (2000). For the consequence part, we add a binary number to the encoded rule, as $y_i \in \{-1, 1\}$. This scheme allows the encoding of a single rule by a string of $2 \cdot N_f$ real-valued variables and one integer variable (see Figure 3). The whole rule base is encoded by concatenation of single rules. A rule base consisting of $N_r$ rules is therefore entirely described by a set of

$$l = N_r \cdot (2 \cdot N_f + 1) \tag{7}$$

parameters. Obviously, this encoding scheme is perfectly suited as individual representation within an evolutionary algorithm where individuals have length $l$.
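To make the encoding concrete, the following Python sketch decodes a flat individual into rules and evaluates the unnormalized Gaussian membership function of Equation (6). The gene order within a rule (center and width interleaved per feature, followed by the output) is an assumption, since the layout of Figure 3 is not reproduced here; the names `Rule`, `gmf`, `decode`, `gammas`, and `sigmas` are likewise chosen for illustration only.

```python
import math
from typing import List, NamedTuple, Sequence


class Rule(NamedTuple):
    gammas: List[float]   # GMF centers, one per feature
    sigmas: List[float]   # GMF widths, one per feature
    output: int           # consequence: +1 (accept) or -1 (decline)


def gmf(x: float, gamma: float, sigma: float) -> float:
    """Unnormalized Gaussian membership function with peak value one."""
    return math.exp(-((x - gamma) ** 2) / sigma ** 2)


def decode(genome: Sequence[float], n_features: int) -> List[Rule]:
    """Split a flat genome into rules of 2 * n_features + 1 genes each."""
    stride = 2 * n_features + 1
    if len(genome) % stride != 0:
        raise ValueError("genome length must be a multiple of 2 * n_features + 1")
    rules = []
    for start in range(0, len(genome), stride):
        genes = genome[start:start + stride]
        pairs = genes[:-1]
        rules.append(Rule(gammas=list(pairs[0::2]),    # assumed layout: gamma, sigma, ...
                          sigmas=list(pairs[1::2]),
                          output=1 if genes[-1] > 0 else -1))
    return rules
```

With two features, each rule occupies five genes, so a genome of ten genes decodes into two rules, in line with Equation (7).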

7 Experimental Setup

For the evaluation of our approach, we conducted simulations in order to analyze the performance of a CCA-learned genetic fuzzy system. To this end, we considered different setups of computational grids, each based on real-world installations and using workload trace recordings from the original HPC systems.

² In contrast to the common notation, we denote the mean of the GMF by $\gamma$ to avoid conflicts with the parental population size of evolution strategies, which we denote by $\mu$.


Figure 3: Rule encoding concept for single rules and a whole rule base constructed by concatenation of basic rules.

Table 1: Details of the used workloads from the Parallel Workload Archive and results for the average weighted response time (AWRT) performance metric (see Section 7.4) in seconds. Single site execution and FCFS as local scheduling strategy have been applied.

Identifier   Number of Jobs   Duration (months)   m_k     Original AWRT (s)
KTH-5        11,780           5                   100     488,387.49
KTH-6        16,699           6                   100     99,236.27
SDSC00-5     13,494           5                   128     67,717.23
SDSC00-6     16,316           6                   128     413,957.04
CTC-5        35,360           5                   430     57,897.77
CTC-6        41,839           6                   430     59,118.15
SDSC03-5     32,606           5                   1,152   83,718.68
SDSC03-6     32,978           6                   1,152   62,825.76
SDSC05-5     28,184           5                   1,664   56,925.10
SDSC05-6     46,719           6                   1,664   77,463.52

7.1 Input Data

The Parallel Workloads Archive³ provides job submission and execution traces recorded on real-world HPC sites, each of which contains information on relevant job characteristics. We selected five well-known workloads for our evaluation (see Table 1). In order to be able to analyze different grid scenarios, we shortened and split the workloads to sets of five and six months. We used the five month sets as training sequences for the genetic fuzzy systems, and validated their performance on the basis of the six month sets. For reference, we simulated the workloads on their original machines without any job exchange, using FCFS for the local scheduling system. This simple dispatch heuristic

³ See http://www.cs.huji.ac.il/labs/parallel/workload/.

Evolutionary Computation

Volume 17, Number 4

555

A. Folling, C. Grimme, J. Lepping, A. Papaspyrou, and U. Schwiegelshohn

starts the rst job of the waiting queue as soon as enough idle resources are available (see Schwiegelshohn and Yahyapour, 2000). The resultsamong other relevant characteristicsare listed in Table 1. We will refer to this noncooperative case for the matter of comparison throughout the rest of this paper. 7.2 Feature Selection and System State Description

For the description of the current system state, we rely on only $N_f = 2$ different features that constitute the conditional part of a rule. In order to cover comprehensive system information with only a single feature, we consider the normalized waiting parallelism at site $k$ ($\mathrm{NWP}_k$) as the first feature, see Equation (8):

$$\mathrm{NWP}_k = \frac{1}{m_k} \sum_{j} m_j \qquad (8)$$

Here, the sum runs over all jobs $j$ that have been inserted into the waiting queue at site $k$.

This feature indicates how many processors are expected to be occupied by all submitted jobs (note that the number of required processors $m_j$ is known at release time) relative to the maximum number of available processors $m_k$ at site $k$. It reflects the efficiency of the local resource management system and indicates the expected load of the machine in the near future. The second feature focuses on the actual job that has to be decided upon. The percentage ratio of a job's parallelism $m_j$ to the maximum number of available nodes $m_k$ at site $k$ is expressed by the normalized job parallelism (NJP), see Equation (9):

$$\mathrm{NJP}_j = \frac{m_j}{m_k} \cdot 100 \qquad (9)$$
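As an illustration of Equations (8) and (9), the two features could be computed as in the following minimal sketch. The `Job` record and the function names are hypothetical and are not taken from the original simulation code; only the processor demand $m_j$ of each job and the site size $m_k$ are assumed to be available locally.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; only the processor demand m_j is needed here."""
    processors: int  # m_j, known at release time


def normalized_waiting_parallelism(queue: Iterable[Job], site_processors: int) -> float:
    """Equation (8): summed processor demand of all waiting jobs divided by m_k."""
    return sum(job.processors for job in queue) / site_processors


def normalized_job_parallelism(job: Job, site_processors: int) -> float:
    """Equation (9): processor demand of a single job as a percentage of m_k."""
    return job.processors / site_processors * 100.0


if __name__ == "__main__":
    m_k = 430  # size of the CTC site, see Table 1
    waiting = [Job(64), Job(128), Job(23)]  # illustrative queue content, not trace data
    print(normalized_waiting_parallelism(waiting, m_k))
    print(normalized_job_parallelism(Job(43), m_k))
```

Since the waiting queue may request more processors than the site offers, the first value can exceed 1, which is why NWP has no upper bound.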

Using these features, we approximate every possible system state. Further, both features can be calculated by relying solely on information that is locally available, without having to exchange additional data with other sites. Note that NWP is not limited in range; however, values greater than 10 occur very rarely in practice.

7.3 CCA and Genetic Fuzzy System Configuration

We generate our genetic fuzzy systems with a fixed number of $N_r = 10$ rules, since previous studies by Franke et al. (2008) revealed that rule bases consisting of 5 to 10 rules yield good results. Because the whole rule base is encoded in one individual of length $l$, we optimize a problem with $l = N_r \cdot (N_f \cdot 2 + 1) = 10 \cdot (2 \cdot 2 + 1) = 50$ parameters. The tuning process is conducted using a $(\mu + \lambda)$-evolution strategy with a parent population of $\mu = 13$ individuals, which results in a children population of $\lambda = 91$ individuals, obeying Schwefel's $\mu/\lambda = 1/7$ ratio (see Schwefel, 1995). For the mutation, we distinguish two cases: The real-coded conditional part of the rule is mutated using a normal distribution with a step size of 0.01 for NWP and 0.1 for NJP. This stems from the observation that the two features vary in the expected value range by a ratio of 1:10 (see Section 7.2). The binary-coded consequence part of the rule is mutated by random flips from 1 to $-1$ and vice versa. Further, we apply discrete recombination with a probability of 0.1.
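The mutation scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the gene order within a rule (two GMF parameters for NWP, two for NJP, then the consequence), the flip probability, and the reading of "step size" as the standard deviation of the normal distribution are all assumptions, and recombination and selection are omitted.

```python
import random
from typing import List

N_RULES = 10                         # N_r
N_FEATURES = 2                       # N_f: NWP and NJP
GENES_PER_RULE = N_FEATURES * 2 + 1  # two GMF parameters per feature plus one consequence
MU, LAMBDA = 13, 91                  # parent and offspring population sizes (ratio 1/7)

# Step sizes per real-coded gene of one rule: NWP parameters, then NJP parameters.
# Assumed gene order; the text only gives the step size per feature.
STEP_SIZES = (0.01, 0.01, 0.1, 0.1)
FLIP_PROBABILITY = 0.1               # assumption: the flip rate is not stated in the text


def mutate(individual: List[float], rng: random.Random) -> List[float]:
    """Return a mutated copy of a rule base encoded as a flat list of 50 genes."""
    child = list(individual)
    for rule in range(N_RULES):
        base = rule * GENES_PER_RULE
        # Real-coded conditional part: additive Gaussian noise per gene.
        for offset, step in enumerate(STEP_SIZES):
            child[base + offset] += rng.gauss(0.0, step)
        # Binary-coded consequence part: flip between +1 and -1.
        if rng.random() < FLIP_PROBABILITY:
            child[base + len(STEP_SIZES)] = -child[base + len(STEP_SIZES)]
    return child


if __name__ == "__main__":
    rng = random.Random(42)
    parent = [1.0, 1.0, 10.0, 10.0, 1.0] * N_RULES  # illustrative values only
    offspring = [mutate(parent, rng) for _ in range(LAMBDA // MU)]
    print(len(parent), len(offspring))
```

Each of the 13 parents would produce seven offspring per generation in this reading, which yields the 91 children mentioned above.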


The population is uniformly initialized within the range of [0, 10] for the two GMF parameters of NWP and [0, 100] for those of NJP. As the fitness evaluation of an individual is time-consuming, we evaluated the whole population in parallel on a 200-node cluster with Pentium IV, 2.4 GHz processors.

7.4 Performance Metrics

For the measurement of the achieved schedule quality, we selected an objective that mainly reflects the users' point of view. The total response time is a user-oriented objective frequently applied in theoretical scheduling problems: it is the sum of the response times $C_j(S) - r_j$, that is, the span between the completion time and the release time, of all jobs in a schedule $S$. In particular, we use the average weighted response time (AWRT), computed for all jobs $j$ that have been initially submitted to site $k$, see Equation (10):

$$\mathrm{AWRT}_k = \frac{\sum_{j} p_j \cdot m_j \cdot (C_j(S) - r_j)}{\sum_{j} p_j \cdot m_j} \qquad (10)$$

Both sums run over all jobs $j$ that have been initially submitted to site $k$.

Following Schwiegelshohn and Yahyapour (2000), we use the resource consumption $(p_j \cdot m_j)$ of each job as weight to ensure that neither splitting nor combination of jobs can influence the objective function in a beneficial way. Further, the definition in Equation (10) also respects the execution on remote sites and, as such, the completion time $C_j(S)$ refers to the site that executed job $j$. For evaluation, we present the obtained AWRT values after optimization with the CCA. We compare those AWRT values for computational grid interaction to the noncooperative, exclusive single-site execution results (see Table 1). Therefore, the comparison quantifies the advantages for each HPC site of participating in the grid, using the optimized controller for job exchange.
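To make Equation (10) concrete, the following sketch computes the AWRT for a small, purely illustrative set of jobs and checks that splitting a job into two narrower jobs with the same timing leaves the metric unchanged. The `FinishedJob` record and the numbers are hypothetical and do not stem from the workload traces.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FinishedJob:
    """Hypothetical record of a completed job submitted to one site."""
    processing_time: float  # p_j
    processors: int         # m_j
    release_time: float     # r_j
    completion_time: float  # C_j(S), taken at the site that executed the job


def awrt(jobs: Sequence[FinishedJob]) -> float:
    """Equation (10): response times weighted by resource consumption p_j * m_j."""
    total_weight = sum(j.processing_time * j.processors for j in jobs)
    if total_weight == 0:
        raise ValueError("AWRT is undefined for jobs without resource consumption")
    weighted_response = sum(
        j.processing_time * j.processors * (j.completion_time - j.release_time)
        for j in jobs
    )
    return weighted_response / total_weight


if __name__ == "__main__":
    wide = FinishedJob(processing_time=100.0, processors=4, release_time=0.0, completion_time=150.0)
    small = FinishedJob(processing_time=50.0, processors=2, release_time=10.0, completion_time=110.0)
    # The same wide job split into two halves with identical timing.
    half = FinishedJob(processing_time=100.0, processors=2, release_time=0.0, completion_time=150.0)
    print(awrt([wide, small]))        # 140.0
    print(awrt([half, half, small]))  # 140.0
```

With unit weights instead, the split would shift the average toward the response time of the wide job, which is the kind of manipulation the resource-consumption weight is meant to rule out.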

Evaluation

For verification purposes, we evaluated two different grid scenarios. Our first setup consists of three sites, namely CTC with 430 processors, SDSC03 with 1,152 processors, and SDSC05 with 1,664 processors. It represents the interconnection of a fairly standard-sized university computing center with two large HPC installations. Remember that the fuzzy controllers are trained on five-month traces and compared with AWRT values achieved without any job exchange. From a practical point of view, this sequence length is sufficient for optimizing the grid setup.

The improvements in AWRT are depicted in Figure 4(a). The AWRT improves by at least 10% for all sites. Naturally, the smaller site can benefit from the large amount of resources shared with large installations. However, our controlled job exchange approach prevents an overuse of larger sites by the smaller ones. In fact, the assumed relaxation of egoism does take place, as can be inferred from the number of transferred jobs (see Table 2). In this table, rows indicate the initial submission site and columns denote the actual execution site. A large number of jobs is exchanged during the collaboration period; this indicates that all partners behave cooperatively. It is also remarkable that the relatively small CTC site accepts a large number of jobs and pushes only a few jobs to larger sites. However, the job characteristics selected by the controllers seem to best fit into each site's schedule such that the overall response times are decreased at all sites.


Figure 4: Improvements of the average weighted response time objective for a three-site grid scenario. (a) Learning on five-month traces; (b) application to new six-month traces.

Table 2: Number of Migrated Jobs in the Three-Site Grid; The Rows Denote the Submitted Site and the Columns Denote the Executed Site

Site      CTC      SDSC03   SDSC05
CTC       -        3,904    162
SDSC03    3,798    -        1,313
SDSC05    13,371   6,098    -

Figure 5: Improvements of the average weighted response time objective for a five-site grid scenario. (a) Learning on five-month traces; (b) application to new six-month traces.

Our second setup represents a larger grid composed of five sites: Here, we added two additional, medium-sized HPC systems to our first setup, namely KTH with 100 processors and SDSC00 with 128 processors. Again, clear improvements in AWRT ranging from 15% to 90% are obtained for all participating sites, see Figure 5(a).

In Figure 6, as an example, we depict the rule base of the trained SDSC03 site. This characteristic curve of the fuzzy system, which is composed of the superposition of all its GMFs, depicts the behavior of the decision maker with respect to the NJP and NWP features. The controller output is given on the z axis, where a positive value denotes job acceptance and a negative value job refusal. Further, the black spots denote the system states activated during training. The figure clearly shows that the controller contains relevant parts (i.e., black-spotted parts of the surface, which have actually been active) with a positive controller output value.
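How such a decision surface can arise from a rule base may be illustrated by the following sketch. It assumes a Takagi-Sugeno-style weighted average of the rule consequences, a product of Gaussian memberships as rule activation, and the standard Gaussian form with $2\sigma^2$ in the denominator; the inference scheme actually used is defined in the earlier sections of this paper and may differ in these details. The `Rule` type and all parameter values are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: one (mean, sigma) pair per feature and a binary consequence."""
    gmfs: Tuple[Tuple[float, float], ...]  # feature order: (NWP, NJP)
    consequence: int                       # +1 = accept the job, -1 = refuse it


def membership(x: float, mean: float, sigma: float) -> float:
    """Gaussian membership function (assumed form, sigma must be positive)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * sigma ** 2))


def controller_output(rules: Sequence[Rule], features: Sequence[float]) -> float:
    """Activation-weighted average of all rule consequences, in [-1, 1]."""
    activations = [
        math.prod(membership(x, mean, sigma) for x, (mean, sigma) in zip(features, rule.gmfs))
        for rule in rules
    ]
    total = sum(activations)
    if total == 0.0:
        return 0.0  # no rule is active for this system state
    return sum(a * rule.consequence for a, rule in zip(activations, rules)) / total


if __name__ == "__main__":
    rules = [
        Rule(gmfs=((0.5, 1.0), (10.0, 20.0)), consequence=1),   # lightly loaded, small job
        Rule(gmfs=((5.0, 1.0), (80.0, 20.0)), consequence=-1),  # heavily loaded, wide job
    ]
    for state in [(0.5, 10.0), (5.0, 80.0)]:
        output = controller_output(rules, state)
        print(state, "accept" if output > 0 else "refuse")
```

Evaluating `controller_output` on a grid of (NWP, NJP) values would yield a surface of the kind shown in Figure 6, with positive regions corresponding to job acceptance.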


Figure 6: Example depiction of the best rule base generated for the SDSC03 site.

Finally, we investigate the controllers' robustness to estimate their applicability in practice. To this end, the optimized fuzzy controllers are engaged in collaborations similar to the ones from the previously described setups. However, we use the six-month traces (see Section 7.1) for submission. These jobs have not been part of the training sequences and are thus completely unknown to the controllers. The optimized fuzzy schedulers are applied without any further adjustments. The results presented in Figure 4(b) still show significant improvements in AWRT of at least 20% in the first setup. Thus, the fuzzy controllers show a high capability to deal with another related job set. Similar results are shown in Figure 5(b) for the second setup. In all cases, we are able to show that the resulting schedules yield significantly better objective values for each participating site in the grid.

Conclusion

In the present work, we applied a competitive coevolutionary algorithm (CCA) to optimize the parameter set of a fuzzy system that steers the job exchange in decentralized computational grids with restricted information policies. Within the CCA, each grid participant is modeled as a dedicated species that evolves in a mating-restricted population. The interaction within the common ecosystem (which represents the computational grid) then yields the fitness evaluation. We show for two exemplary real-world grid setups that the optimized fuzzy systems establish job exchange policies that lead to significantly improved objective values for all user communities. We further show that the grid schedulers behave robustly with respect to fluctuations in the workload patterns and lead to objective value improvements even for unknown submission characteristics. As such, CCA-optimized genetic fuzzy systems as a basis for workload distribution and interchange in computational grids seem to represent a promising technology for future e-science infrastructures.

References
Cordón, O., Herrera, F., Hoffmann, F., and Magdalena, L. (2001). Evolutionary tuning and learning of fuzzy knowledge bases. In Genetic fuzzy systems, Vol. 19 of Advances in Fuzzy Systems - Applications and Theory. Singapore: World Scientific.

Delgado, M. R., Von Zuben, F., and Gomide, F. (2004). Coevolutionary genetic fuzzy systems: A hierarchical collaborative approach. Fuzzy Sets and Systems, 141(1):89-106.

Franke, C., Hoffmann, F., Lepping, J., and Schwiegelshohn, U. (2008). Development of scheduling strategies with genetic fuzzy systems. Applied Soft Computing, 8(1):706-721.

Grimme, C., Lepping, J., and Papaspyrou, A. (2007). Exploring the behavior of building blocks for multi-objective variation operator design using predator-prey dynamics. In D. Thierens et al. (Eds.), Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2007), pp. 805-812.

Grimme, C., Lepping, J., and Papaspyrou, A. (2008). Discovering performance bounds for grid scheduling by using evolutionary multiobjective optimization. In M. Keijzer et al. (Eds.), Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2008), pp. 1491-1498.

Hillis, W. D. (1990). Co-evolving parasites improve simulated evolution as an optimization procedure. Physica D, 42(1-3):228-234.

Jansen, T., and Wiegand, R. P. (2003). Exploring the explorative advantage of the cooperative coevolutionary (1 + 1) EA. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2003), Vol. 2723 of Lecture Notes in Computer Science (LNCS), pp. 310-321.

Juang, C.-F., Lin, J.-Y., and Lin, C.-T. (2000). Genetic reinforcement learning through symbiotic evolution for fuzzy controller design. IEEE Transactions on Systems, Man, and Cybernetics, 30(2):290-302.

Olsson, B. (2001). Co-evolutionary search in asymmetric spaces. Information Sciences - Informatics and Computer Science: An International Journal, 133(3-4):103-125.

Paredis, J. (2000). Coevolutionary algorithms. In Evolutionary computation 2: Advanced algorithms and operators (pp. 224-238). Bristol, UK: Institute of Physics Publishing.

Potter, M. A., and De Jong, K. A. (2000). Cooperative coevolution: An architecture for evolving coadapted subcomponents. Evolutionary Computation, 8(1):1-29.

Rosin, C. D., and Belew, R. K. (1997). New methods for competitive coevolution. Evolutionary Computation, 5(1):1-29.

Schwefel, H.-P. (1995). Evolution and optimum seeking. New York: John Wiley & Sons.

Schwiegelshohn, U., and Yahyapour, R. (2000). Fairness in parallel job scheduling. Journal of Scheduling, 3(5):297-320.

Takagi, T., and Sugeno, M. (1985). Fuzzy identification of systems and its applications to modeling and control. IEEE Transactions on Systems, Man, and Cybernetics, SMC-15(1):116-132.

Topcuoglu, H., Hariri, S., and Wu, M. (2002). Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):260-274.
