
7

Computational Learning Theory


The study of machine learning involves the following questions, aimed at identifying
general laws that may govern machine learners.
1. Is it possible to identify classes of learning problems that are inherently
difficult or easy, independent of the learning algorithm?
2. Can one characterize the number of training examples necessary or sufficient
to assure successful learning?
3. How is this number affected if the learner is allowed to pose queries to the
trainer, versus observing a random sample of training examples?
4. Can one characterize the number of mistakes that a learner will make before
learning the target function?
5. Can one characterize the inherent computational complexity of classes of
learning problems?
The computational theory of learning provides answers to these questions by
presenting key results within particular problem settings.
We focus here on the problem of inductively learning an unknown target
function, given only training examples of this target function and a space of candidate
hypotheses. Within this setting, we will be chiefly concerned with questions such as
1. how many training examples are sufficient to successfully learn the target
function?
2. how many mistakes will the learner make before succeeding?
We can set quantitative bounds on these measures, depending on attributes of
the learning problem such as:
 the size or complexity of the hypothesis space considered by the learner
 the accuracy to which the target concept must be approximated
 the probability that the learner will output a successful hypothesis
 the manner in which training examples are presented to the learner

The goal of computational learning is to answer questions such as:


 Sample complexity. How many training examples are needed for a learner to
converge (with high probability) to a successful hypothesis?
 Computational complexity. How much computational effort is needed for a
learner to converge (with high probability) to a successful hypothesis?
 Mistake bound. How many training examples will the learner misclassify before
converging to a successful hypothesis?

Probably Learning an Approximately Correct Hypothesis
Here, we consider a particular setting for the learning problem, called the
probably approximately correct (PAC) learning model. We begin by specifying the
problem setting that defines the PAC learning model, then consider the questions of
how many training examples and how much computation are required in order to learn
various classes of target functions within this PAC model. For the sake of simplicity, we
restrict the discussion to the case of learning Boolean-valued concepts from noise-free
training data.

The Problem Setting


Let X refer to the set of all possible instances over which target functions are to be
defined. For example, X might represent the set of all people, each described by the
attributes age (e.g., young or old) and height (short or tall).
We assume instances are generated at random from X according to some probability
distribution D. Training examples are generated by drawing an instance x at random
according to D, then presenting x along with its target value, c(x), to the learner.
For example, D might be the distribution of instances generated by observing people
who walk out of the largest sports store in Switzerland. In general, D may be any
distribution, and it will not generally be known to the learner. All that we require of D is
that it be stationary; that is, that the distribution not change over time.
Let C refer to some set of target concepts that our learner might be called upon to
learn. Each target concept c in C corresponds to some subset of X, or equivalently to
some boolean-valued function c : X → {0, 1}. For example, one target concept c in C
might be the concept "people who are skiers." If x is a positive example of c, then we will
write c(x) = 1; if x is a negative example, c(x) = 0.
The learner L considers some set H of possible hypotheses when attempting to
learn the target concept. For example, H might be the set of all hypotheses describable by
conjunctions of the attributes age and height. After observing a sequence of training
examples of the target concept c, L must output some hypothesis h from H, which is its
estimate of c. To be fair, we evaluate the success of L by the performance of h over new
instances drawn randomly from X according to D, the same probability distribution
used to generate the training data.

Error of a Hypothesis
Because we are interested in how closely the learner's output hypothesis h
approximates the actual target concept c, let us begin by defining the true error of a
hypothesis h with respect to target concept c and instance distribution D.
Definition
The true error (denoted errorD(h)) of hypothesis h with respect to target
concept c and distribution D is the probability that h will misclassify an instance drawn
at random according to D.

errorD(h) ≡ Pr_{x∈D}[c(x) ≠ h(x)]

Here, the notation Pr_{x∈D} indicates that the probability is taken over the instance
distribution D.
Informally, the true error of h is just the error rate we expect when applying h to
future instances drawn according to the probability distribution D.
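
To make this definition concrete, the following small Python sketch estimates errorD(h) by sampling; the distribution, target concept and hypothesis used here are invented purely for illustration and are not part of the notes.

import random

def estimate_true_error(h, c, sample_from_D, n_samples=10000):
    # Monte Carlo estimate of errorD(h) = Pr_{x~D}[c(x) != h(x)].
    mistakes = 0
    for _ in range(n_samples):
        x = sample_from_D()           # draw an instance according to D
        if h(x) != c(x):              # h misclassifies x
            mistakes += 1
    return mistakes / n_samples

# Illustrative setting: D is uniform over (age, height) pairs; the target concept is
# "young and tall", while the hypothesis only tests "tall", so it errs on (old, tall).
sample_from_D = lambda: (random.choice(["young", "old"]), random.choice(["short", "tall"]))
c = lambda x: 1 if x == ("young", "tall") else 0
h = lambda x: 1 if x[1] == "tall" else 0
print(estimate_true_error(h, c, sample_from_D))   # prints a value close to 0.25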

Probably Approximately Correct (PAC) Learnability


The aim of computational learning is to characterize classes of target concepts
that can be reliably learned from a reasonable number of randomly drawn training
examples and a reasonable amount of computation.
We may try to characterize the number of training examples needed to learn a
hypothesis h for which errorD(h) = 0.
Unfortunately, it turns out this is futile in the setting we are considering, for two
reasons.
1. Unless we provide training examples corresponding to every possible
instance in X (an unrealistic assumption), there may be multiple hypotheses
consistent with the provided training examples, and the learner cannot be
certain to pick the one corresponding to the target concept.
2. Given that the training examples are drawn randomly, there will always be
some nonzero probability that the training examples encountered by the
learner will be misleading. (For example, although we might frequently see
skiers of different heights, on any given day there is some small chance that
all observed training examples will happen to be 2 meters tall.)

To accommodate these two difficulties, we weaken our demands on the learner
in two ways.
1. We will not require that the learner output a zero error hypothesis - we will
require only that its error be bounded by some constant, є, that can be made
arbitrarily small.
2. We will not require that the learner succeed for every sequence of randomly
drawn training examples - we will require only that its probability of failure
be bounded by some constant, δ, that can be made arbitrarily small.
In short, we require only that the learner probably learn a hypothesis that is
approximately correct - hence the term probably approximately correct learning, or
PAC learning.
Consider some class C of possible target concepts and a learner L using
hypothesis space H. We can say that the concept class C is PAC-learnable by L using H if,
for any target concept c in C, L will with probability (1 - δ) output a hypothesis h with
errorD(h) < ε, after observing a reasonable number of training examples and performing
a reasonable amount of computation.
More precisely,
Consider a concept class C defined over a set of instances X of length n and a
learner L using hypothesis space H. C is PAC-learnable by L using H if for all
c є C, distributions D over X, ε such that 0 < ε < ½, and δ such that 0 < δ < ½,
learner L will with probability at least (1 - δ) output a hypothesis h є H such
that errorD(h) ≤ ε, in time that is polynomial in 1/ε , 1/δ, n, and size(c).

S ample C omplexity for F inite H ypothesis S paces


PAC-learnability is largely determined by the number of training examples
required by the learner. The growth in the number of required training examples with
problem size is called the sample complexity of the learning problem.
Here, we present a general bound on the sample complexity for a very broad
class of learners, called consistent learners. A learner is consistent if it outputs
hypotheses that perfectly fit the training data, whenever possible. It is quite reasonable
to ask that a learning algorithm be consistent, given that we typically prefer a
hypothesis that fits the training data over one that does not.
We can derive a bound on the number of training examples required by any
consistent learner, independent of the specific algorithm it uses to derive a consistent
hypothesis. To accomplish this, we use the version space. We defined the version space,
VSH,D, to be the set of all hypotheses h є H that correctly classify the training examples D.

VSH,D = {h ∈ H | (∀ <x, c(x)> ∈ D) (h(x) = c(x))}
The significance of the version space here is that every consistent learner
outputs a hypothesis belonging to the version space, regardless of the instance space X,
hypothesis space H, or training data D. The reason is simply that by definition the
version space VSH,D contains every consistent hypothesis in H. Therefore, to bound the
number of examples needed by any consistent learner, we need only bound the number
of examples needed to assure that the version space contains no unacceptable
hypotheses.
The following definition states this condition precisely.
Consider a hypothesis space H, target concept c, instance distribution D,
and set of training examples D of c. The version space VSH,D, is said to be
ε-exhausted with respect to c and D, if every hypothesis h in VSH,D, has
error less than ε with respect to c and D.

(∀h ∈ VSH,D) errorD(h) < ε


The version space VSH,D is the subset of hypotheses h є H, which have zero
training error (say, r = 0). Of course, the true errorD(h) may be nonzero, even for
hypotheses that commit zero errors over the training data. The version space is said to
be ε -exhausted when all hypotheses h remaining in VSH,D have errorD(h) < ε .
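
A concrete illustration may help here. A standard consequence of this ε-exhaustion argument (stated without derivation, and not given explicitly in these notes) is that any consistent learner needs at most m ≥ (1/ε)(ln|H| + ln(1/δ)) randomly drawn training examples. The Python sketch below evaluates this bound for conjunctions of n boolean literals, taking |H| = 3^n because each attribute can appear positively, appear negated, or be absent; the numbers are illustrative only.

from math import log, ceil

def sample_complexity(h_size, epsilon, delta):
    # Bound m >= (1/epsilon) * (ln|H| + ln(1/delta)) for consistent learners.
    return ceil((log(h_size) + log(1.0 / delta)) / epsilon)

# Conjunctions over n = 10 boolean attributes: |H| = 3**10 hypotheses.
print(sample_complexity(3 ** 10, epsilon=0.1, delta=0.05))   # about 140 examples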

The Mistake Bound of Learning


The theory of computational learning considers a variety of different settings, which
differ in:
1. how the training examples are generated (e.g., passive observation of random
examples, active querying by the learner)
2. noise in the data (e.g., noisy or error-free)
3. the definition of success (e.g., the target concept must be learned exactly, or
only probably and approximately)
4. assumptions made by the learner (e.g., regarding the distribution of instances
and whether C ⊆ H)
5. the measure according to which the learner is evaluated (e.g., number of
training examples, number of mistakes, total time).



In the case of mistake bound model of learning, the learner is evaluated by the
total number of mistakes it makes before it converges to the correct hypothesis. As in
the PAC setting, we assume the learner receives a sequence of training examples.
However, here we demand that upon receiving each example x, the learner must predict
the target value c(x), before it is shown the correct target value by the trainer.
The question considered is "How many mistakes will the learner make in its
predictions before it learns the target concept?” This question is significant in
practical settings where learning must be done while the system is in actual use, rather
than during some off-line training stage. For example, if the system is to learn to predict
which credit card purchases should be approved and which are fraudulent, based on
data collected during use, then we are interested in minimizing the total number of
mistakes it will make before converging to the correct target function. Here the total
number of mistakes can be even more important than the total number of training
examples.
This mistake bound learning problem may be studied in various specific settings.
For example, we might count the number of mistakes made before PAC learning the
target concept.
In the examples below, we consider instead the number of mistakes made before
learning the target concept exactly. Learning the target concept exactly means
converging to a hypothesis such that ( ∀ x)h(x) = c(x).

Mistake Bound for the FIND-S Algorithm


Consider the hypothesis space H consisting of conjunctions of up to n boolean
literals l1, …, ln and their negations (e.g., Rich ^ ¬Handsome). The FIND-S algorithm
incrementally computes the maximally specific hypothesis consistent with the training
examples.
A straightforward implementation of FIND-S for the hypothesis space H is as
follows:

FIND-S
• Initialize h to the most specific hypothesis l1 ^ ¬l1 ^ l2 ^ ¬l2 … ln ^ ¬ln
• For each positive training instance x
o Remove from h any literal that is not satisfied by x
• Output hypothesis h.
FIND-S converges in the limit to a hypothesis that makes no errors, provided C ⊆
H and provided the training data is noise-free. FIND-S begins with the most specific
hypothesis (which classifies every instance a negative example), then incrementally
generalizes this hypothesis as needed to cover observed positive training examples.
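
A minimal Python sketch of this procedure follows, representing a hypothesis as the set of literals it still contains; the data layout and function names are illustrative, not taken from the notes.

def find_s(training_examples, n):
    # FIND-S over conjunctions of boolean literals l1..ln and their negations.
    # training_examples: list of (x, label) pairs, where x is a tuple of n booleans.
    # Start with the most specific hypothesis: every literal and its negation.
    h = {(i, v) for i in range(n) for v in (True, False)}
    for x, label in training_examples:
        if label:  # only positive examples generalize the hypothesis
            h = {(i, v) for (i, v) in h if x[i] == v}
    return h   # set of (attribute index, required truth value) pairs

# Example with n = 3, where the target concept is "l1 is true":
examples = [((True, False, True), 1), ((True, True, False), 1), ((False, True, True), 0)]
print(find_s(examples, 3))   # {(0, True)} -- only the literal l1 survives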



Can we prove a bound on the total number of mistakes that FIND-S will make
before exactly learning the target concept c? The answer is yes. To see this, note first
that if c ∈ H, then FIND-S can never mistakenly classify a negative example as positive.
The reason is that its current hypothesis h is always at least as specific as the target
concept c. Therefore, to calculate the number of mistakes it will make, we need only to
count the number of mistakes it will make misclassifying truly positive examples as
negative.
How many such mistakes can occur before FIND-S learns c exactly? Consider the
first positive example encountered by FIND-S. The learner will certainly make a mistake
classifying this example, because its initial hypothesis labels every instance negative.
However, the result will be that half of the 2n terms in its initial hypothesis will be
eliminated, leaving only n terms. For each subsequent positive example that is
mistakenly classified by the current hypothesis, at least one more of the remaining n
terms must be eliminated from the hypothesis. Therefore, the total number of mistakes
can be at most n + 1. This number of mistakes will be required in the worst case,
corresponding to learning the most general possible target concept ( ∀ x)c(x) = 1 and
corresponding to a worst case sequence of instances that removes only one literal per
mistake.



8

Instance-Based Learning



Instance-based learning methods such as nearest neighbor and locally weighted
regression are conceptually straightforward approaches to approximating real-valued
or discrete-valued target functions. Learning in these algorithms consists of simply
storing the presented training data. When a new query instance is encountered, a set of
similar related instances is retrieved from memory and used to classify the new query
instance.
Instance-based methods can use more complex, symbolic representations for
instances. In case-based learning, instances are represented in this fashion and the
process for identifying "neighboring" instances is elaborated accordingly. Case-based
reasoning has been applied to tasks such as storing and reusing past experience at a
help desk, reasoning about legal cases by referring to previous cases, and solving
complex scheduling problems by reusing relevant portions of previously solved
problems.
One disadvantage of instance-based approaches is that the cost of classifying
new instances can be high. This is due to the fact that nearly all computation takes place
at classification time rather than when the training examples are first encountered.
Therefore, techniques for efficiently indexing training examples are a significant
practical issue in reducing the computation required at query time.
A second disadvantage to many instance-based approaches, especially nearest-
neighbor approaches, is that they typically consider all attributes of the instances when
attempting to retrieve similar training examples from memory. If the target concept
depends on only a few of the many available attributes, then the instances that are truly
most "similar" may well be a large distance apart.

k-N earest N eighbor L earning


The most basic instance-based method is the k-NEAREST NEIGHBOR algorithm.
This algorithm assumes all instances correspond to points in the n-dimensional space
Rn. The nearest neighbors of an instance are defined in terms of the standard Euclidean
distance.
More precisely, let an arbitrary instance x be described by the feature vector
(a1(x), a2(x), …, an(x))
where ar(x) denotes the rth attribute of instance x. Then the distance between two
instances xi and xj is defined to be d(xi, xj), where

d(xi, xj) ≡ √( Σ_{r=1..n} (ar(xi) − ar(xj))² )



In nearest-neighbor learning the target function may be either discrete-valued or
real-valued.
Let us first consider learning discrete-valued target functions of the form f : Rn → V,
where V is the finite set {v1, …, vs}. The k-NEAREST NEIGHBOR algorithm for
approximating a discrete-valued target function is given below.

Training algorithm:
• For each training example (x, f(x)), add the example to the list training-examples
Classification algorithm:
• Given a query instance xq to be classified,
o Let x1, …, xk denote the k instances from training-examples that are nearest to xq.
o Return

f̂(xq) ← argmax_{v∈V} Σ_{i=1..k} δ(v, f(xi))

where δ(a, b) = 1 if a = b and δ(a, b) = 0 otherwise.


As shown here, the value f̂(xq) returned by this algorithm as its estimate of f(xq) is
just the most common value of f among the k training examples nearest to xq. If we
choose k = 1, then the 1-NEAREST NEIGHBOR algorithm assigns to f̂(xq) the value f(xi),
where xi is the training instance nearest to xq. For larger values of k, the algorithm
assigns the most common value among the k nearest training examples.
The k-NEAREST NEIGHBOR algorithm is easily adapted to approximating
continuous-valued target functions. To accomplish this, we have the algorithm calculate
the mean value of the k nearest training examples rather than calculate their most
common value. More precisely, to approximate a real-valued target function f : Rn → R
we replace the final line of the above algorithm by the line
f̂(xq) ← ( Σ_{i=1..k} f(xi) ) / k
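
The discrete-valued algorithm above can be written compactly as the following Python sketch (Euclidean distance over real-valued attribute vectors; names and data are illustrative):

from collections import Counter
from math import dist   # Euclidean distance, Python 3.8+

def knn_classify(training_examples, xq, k):
    # training_examples: list of (x, f_x) pairs, x a tuple of n real-valued attributes.
    neighbors = sorted(training_examples, key=lambda ex: dist(ex[0], xq))[:k]
    votes = Counter(f_x for _, f_x in neighbors)
    return votes.most_common(1)[0][0]   # most common target value among the k nearest

data = [((1.0, 1.0), "yes"), ((1.2, 0.8), "yes"), ((5.0, 5.0), "no"), ((5.5, 4.5), "no")]
print(knn_classify(data, (1.1, 1.1), k=3))   # "yes"
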
Distance Weighted Nearest Neighbor Algorithm
One obvious refinement to the k-NEAREST NEIGHBOR algorithm is to weight the
contribution of each of the k neighbors according to their distance to the query point xq,
giving greater weight to closer neighbors. For example, in the above algorithm, which
approximates discrete-valued target functions, we might weight the vote of each
neighbor according to the inverse square of its distance from xq.



f̂(xq) ← argmax_{v∈V} Σ_{i=1..k} wi δ(v, f(xi))

where

wi ≡ 1 / d(xq, xi)²
To accommodate the case where the query point xq exactly matches one of the
training instances xi and the denominator d(xq, xi)² is therefore zero, we assign f̂(xq) to
be f(xi) in this case. If there are several such training examples, we assign the majority
classification among them.
We can distance-weight the instances for real-valued target functions in a similar
fashion, replacing the final line of the algorithm in this case by
f̂(xq) ← ( Σ_{i=1..k} wi f(xi) ) / ( Σ_{i=1..k} wi )

where wi is the inverse square of the distance between xi and xq, as above.
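
A sketch of this distance-weighted, real-valued variant follows (inverse-square weights). The notes handle an exact match by returning f(xi) in the discrete case; averaging the exact matches, as done below, is a natural analogue for the real-valued case and is my assumption here.

from math import dist

def weighted_knn_regress(training_examples, xq, k):
    neighbors = sorted(training_examples, key=lambda ex: dist(ex[0], xq))[:k]
    exact = [f_x for x, f_x in neighbors if dist(x, xq) == 0.0]
    if exact:                                   # query coincides with training point(s)
        return sum(exact) / len(exact)
    weights = [1.0 / dist(x, xq) ** 2 for x, _ in neighbors]
    return sum(w * f_x for w, (_, f_x) in zip(weights, neighbors)) / sum(weights)

data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0)]
print(weighted_knn_regress(data, (0.5,), k=3))   # pulled toward the two closest points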

Locally Weighted Regression


The nearest-neighbor approaches described in the previous section can be
thought of as approximating the target function f(x) at the single query point x = xq.
Locally weighted regression is a generalization of this approach. It constructs an explicit
approximation to f over a local region surrounding xq.
Locally weighted regression uses nearby or distance-weighted training examples
to form this local approximation to f. For example, we might approximate the target
function in the neighborhood surrounding xq using a linear function, a quadratic
function, a multilayer neural network, or some other functional form.
Locally weighted regression is called local because the function is approximated
based only on data near the query point, weighted because the contribution of each
training example is weighted by its distance from the query point, and regression
because this is the term used widely in the statistical learning community for the
problem of approximating real-valued functions.



Given a new query instance xq, the general approach in locally weighted
regression is to construct an approximation f̂ that fits the training examples in the
neighborhood surrounding xq. This approximation is then used to calculate the value
f̂(xq), which is output as the estimated target value for the query instance. The
description of f̂ may then be deleted, because a different local approximation will be
calculated for each distinct query instance.

Locally Weighted Linear Regression


Let us consider the case of locally weighted regression in which the target
function f is approximated near xq using a linear function of the form
f̂(x) = w0 + w1 a1(x) + … + wn an(x)

where ai(x) denotes the value of the ith attribute of the instance x.


For a global approximation to the target function, we derive methods to choose
weights that minimize the squared error summed over the set D of training examples

E ≡ ½ Σ_{x∈D} (f(x) − f̂(x))²

which led us to the gradient descent training rule

Δwj = η Σ_{x∈D} (f(x) − f̂(x)) aj(x)

where η is a constant learning rate.


How shall we modify this procedure to derive a local approximation rather than
a global one? The simple way is to redefine the error criterion E to emphasize fitting the
local training examples.
Three possible criteria are given below. Note we write the error E(xq) to
emphasize the fact that now the error is being defined as a function of the query point
xq.
1. Minimize the squared error over just the k nearest neighbors:

E1(xq) ≡ ½ Σ_{x ∈ k nearest nbrs of xq} (f(x) − f̂(x))²



2. Minimize the squared error over the entire set D of training examples, while
weighting the error of each training example by some decreasing function K of its
distance from xq:

E2(xq) ≡ ½ Σ_{x∈D} (f(x) − f̂(x))² K(d(xq, x))
3. Combine 1 and 2:
E3(xq) ≡ ½ Σ_{x ∈ k nearest nbrs of xq} (f(x) − f̂(x))² K(d(xq, x))
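
The Python sketch below fits criterion 3 by gradient descent: a linear model is trained on the k nearest neighbors of xq, with each example's error weighted by a Gaussian kernel K of its distance. The choice of kernel, learning rate, iteration count and neighborhood size are illustrative assumptions, not prescriptions from the notes.

from math import dist, exp

def locally_weighted_linear(training_examples, xq, k=8, eta=0.01, steps=500, width=1.0):
    # Fit f_hat(x) = w0 + w1*a1(x) + ... + wn*an(x) around the query point xq.
    nbrs = sorted(training_examples, key=lambda ex: dist(ex[0], xq))[:k]
    n = len(xq)
    w = [0.0] * (n + 1)                                    # w[0] is the intercept w0
    f_hat = lambda x: w[0] + sum(w[j + 1] * x[j] for j in range(n))
    kern = lambda x: exp(-dist(x, xq) ** 2 / (2 * width ** 2))   # decreasing kernel K
    for _ in range(steps):
        for x, f_x in nbrs:
            err = kern(x) * (f_x - f_hat(x))               # kernel-weighted residual
            w[0] += eta * err                              # gradient step; a0(x) = 1
            for j in range(n):
                w[j + 1] += eta * err * x[j]
    return f_hat(xq)

data = [((float(i),), 2.0 * i + 1.0) for i in range(10)]   # noise-free f(x) = 2x + 1
print(locally_weighted_linear(data, (3.0,)))               # close to 7.0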



9

Genetic Algorithms
Genetic Algorithms are nondeterministic stochastic search or optimization
methods that utilize the theories of evolution and natural selection to solve a problem
within a complex solution space. They are computer-based problem solving systems,
which use computational models of some of the known mechanisms in evolution as key
elements in their design and implementation. Genetic algorithms are loosely based on
natural evolution and use a “survival of the fittest” technique, wherein the best solutions
survive and are varied until we get a good result.
In a genetic algorithm, a set of candidate solutions to a problem, called
‘chromosomes’, is evaluated and ordered; new candidate solutions are then produced by
selecting candidates as ‘parents’ and applying mutation or crossover operators, which
combine bits of two parents to produce one or more children. The new set of candidates
is evaluated, and this cycle continues until an adequate solution is found.

Preliminaries
Genetic Algorithms are nondeterministic stochastic search or optimization
methods that utilize the theories of evolution and natural selection to solve a problem
within a complex solution space. They are computer-based problem-solving systems,
which use computational models of some of the known mechanisms in evolution as key
elements in their design and implementation. They are a member of a wider population
of algorithms, “Evolutionary Algorithms”.
Genetic Algorithms perform a multi-point search in the problem space. On one
hand, this ensures robustness: getting caught in a local minimum does not mean that the
whole algorithm fails. On the other hand, the search may yield not just one but several
near-optimal solutions to the problem, from which the user can select.
Due to robustness, flexibility and efficiency of Genetic Algorithms, costly
redesigns of artificial systems that are based on them can be avoided. Genetic
Algorithms are theoretically and empirically proven to provide robust search in
complex problem spaces.

Biological Background
Genetic algorithms are inspired by Darwin's theory: solutions to a problem can
be obtained through evolution. All living organisms consist of cells, and each cell contains
the same set of chromosomes. Chromosomes are strings of DNA and serve as a model
for the whole organism. Genes determine a chromosome’s characteristics. Each gene
has several forms or alternatives, which are called alleles, producing differences in the
set of characteristics associated with that gene. The set of chromosomes is called the
genotype, which defines a phenotype (the individual) with a certain fitness.
During reproduction, recombination (or crossover) occurs first: genes from the
parents combine to form a whole new chromosome. The newly created offspring can
then be mutated. Mutation means that the elements of the DNA are slightly changed. These
organism is measured by success of the organism in its life. According to Darwinian
theory, the highly fit individuals are given opportunities to “reproduce” whereas the
least fit members of the population are less likely to get selected for reproduction, and
so “die out”.
In nature, the genes of living creatures are stored as pairs, and each parent
presents only one gene from each pair. This differs from Genetic Algorithms, in which
genes are not stored in pairs. But in both Genetic Algorithms and biological life forms,
only a fraction of the parents’ genes are passed to each offspring.

Encoding of Chromosomes
The first step in a genetic algorithm is to “translate” the real problem into
“biological terms”. The format of a chromosome is called its encoding. There are four
commonly used encoding methods: binary encoding, permutation encoding, direct value
encoding and tree encoding.

i. Binary Encoding:
Binary encoding is the most common and simplest one. In binary encoding, every
chromosome is a string of bits, 0 or 1. For example:
Chromosome A: 0101101100010011
Chromosome B: 1011010110110101

ii. Permutation Encoding:


Permutation encoding can be used in “ordering problems”, such as the traveling
salesman problem or task ordering problems. In permutation encoding, every
chromosome is a string of numbers that represents an ordering (a sequence). For
example:
Chromosome A: 8549102367
Chromosome B: 9102438576



iii. Direct Value Encoding:
Direct value encoding can be used in problems where complicated values such
as real numbers are needed; using binary encoding for this type of problem would be
very difficult. In value encoding, every chromosome is a string of values. The values can
be anything connected to the problem, from numbers, real numbers or characters to
more complicated objects. For example:
Chromosome A: [red], [black], [blue], [yellow], [red], [green]
Chromosome B: 1.8765, 3.9821, 9.1283, 6.8344, 4.116, 2.192
Chromosome C: ABCKDEIFGHNWLSWWEKPOIKNGVCI

iv. Tree Encoding:


Tree encoding is used mainly for evolving programs or expressions, for genetic
programming. In tree encoding every chromosome is a tree of some objects, such as
functions or commands in programming language.

Genetic Algorithm Operators


Various operators are available in the literature for genetic algorithms. Most of
them can be classified into the fundamental categories of reproduction, crossover and
mutation.

i. Initialization
There are many ways to initialize and encode the initial population. It can be
binary or non-binary, fixed or variable length strings and so on. This operator is not of
much significance if the system randomly generates valid chromosomes and evaluates
each one.

ii. Reproduction
Between successive generations, the process by which chromosomes of the
previous generations are retained in the next generations is reproduction. The two
types of reproduction are Generational Reproduction and steady-state reproduction.
 Generational Reproduction
In Generational Reproduction, the whole population is potentially replaced at
each generation. The most often used method is to randomly generate a
population of chromosomes. The next step is to decode chromosomes into
individuals and evaluate the fitness of all individuals, select the fittest individuals, and
generate a new population by selection, crossover and mutation. These steps are
repeated until the termination condition is met.
 Steady State Reproduction
In steady-state reproduction, a population of chromosomes is randomly generated.
The next step is to decode the chromosomes into individuals and evaluate the fitness of
all the individuals, put the fittest individuals into a mating pool, produce a number of
offspring by crossover and mutation, and replace the weakest individuals by the
offspring. These steps are repeated until the termination condition is met.

iii. Selection
According to Darwin's evolution theory the best ones should survive and create
new offspring. There are many methods how to select the best chromosomes, for
example roulette wheel selection, Boltzman selection, tournament selection, rank
selection, spatially oriented selection.
 Roulette Wheel Selection
Parents are selected according to their fitness: the better the chromosomes are,
the more chances they have of being selected. Imagine a roulette wheel (pie chart) on
which all chromosomes in the population are placed according to their normalized
fitness. A random number is then generated which decides the chromosome to be
selected. Chromosomes with bigger fitness values will be selected more often since they
occupy more space on the pie. (A code sketch of this method is given after this list of
selection methods.)
 Rank Selection
The previous selection will have problems when the fitnesses differ very much.
For example, if the best chromosome fitness is 90% of the entire roulette wheel
then the other chromosomes will have very few chances to be selected. Rank
selection first ranks the population and then every chromosome receives fitness
from this ranking. The worst will have fitness 1, second worst 2 etc. and the best
will have fitness N (number of chromosomes in population). After this, all the
chromosomes have a chance to be selected. However, this method can lead to
slower convergence, because the best chromosomes do not differ so much from
other ones.
 Elitism
When creating a new population by crossover and mutation, there is a big chance
that we will lose the best chromosome. Elitism is a method which first copies the best
chromosome (or a few of the best chromosomes) to the new population. The rest
is done in the classical way. Elitism can very rapidly increase the performance of Genetic
Algorithms, because it prevents losing the best solution found.
 Tournament Selection
Tournament selection chooses K parents at random and returns the fittest one
among these. Some other forms of tournament selection exist, like Boltzmann
tournament selection. Marriage tournament selection chooses one parent, has up to K
tries to find a fitter one, and stops at the first of these tries that finds a fitter one. If none
is better than the initial choice, then the initial choice is returned.
 Spatially-oriented selection
Spatially-oriented selection is a local selection method rather than a global one.
That is, the selection competition is between chromosomes in small neighborhoods
rather than across the whole population. This method is based on Wright’s
shifting-balance model of evolution and gives rise to what are termed “Cellular
Genetic Algorithms”.
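
As promised above, here is a minimal Python sketch of roulette wheel selection (fitness-proportionate sampling). The helper name and the use of random.choices are illustrative choices; fitness values must be non-negative.

import random

def roulette_wheel_select(population, fitnesses, n_parents=2):
    # Each chromosome is selected with probability proportional to its fitness.
    return random.choices(population, weights=fitnesses, k=n_parents)

population = ["chr_A", "chr_B", "chr_C", "chr_D"]
fitnesses = [90, 5, 3, 2]            # chr_A occupies 90% of the wheel
print(roulette_wheel_select(population, fitnesses))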

iv. Crossover
The crossover operator is the most important operation in genetic algorithms.
Crossover is a process of yielding recombination of bit strings via an exchange of
segments between pairs of chromosomes. There are many kinds of crossovers. Certain
crossover operators are applicable for binary chromosomes and some other for
permutation chromosomes.
 One-point crossover
A randomly chosen point is taken within the length of the chromosomes, and the
chromosomes are cut at that point. The first child consists of the sub-chromosome of
Parent1 up to the cut point concatenated with the sub-chromosome of Parent2 after the
cut point. The second child is constructed in a similar way (a code sketch of this operator
and of MOX is given after this list of crossover operators). For example:

P1 = 1010101|010
P2 = 1110001|110
The crossover point is between the 7th and 8th bits. Then the offspring will be
O1 = 1010101|110
O2 = 1110001|010



 Two point crossover
The chromosomes are thought of as rings with the first and the last gene
connected, that is, a wrap-around structure. The rings are cut at two randomly chosen
sites and the resulting sub-rings are swapped. For example,
P1 = 10|1010|1010
P2 = 11|1000|1110
The crossover points are between the 2nd and 3rd bits and between the 6th and
7th bits. The offspring will be
O1 = 11|1010|1110
O2 = 10|1000|1010
 Uniform crossover
Each gene of the chromosome is selected randomly from the corresponding gene
of the parents. For example,
P1 = 11111111
P2 = 00000000
Here the eight random numbers generated are 2,2,1,1,1,2,1,2. A random number
2 selects the bit of P2 for the corresponding position (a 0 in this example), and a random
number 1 selects the bit of P1. Then the offspring is
O = 00111010
 N-point crossover
This is similar to one point and two point crossover except that we must select n
positions and only the bits between odd and even crossover positions are
swapped. The bits between even and odd crossovers are unchanged.
 Arithmetic crossover
Here an arithmetic or logical operation such as OR or AND is performed between
the two parents, and the resulting chromosome is the offspring. For example:
P1 = 11001011
P2 = 11011111
The operation AND is performed between the two parents and the resulting
offspring is
O = 11001011



 Partially matched crossover (PMX):
Useful for permutation GA’s. Similar to two-point crossover with swaps done so
that a permutation is finally obtained. For example, a two-point crossover
performed on:
P1: 1234 | 567 | 8
P2: 8521 | 364 | 7
We would get
O1: 1234 | 364 | 8
O2: 8521 | 567 | 7
These offspring are illegal because in O1, 3 and 4 are repeated and 5 and 7 do not occur
at all, while in O2, 3 and 4 do not occur but 5 and 7 occur twice. PMX fixes this problem by
noting that we made the swaps 3 -> 5, 6 -> 6 and 4 -> 7 and then repeating these
swaps on the genes outside the crossover points, giving us
O1’: 12573648
O2’: 83215674
 Cycle crossover
This time we do not pick a crossover point at all. We choose the first gene from
one of the parents
P1: 12345678
P2: 85213647
Say, we pick 1 from P1:
O1 = 1 - - - - - - -
We must pick every element from one of the parents and place it in the position
it previously occupied. Since the first position is occupied by 1, the number 8 from
P2 cannot go there. So, we must now pick the 8 from P1.
O1 = 1 - - - - - - 8
This forces us to put the 7 in position 7 and the 4 in position 4, as in P1.
O1 = 1 - - 4 - - 7 8
Since 1, 4, 7 and 8 occupy the same set of positions in P1 and P2, we finish by filling
in the blank positions with the elements of those positions in P2. Thus,
O1 = 15243678
And we get O2 from the complement of O1:
O2 = 82315647



This process ensures that each chromosome is legal. Notice that it is possible for
us to end up with the offspring being the same as the parents. This is not a
problem since it will usually only occur if the parents have high fitnesses, in
which case, it could still be a good choice.
 Order crossover (OX)
This is like PMX in that we choose two crossover points and cross over the
genes between the two points. However, instead of repairing the chromosome by
swapping the repeated genes, we simply rearrange the rest of the genes to give a legal
permutation. With the chromosomes
P1 = 135 | 762 | 48
P2 = 563 | 821 | 47
We would start by switching the genes between the two crossover points.
O1 = - - - | 821 | - -
O2 = - - - | 762 | --
We then write down the genes from each parent chromosome starting from the
second crossover point.
O1: 48135762
O2: 47563821
Then the genes that were placed between the crossover points are deleted from
these lists. That is, we would delete 8, 2 and 1 from the O1 list and 7, 6 and 2 from the
O2 list to give
O1: 43576
O2: 45381
These are then placed back into the child chromosomes, starting at the second
crossover point.
O1 = 57682143
O2 = 38176245
 Matrix crossover (MX)
For this we have a matrix representation where the element i, j is 1 if there is an
edge from node i to node j and 0 otherwise. This is useful in solving the traveling
salesperson problem. Matrix crossover is the same as one- or two-point
crossover with the operation done on matrices instead of linear chromosomes.



 Modified order crossover (MOX)
This is similar to order crossover. We randomly choose one crossover point in
the parents and as usual, leave the genes before the crossover point as they are.
We then reorder the genes after the crossover point in the order that they appear
in the second parent chromosome. This operator is used in our implementation.
If we have,
P1 = 123 | 456
P2 = 364 | 215
We would get
O1 = 123 | 645
O2 = 364 | 125
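
As a concrete sketch of two of the operators above, the Python code below implements one-point crossover for bit-string chromosomes and modified order crossover (MOX) for permutation chromosomes. Chromosomes are plain Python strings or lists, and the handling of the cut point is an illustrative choice.

import random

def one_point_crossover(p1, p2, point=None):
    # Swap the tails of two equal-length chromosomes after a cut point.
    point = point if point is not None else random.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mox_crossover(p1, p2, point=None):
    # MOX: keep the head of each parent, then reorder its tail genes in the
    # order in which they appear in the other parent.
    point = point if point is not None else random.randint(1, len(p1) - 1)
    o1 = p1[:point] + [g for g in p2 if g in p1[point:]]
    o2 = p2[:point] + [g for g in p1 if g in p2[point:]]
    return o1, o2

print(one_point_crossover("1010101010", "1110001110", point=7))   # ('1010101110', '1110001010')
print(mox_crossover([1, 2, 3, 4, 5, 6], [3, 6, 4, 2, 1, 5], point=3))   # ([1, 2, 3, 6, 4, 5], [3, 6, 4, 1, 2, 5])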

v. Mutation
Mutation has the effect of ensuring that all possible chromosomes are reachable.
For example, suppose a gene position can take any value from one to twenty, but in the
initial population no chromosome has the value 6 in any of its gene positions. Then, with
only the crossover and reproduction operators, the value 6 will never occur in any future
chromosome. The mutation operator can overcome this by randomly selecting a gene
position and changing its value.
Mutation is useful in escaping local minima as it helps explore new regions of
the multidimensional solution space. If the mutation rate is too high, it can cause well-bred
chromosomes to be lost and thus decrease the exploitation of high-fitness regions of the
solution space.
Some systems that use random populations (noisy) created at initialization
phase do not use mutation operators at all. Some mutation operators are:
 Bit Inversion
This operator applies to binary chromosomes. The bits in the chromosome are
inverted (0’s are made 1’s and 1’s are made 0’s) depending on the probability of
mutation. For example, 1000000001 becomes 1010000000, where the third and 10th
bits have been (randomly) mutated.
 Order Changing
This operator can be used on both binary and non-binary gene representations.
A portion of the chromosome is selected and the genes in that region are randomly
permuted. For example, (5 6 3 4 7 3) becomes (5 3 4 6 7 3), where the second, third and
fourth values have been randomly scrambled.



 Value Changing
A value of a gene is changed within a specific range. For example,
(3.4 4.2 4.6 6.4 3.2) becomes (3.4 4.2 4.5 6.4 3.2),
where one value has been changed within a specific range.
 Reciprocal Exchange
Two randomly selected positions in the chromosome are selected and the values
in the chromosome in those positions are swapped. For example, (5 6 2 4 7 3)
gives us (5 6 7 4 2 3) if the randomly selected positions are 3 and 5.
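
Two of these mutation operators can be sketched in Python as follows (bit inversion on a bit string and reciprocal exchange on a permutation; the treatment of the mutation probability is an illustrative choice):

import random

def bit_inversion(bits, p_mut=0.01):
    # Flip each bit independently with probability p_mut.
    return "".join(b if random.random() > p_mut else ("1" if b == "0" else "0") for b in bits)

def reciprocal_exchange(chrom):
    # Swap the values at two randomly selected positions.
    chrom = list(chrom)
    i, j = random.sample(range(len(chrom)), 2)
    chrom[i], chrom[j] = chrom[j], chrom[i]
    return chrom

print(bit_inversion("1000000001", p_mut=0.2))
print(reciprocal_exchange([5, 6, 2, 4, 7, 3]))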

vi. Inversion
In Holland’s founding work on Genetic Algorithms he made mention of another
operator, besides selection, crossover and mutation which takes place in biological
reproduction. This is known as the inversion operator. An inversion is where a portion
of a chromosome detaches from the rest of the chromosome, then changes direction and
recombines with the chromosome. The process of inversion is decidedly more complex
to implement than the other operators involved in genetic algorithms. Inversion has
also attracted a substantial amount of research. For example, consider the Chromosome

Before Inversion:  001001011010100101011010010100101010001010

During Inversion:  00100101101010 | 01010110100101 | 00101010001010
                   (one portion detaches and its order is reversed:
                    01010110100101 becomes 10100101101010)

Recombination:     00100101101010 | 10100101101010 | 00101010001010

After Inversion:   001001011010101010010110101000101010001010



Genetic Algorithms and Traditional Optimization and Search Methods
There are three main types of traditional or conventional search method:
calculus-based, enumerative, and random.
i. Calculus-based methods
Calculus-based methods are also referred to as gradient methods. These
methods use the information about the gradient of the function to guide the direction of
search. If the derivative of the function cannot be computed, because it is discontinuous,
for example, these methods often fail.
Hill Climbing is one method that uses gradient information to find a local best
by moving in the steepest permissible direction.
Calculus-based methods also depend upon the existence of derivatives or well-
defined slope values. But the real world of search is fraught with discontinuities and
vast, multimodal, noisy search spaces.
ii. Enumerative methods
Enumerative methods work within a finite search space, or at least a discretized
infinite search space. The algorithm then starts looking at objective function values at
every point in the space, one at a time.
Enumerative methods search every point in the space. So, if the search space
grows exponentially, or if the problem is NP-hard like the Traveling Salesman Problem,
this method becomes inefficient.
iii. Random search methods
Random search methods are strictly random walks through the search space
while saving the best.
iv. Differences between Genetic Algorithms and conventional search
procedures
Genetic Algorithms differ from conventional optimization/search procedures in that:
 They work with a coding of the parameter set, not the parameters themselves.
 They search from a population of points in the problem domain, not a singular
point.
 They use payoff information as the objective function rather than derivatives of
the problem or auxiliary knowledge.



 They utilize probabilistic transition rules based on fitness rather than
deterministic one.
We can see that both the enumerative and random methods are not efficient
when there is a significantly large search space or significantly difficult problem, as in
the realm of NP-Complete problems. The calculus-based methods are inadequate when
searching a "noisy" search space (one with numerous peaks).
Taken together, these four differences (direct use of coding, search from a
population, blindness to auxiliary information, and randomized operators) contribute to
a genetic algorithm’s robustness and resulting advantage over other more commonly
used techniques.

Similarity Templates (Schemata)


A similarity template (schema) describes a subset of strings with similarities at
certain string positions. Schemata are helpful in answering questions such as how one
string can be similar to its fellow strings, and they encode useful or promising
characteristics found in the population.
For the binary alphabet {0,1}, the similarity template can be described as
operating upon the extended alphabet {0,1, *} where * denotes the don’t care symbol.
For example the schema *11* describes a subset with four members
{0110,0111,1110,1111}. The number of non-* symbols is called the order of the schema.
A bit string of length N is an instance of 2^N schemata, and there are 3^N different
schemata of length N. A population of P bit strings of length N contains between 2^N
and min(P·2^N, 3^N) schemata, so the GA operates implicitly on a number of schemata
much larger than the size of the population.
Chromosomes of length N can be viewed as points in a discrete N-Dimensional
search space. (i.e. vertices of a hypercube). Schemata can be seen as hyperplanes in
such a search space. Low order (i.e. small number of non-* symbols) hyperplanes
include more vertices.
Highly fit, short defining-length schemata (called building blocks) are propagated
from generation to generation by giving exponentially increasing numbers of
representatives to the observed best; all this goes on in parallel with no special
bookkeeping or special memory other than our population of n strings. This processing
leverage is important and we give it a special name, “Implicit Parallelism”.



Working of Genetic Algorithms

Pseudo-Code for Genetic Algorithms


The following is a pseudo-code for general genetic algorithm approach:
0. [Representation] Define a genetic representation of the system.
1. [Start] Generate random population of n chromosomes (suitable solutions for the
problem)
2. [Fitness] Evaluate the fitness of each chromosome in the population
3. [New population] Create a new population by repeating following steps until the
new population is complete
3.1. [Selection] Select two parent chromosomes from a population according to
their fitness (the better fitness, the bigger chance to be selected)
3.2. [Crossover] With a crossover probability cross over the parents to form a new
offspring (children). If no crossover was performed, offspring is an exact copy
of parents.
3.3. [Mutation] With a mutation probability mutate new offspring at each locus
(position in chromosome).
3.4. [Accepting] Place new offspring in a new population
4. [Replace] Use the newly generated population for a further run of the algorithm
5. [Test] If the end condition is satisfied, stop, and return the best solution in current
population
6. [Loop] Go to step 2



Figure: Working of Genetic Algorithms
The pseudo-code is very general, and many things can be implemented differently
for different problems. The first question is how to create chromosomes and what type
of encoding to choose. Connected with this is the choice of the two basic operators of
Genetic Algorithms, crossover and mutation. Furthermore, the selection of parents from
the current population must also be clearly defined.
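
The pseudo-code maps directly onto a short driver loop. The Python sketch below assumes fitness, select (for example, roulette wheel), crossover and mutate functions like the ones discussed in this chapter; their call signatures and all parameter values are illustrative assumptions.

import random

def run_ga(init_population, fitness, select, crossover, mutate,
           p_cross=0.7, p_mut=0.01, max_generations=1000, target_fitness=None):
    population = list(init_population)
    for _ in range(max_generations):
        scores = [fitness(chrom) for chrom in population]          # [Fitness]
        best = max(zip(scores, population), key=lambda t: t[0])
        if target_fitness is not None and best[0] >= target_fitness:
            break                                                  # [Test] end condition met
        new_population = [best[1]]                                 # elitism: keep the best
        while len(new_population) < len(population):
            p1, p2 = select(population, scores, 2)                 # [Selection]
            if random.random() < p_cross:                          # [Crossover]
                p1, p2 = crossover(p1, p2)
            new_population += [mutate(c, p_mut) for c in (p1, p2)] # [Mutation]
        population = new_population[:len(population)]              # [Replace]
    return best[1]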

Fitness Function
The fitness function is a non-negative figure of merit of the chromosome. The
objective function is the basis for the computation of the fitness function, which
provides the Genetic Algorithms with feedback from the environment, feedback used to
direct the population towards areas of the search space characterized by better
solutions.
Generally, the only requirement of a fitness function is to return a value
indicating the quality of the individual solution under evaluation. This gives the modeler
almost unlimited freedom in building the model, therefore a diverse range of modeling
structures can be incorporated into the Genetic Algorithms.
At every evolutionary step, known as a generation, the individuals in the current
population are decoded and evaluated according to some predefined quality criterion,
referred to as the fitness function. To form a new population (the next generation),
individuals are selected according to their fitness.
In many problems, the objective is more naturally stated as the minimization of
some cost function g(x), rather than the maximization of some utility or profit function
u(x). Even if the problem is naturally stated in maximization form, this alone does not
guarantee that the utility function will be non-negative for all x, as we require in the
fitness function. As a result, it is often necessary to map the underlying natural objective
function to a fitness function form through one or more mappings.
In normal Operations Research work, a minimization problem is transformed
into a maximization problem by simply multiplying the cost function by minus one. In
genetic algorithm work, this operation alone is insufficient because the measure thus
obtained is not guaranteed to be non-negative in all instances. With GAs, the following
cost-to-fitness transformation is commonly used:
f(x) = Cmax – g(x), when g(x) < Cmax.
= 0, otherwise.
There are a variety of ways to choose the coefficient Cmax. Cmax may be taken as
an input coefficient, as the largest g value observed thus far, as the largest g value in the
current population, or as the largest of the last k generations.
When the natural objective function formulation is a profit or utility function, we
have no difficulty with the direction of the function: maximized profit or utility leads to
desired performance. We may still have a problem with negative utility function u(x)
values. To overcome this, we simply transform fitness according to the equation:
f(x) = u(x) + Cmin, when u(x) + Cmin > 0
= 0, otherwise
We may choose Cmin as an input coefficient, as the absolute value of the worst u
value in the current or last k generations, or as a function of the population variance.
To avoid premature convergence, wherein the best chromosomes have a large
number of copies right from the initial population and to prevent a random walk
through mediocre chromosomes when the population average fitness is close to the
best fitness, we perform fitness scaling. One way is linear scaling. If raw fitness is f and
scaled fitness is f’, then the linear relationship between f and f’ is as follows:
f’ = a * f + b, where a and b can be chosen in a number of ways.
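
The notes leave the choice of a and b open. One common convention (an assumption here, not taken from the notes) is to choose them so that the average scaled fitness equals the average raw fitness while the best individual receives roughly C times the average, which keeps the selection pressure moderate:

def linear_scaling(raw, C=2.0):
    # Scale so that f'_avg = f_avg and f'_max is about C * f_avg (clamped at zero).
    f_avg, f_max = sum(raw) / len(raw), max(raw)
    if f_max == f_avg:                    # all fitnesses equal: nothing to scale
        return list(raw)
    a = (C - 1.0) * f_avg / (f_max - f_avg)
    b = f_avg * (1.0 - a)
    return [max(0.0, a * f + b) for f in raw]

print(linear_scaling([1.0, 2.0, 3.0, 10.0]))   # average preserved, best is ~2x the average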



System Design
Genetic Algorithms Unit
This subsystem is the actual timetable generation unit. However, this unit has no
knowledge of the constraints; it accepts a fitness value based on the constraint violations
that arise from the data entered.
Some design considerations for genetic algorithms are:
 Encoding
Permutation-based encoding is chosen. The representation satisfies some hard
constraints by its very nature, and advanced genetic algorithm operators are available
for this form of encoding, which motivated the choice of this representation.
 Reproduction
Between successive generations, the process by which chromosomes of the
previous generations are retained in the next generations is reproduction. The
reproduction used in this system is Generational Reproduction, in which the whole
population is potentially replaced at each generation.
 Selection
The selection of chromosomes from a given population is based on Roulette
Wheel Selection. In this type of selection the parents are selected according to their
fitness: the better the chromosomes are, the more chances they have of being selected
from the population.
 Crossover
The crossover method used in the system is Modified Order Crossover (MOX).
MOX is used because it is easy to implement and simple, yet powerful.
For example: If we have the parents P1,P2 as
P1 = 123 | 456
P2 = 364 | 215
We would get the offspring’s as
O1 = 123 | 645
O2 = 364 | 125



 Mutation
The mutation operator which is used in the system is Reciprocal Exchange.
In Reciprocal exchange two randomly selected positions in the chromosome are
selected and the values in the chromosome in those positions are swapped.
 Decoding
The chromosome is in an encoded form; to evaluate it, it is first decoded and then
evaluated.
 Evaluation
Fitness of a chromosome is evaluated by calculating:
F(chr) = Cmax - ( (wh * H(chr)) + ( ws * S(chr)) )
Where chr = chromosome,
F = Function that returns fitness for the chromosome
Cmax = A large integer value (to make the fitness positive)
wh = weight for each hard constraint violation.
ws = weight for each soft constraint violation.
H(chr) = Number of hard constraints that are violated.
S(chr) = Number of soft constraints that are violated.
All the above values are positive.
To choose wh and ws :
wh = ( ws * Smax ) + 1
where wh = the weight of each hard constraint violated.
Smax = total number of soft constraints.
ws = is the weight of each soft constraint.
The weight of each hard constraint violation is the total number of soft
constraints multiplied by the weight of each soft constraint, plus one.
This is done to ensure that chromosomes that satisfy hard constraints are
generated first, followed by better ones that satisfy soft constraints also. Also the
evaluation becomes easier: it is easy to say when a chromosome is fit.
Example: If Smax = 10 , ws = 1, Then wh = 11.
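
Translated directly into code, the evaluation above looks like the following Python sketch; the constraint-counting functions are placeholders to be supplied by the rest of the system.

def timetable_fitness(chromosome, count_hard_violations, count_soft_violations,
                      s_max, ws=1, c_max=2147483647):
    # F(chr) = Cmax - (wh * H(chr) + ws * S(chr)), with wh = ws * Smax + 1,
    # so that a single hard violation always outweighs all soft violations.
    wh = ws * s_max + 1
    h = count_hard_violations(chromosome)
    s = count_soft_violations(chromosome)
    return c_max - (wh * h + ws * s)

# With Smax = 10 and ws = 1 (the example above), wh = 11.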



Figure: GeneticAlgorithm



Genetic Algorithms Implementation
Genetic Algorithm operators:
These form the core function of searching through the problem space for a
solution. For the GAs to work, some parameters have to be set, for example:
Maximum number of Generations = 10000
Maximum population size = 200
Maximum Fitness, Cmax = 2147483647
Fitnessvalue(threshold) = 2147483647
Probability of Crossover, pcross = 0.7
Probability of mutation, pmutation = 0.01
Initial population size = 10
Scaling factor for hard constraint violation = 1000
Scaling factor for soft constraint violation = 1
Selection operator: Roulette Wheel Selection
Crossover operator: Modified Order Crossover
Mutation operator: Reciprocal Exchange



10

Learning Sets of Rules


In many cases it is useful to learn the target function represented as a set of if-
then rules that jointly define the function. One way to learn sets of rules is to first learn
a decision tree, then translate the tree into an equivalent set of rules, one rule for each
leaf node in the tree. A second method is to use a genetic algorithm that encodes each
rule set as a bit string and uses genetic search operators to explore this hypothesis
space.
Here, we explore a variety of algorithms that directly learn rule sets and that
differ from these algorithms in two key respects.
1. First, they are designed to learn sets of first-order rules that contain
variables. This is significant because first-order rules are much more
expressive than propositional rules.
2. Second, the algorithms discussed here use sequential covering algorithms
that learn one rule at a time to incrementally grow the final set of rules.
As an example of first-order rule sets, consider the following two rules that
jointly describe the target concept Ancestor. Here we use the predicate Parent(x, y) to
indicate that y is the mother or father of x, and the predicate Ancestor(x, y) to indicate
that y is an ancestor of x related by an arbitrary number of family generations.
IF Parent(x, y) THEN Ancestor(x, y)
IF Parent(x, z) ^ Ancestor(z, y) THEN Ancestor(x, y)
These two rules compactly describe a recursive function that would be very
difficult to represent using a decision tree or other propositional representation.

Sequential Covering Algorithms


Here we consider a family of algorithms for learning rule sets based on the
strategy of learning one rule, removing the data it covers, then iterating this process.
Such algorithms are called sequential covering algorithms.
To elaborate, imagine we have a subroutine LEARN-ONE-RULE that accepts a set
of positive and negative training examples as input, then outputs a single rule that
covers many of the positive examples and few of the negative examples. We require that
this output rule have high accuracy, but not necessarily high coverage. By high accuracy,
we mean the predictions it makes should be correct. By accepting low coverage, we
mean it need not make predictions for every training example.
Given this LEARN-ONE-RULE subroutine for learning a single rule, one obvious
approach to learning a set of rules is to invoke LEARN-ONE-RULE on all the available
training examples, remove any positive examples covered by the rule it learns, then
invoke it again to learn a second rule based on the remaining training examples. This
procedure can be iterated as many times as desired to learn a disjunctive set of rules
that together cover any desired fraction of the positive examples. This is called a
sequential covering algorithm because it sequentially learns a set of rules that together
cover the full set of positive examples. The final set of rules can then be sorted so that
more accurate rules will be considered first when a new instance must be classified.
A prototypical sequential covering algorithm is described below.
SEQUENTIAL_COVERING(Target_attribute, Attributes, Examples, Threshold)
 Learned_rules  {}
 Rule  LEARN_ONE_RULE(Target_attribute, Attributes, Examples)
 while PERFORMANCE(Rule, Examples) > Threshold, Do
• Learned_rules  Learned_rules + Rule
• Examples  Examples – {examples correctly classified by Rule}
• Rule  LEARN_ONE_RULE(Target_attribute, Attributes, Examples)
 Learned_rules  sort Learned_rules according to PERFORMANCE over Examples
 return Learned_rules
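The following Python sketch is one way the loop above could be realized; it is illustrative rather than a definitive implementation. A rule is assumed to be a (preconditions, prediction) pair, and learn_one_rule and performance are assumed helper functions (one possible learn_one_rule is sketched in the next section):

def correctly_classifies(rule, example, target):
    # a rule covers an example if every precondition matches, and classifies it
    # correctly if its prediction agrees with the example's target value
    preconditions, prediction = rule
    covers = all(example.get(a) == v for a, v in preconditions.items())
    return covers and example[target] == prediction

def sequential_covering(target, attributes, examples, threshold,
                        learn_one_rule, performance):
    learned_rules = []
    remaining = list(examples)
    rule = learn_one_rule(target, attributes, remaining)
    while remaining and performance(rule, remaining, target) > threshold:
        learned_rules.append(rule)
        # remove the examples correctly classified by the newly learned rule
        remaining = [e for e in remaining
                     if not correctly_classifies(rule, e, target)]
        if not remaining:
            break
        rule = learn_one_rule(target, attributes, remaining)
    # more accurate rules are placed first for use at classification time
    learned_rules.sort(key=lambda r: performance(r, examples, target),
                       reverse=True)
    return learned_rules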

This sequential covering algorithm is one of the most widespread approaches to learning disjunctive sets of rules. It reduces the problem of learning a disjunctive set of
rules to a sequence of simpler problems, each requiring that a single conjunctive rule be
learned. Because it performs a greedy search, formulating a sequence of rules without
backtracking, it is not guaranteed to find the smallest or best set of rules that cover the
training examples.
How shall we design LEARN-ONE-RULE to meet the needs of the sequential
covering algorithm? We require an algorithm that can formulate a single rule with high
accuracy, but that need not cover all of the positive examples.

General to Specific Beam Search


One effective approach to implementing LEARN-ONE-RULE is to organize the
hypothesis space search in the same general fashion as the ID3 algorithm, but to follow
only the most promising branch in the tree at each step.



The search begins by considering the most general rule precondition possible (the empty test that matches every instance), then greedily adds the attribute test that most improves rule performance measured over the training examples. Once this test has been added, the process is repeated by greedily
adding a second attribute test, and so on. Like ID3, this process grows the hypothesis by
greedily adding new attribute tests until the hypothesis reaches an acceptable level of
performance.
Unlike ID3, this implementation of LEARN-ONE-RULE follows only a single descendant at each search step: the attribute-value pair yielding the best performance, rather than growing a subtree that covers all possible values of the selected attribute.
This approach to implementing LEARN-ONE-RULE performs a general-to-
specific search through the space of possible rules in search of a rule with high accuracy,
though perhaps incomplete coverage of the data.
As in decision tree learning, there are many ways to define a measure to select
the "best" descendant. To follow the lead of ID3 let us for now define the best
descendant as the one whose covered examples have the lowest entropy.
The general-to-specific search suggested above for the LEARN-ONE-RULE
algorithm is a greedy depth-first search with no backtracking. As with any greedy
search, there is a danger that a suboptimal choice will be made at any step. To reduce
this risk, we can extend the algorithm to perform a beam search.



In beam search, the algorithm maintains a list of the k best candidates at each
step, rather than a single best candidate. On each search step, descendants
(specializations) are generated for each of these k best candidates, and the resulting set
is again reduced to the k most promising members. Beam search keeps track of the most
promising alternatives to the current top-rated hypothesis, so that all of their
successors can be considered at each search step.
This general-to-specific beam search algorithm is used by the CN2 program described by Clark and Niblett (1989).
The algorithm is described below (generate-and-test approach).
LEARN-ONE-RULE(Target_Attribute, Attributes, Examples, k)
Returns a single rule that covers some of the Examples. Conducts a general-to-specific greedy beam search for the best rule, guided by the PERFORMANCE metric.
 Initialize Best_hypothesis to the most general hypothesis φ
 Initialize Candidate_hypotheses to the set {Best_hypothesis}
 While Candidate_hypotheses is not empty, Do
1. Generate the next more specific Candidate_hypotheses
• All_constraints  the set of all constraints of the form (a = v), where a is a
member of Attributes and v is a value of a that occurs in the current set of
Examples
• New_candidate_hypotheses 
for each h in Candidate_hypotheses, and for each c in All_constraints,
• create a specialization of h by adding the constraint c
• Remove from New_candidate_hypotheses any hypotheses that are duplicates,
inconsistent, or not maximally specific
2. Update Best_hypothesis
• For all h in New_candidate_hypotheses Do
• If (PERFORMANCE(h, Examples, Target_attribute)
> PERFORMANCE(Best_hypothesis,Examples, Target_attribute))
Then Best_hypothesis  h
3. Update Candidate_hypotheses
• Candidate_hypotheses  the k best members of New_candidate_ hypotheses,
according to the PERFORMANCE measure.
 Return a rule of the form
“IF Best_hypothesis THEN prediction”
where prediction is the most frequent value of Target_attribute among those
Examples that match Best_hypothesis.

PERFORMANCE(h, Examples, Target_attribute)


 h_examples  the subset of Examples that match h
 return -Entropy(h_examples), where entropy is computed with respect to Target_attribute, so that better (purer) rules receive higher scores
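As a rough illustration of the algorithm above, the sketch below implements the beam search in Python under simplifying assumptions: hypotheses are dictionaries of attribute-value constraints, examples are dictionaries that include the target attribute, and PERFORMANCE is the negative entropy of the covered examples. The checks for inconsistent or non-maximally-specific hypotheses are reduced to simple deduplication, so this is a sketch rather than a faithful CN2 implementation:

from collections import Counter
from math import log2

def matches(hypothesis, example):
    # an example satisfies a hypothesis if it meets every attribute-value constraint
    return all(example.get(a) == v for a, v in hypothesis.items())

def neg_entropy(hypothesis, examples, target):
    covered = [e[target] for e in examples if matches(hypothesis, e)]
    if not covered:
        return float("-inf")
    counts = Counter(covered)
    return sum((c / len(covered)) * log2(c / len(covered)) for c in counts.values())

def learn_one_rule(target, attributes, examples, k):
    best = {}                        # most general hypothesis: no constraints
    candidates = [best]
    while candidates:
        # generate specializations by adding one new attribute-value constraint
        new_candidates = []
        for h in candidates:
            for a in attributes:
                if a in h:
                    continue
                for v in {e[a] for e in examples}:
                    new_candidates.append({**h, a: v})
        # remove duplicate hypotheses
        new_candidates = [dict(t) for t in
                          {tuple(sorted(h.items())) for h in new_candidates}]
        for h in new_candidates:
            if neg_entropy(h, examples, target) > neg_entropy(best, examples, target):
                best = h
        # keep only the k most promising candidates for the next step
        new_candidates.sort(key=lambda h: neg_entropy(h, examples, target),
                            reverse=True)
        candidates = new_candidates[:k]
    covered = [e[target] for e in examples if matches(best, e)]
    prediction = Counter(covered).most_common(1)[0][0]
    return best, prediction          # IF best THEN prediction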



Remarks:
1. Each hypothesis considered in the main loop of the algorithm is a conjunction
of attribute-value constraints.
2. Each of these conjunctive hypotheses corresponds to a candidate set of
preconditions for the rule to be learned and is evaluated by the entropy of the
examples it covers.
3. The search considers increasingly specific candidate hypotheses until it
reaches a maximally specific hypothesis that contains all available attributes.
4. The rule that is output by the algorithm is the rule encountered during the
search whose PERFORMANCE is greatest, not necessarily the final hypothesis
generated in the search.
5. The post-condition for the output rule is chosen only in the final step of the
algorithm, after its precondition (represented by the variable
Best_hypothesis) has been determined.
6. The algorithm constructs the rule post-condition to predict the value of the
target attribute that is most common among the examples covered by the
rule precondition.
Finally, note that despite the use of beam search to reduce the risk, the greedy
search may still produce suboptimal rules. However, even when this occurs, the
SEQUENTIAL-COVERING algorithm can still learn a collection of rules that together
cover the training examples, because it repeatedly calls LEARN-ONE-RULE on the
remaining uncovered examples.

L earning R ule S ets: S ummary


There are several key dimensions in the design space of rule learning algorithms.

Dimension-1
Sequential covering algorithms learn one rule at a time, removing the covered
examples and repeating the process on the remaining examples.
In contrast, decision tree algorithms such as ID3 learn the entire set of disjuncts
simultaneously as part of the single search for an acceptable decision tree. We might,
therefore, call algorithms such as ID3 simultaneous covering algorithms, in contrast to
sequential covering algorithms such as CN2.
Which should we prefer? The key difference occurs in the choice made at the
most primitive step in the search. At each search step ID3 chooses among alternative



attributes by comparing the partitions of the data they generate. In contrast, CN2
chooses among alternative attribute-value pairs, by comparing the subsets of data they
cover.
One way to see the significance of this difference is to compare the number of
distinct choices made by the two algorithms in order to learn the same set of rules. To
learn a set of n rules, each containing k attribute-value tests in their preconditions,
sequential covering algorithms will perform (n * k) primitive search steps, making an
independent decision to select each precondition of each rule. In contrast, simultaneous
covering algorithms will make many fewer independent choices, because each choice of
a decision node in the decision tree corresponds to choosing the precondition for the
multiple rules associated with that node. In other words, if the decision node tests an
attribute that has m possible values, the choice of the decision node corresponds to
choosing a precondition for each of the m corresponding rules.
Thus, sequential covering algorithms such as CN2 make a larger number of
independent choices than simultaneous covering algorithms such as ID3.
Still, the question remains, which should we prefer? The answer may depend on
how much training data is available. If data is plentiful, then it may support the larger
number of independent decisions required by the sequential covering algorithm,
whereas if data is scarce, the "sharing" of decisions regarding preconditions of different
rules may be more effective.
An additional consideration is the task-specific question of whether it is
desirable that different rules test the same attributes. In the simultaneous covering
decision tree learning algorithms, they will. In sequential covering algorithms, they need
not.

Dimension-2
In the LEARN-ONE-RULE algorithm described above, the search is from general-
to-specific hypotheses. Other algorithms we have discussed (e.g., FIND-S) search from
specific-to-general.
One advantage of general to specific search here is that there is a single
maximally general hypothesis from which to begin the search, whereas there are very
many specific hypotheses in most hypothesis spaces (i.e., one for each possible
instance).



Dimension-3
A third dimension is whether the LEARN-ONE-RULE search is a generate-and-test search through the syntactically legal hypotheses, as it is in our suggested implementation, or whether it is example-driven, so that individual training examples constrain the generation of hypotheses.
Example-driven search contrasts with the generate-and-test search of LEARN-ONE-RULE (discussed earlier), in which successor hypotheses are generated based only on the syntax of the
hypothesis representation. The training data is considered only after these candidate
hypotheses are generated and is used to choose among the candidates based on their
performance over the entire collection of training examples.
One important advantage of the generate-and-test approach is that each choice in
the search is based on the hypothesis performance over many examples, so that the
impact of noisy data is minimized.
Prototypical example-driven search algorithms include the FIND-S and
CANDIDATE-ELIMINATION algorithms. In each of these algorithms, the generation or
revision of hypotheses is driven by the analysis of an individual training example, and
the result is a revised hypothesis designed to correct performance for this single
example.
However, example-driven algorithms that refine the hypothesis based on
individual examples are more easily misled by a single noisy training example and are
therefore less robust to errors in the training data.

Dimension-4
A fourth dimension is whether and how rules are post-pruned. As in decision
tree learning, it is possible for LEARN-ONE-RULE to formulate rules that perform very
well on the training data, but less well on subsequent data. As in decision tree learning,
one way to address this issue is to post-prune each rule after it is learned from the
training data. In particular, preconditions can be removed from the rule whenever this
leads to improved performance over a set of pruning examples distinct from the
training examples.
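A minimal Python sketch of this post-pruning step is given below; the (preconditions, prediction) rule representation and the rule_accuracy helper are assumptions carried over from the earlier sketches, not part of the original text:

def post_prune(rule, pruning_examples, rule_accuracy):
    # greedily drop preconditions as long as accuracy on the pruning set improves
    preconditions, prediction = rule
    preconditions = dict(preconditions)
    improved = True
    while improved and preconditions:
        improved = False
        current_acc = rule_accuracy((preconditions, prediction), pruning_examples)
        for attr in list(preconditions):
            candidate = {a: v for a, v in preconditions.items() if a != attr}
            if rule_accuracy((candidate, prediction), pruning_examples) > current_acc:
                preconditions = candidate
                improved = True
                break
    return preconditions, prediction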



Dimension-5
A fifth dimension is the particular definition of rule PERFORMANCE used to
guide the search in LEARN-ONE-RULE.
Various evaluation functions have been used. Some common evaluation
functions include:

1. Relative Frequency
Let n denote the number of examples the rule matches and let nc denote the
number of these that it classifies correctly. The relative frequency estimate of rule
performance is nc/n.

2. m-estimate of accuracy
This accuracy estimate is biased toward the default accuracy expected of the
rule. It is often preferred when data is scarce and the rule must be evaluated based on
few examples.
Let n and nc denote the number of examples matched and correctly predicted by
the rule. Let p be the prior probability that a randomly drawn example from the entire
data set will have the classification assigned by the rule (e.g., if 12 out of 100 examples
have the value predicted by the rule, then p = .12). Finally, let m be the weight, or
equivalent number of examples for weighting this prior p. The m-estimate of rule
accuracy is (nc+mp)/(n+m).
Note if m is set to zero, then the m-estimate becomes the above relative
frequency estimate. As m is increased, a larger number of examples is needed to
override the prior assumed accuracy p.
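A small Python sketch of these two accuracy estimates (the numbers are illustrative only):

def relative_frequency(nc, n):
    # fraction of matched examples that the rule classifies correctly
    return nc / n

def m_estimate(nc, n, p, m):
    # accuracy biased toward the prior p, with m acting as an equivalent sample size
    return (nc + m * p) / (n + m)

print(relative_frequency(8, 10))     # 0.8
print(m_estimate(8, 10, 0.12, 0))    # 0.8: with m = 0 it reduces to relative frequency
print(m_estimate(8, 10, 0.12, 10))   # 0.46: (8 + 1.2) / 20, pulled toward the prior p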

3. Entropy
This is the measure used by the PERFORMANCE subroutine in the generate-
and-test algorithm. Let S be the set of examples that match the rule preconditions.
Entropy measures the uniformity of the target function values for this set of examples.
We take the negative of the entropy so that better rules will have higher scores.
-Entropy(S) = \sum_{i=1}^{c} p_i \log_2 p_i

where c is the number of distinct values the target function may take on, and where pi is
the proportion of examples from S for which the target function takes on the ith value.
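A corresponding Python sketch of the entropy-based measure (the sample target values are illustrative):

from collections import Counter
from math import log2

def neg_entropy(covered_target_values):
    # negative entropy of the target values covered by a rule's preconditions,
    # so that purer (better) rules receive higher scores
    n = len(covered_target_values)
    counts = Counter(covered_target_values)
    return sum((c / n) * log2(c / n) for c in counts.values())

print(neg_entropy(["yes"] * 9 + ["no"]))   # about -0.469: nearly pure, high score
print(neg_entropy(["yes", "no"] * 5))      # -1.0: maximally mixed, low score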



This entropy measure, combined with a test for statistical significance, is used in
the CN2 algorithm of Clark and Niblett (1989). It is also the basis for the information
gain measure used by many decision tree learning algorithms.

