
# Machine Learning, Expert Systems and Fuzzy Logic

Prepared by Kristian Guillaumier
Dept. of Intelligent Computer Systems
University of Malta
2011
Most material in these slides is adapted from:
- Machine Learning: Tom Mitchell (get this book).
- Introduction to Machine Learning: Ethem Alpaydin.
- Introduction to Expert Systems: Peter Jackson.
- An Introduction to Fuzzy Logic and Fuzzy Sets: Buckley, Eslami.
- Pattern Recognition and Machine Learning: Christopher Bishop.
- Grammatical Inference: Colin de la Higuera.
- Artificial Intelligence: Negnevitsky.
- Miscellaneous web references.

Kristian Guillaumier, 2011 2
CONCEPT LEARNING, FIND-S, CANDIDATE ELIMINATION
(main source: Mitchell)
Note on Induction
If a large number of items I have seen so far all possess some property, then I infer that ALL items possess that property.
So far the sun has always risen.
Are all swans white?

We never know whether our induction is true (we have not proven it).

In machine learning:
Input: a number of training examples for some function.
Output: a hypothesis that approximates the function.
Concept Learning
Learning: inducing general functions from specific training examples (+ve and/or -ve).
Concept learning: inducing the definition of a general category (e.g. "cat") from a sample of +ve and -ve training data.
Search a space of potential hypotheses (the hypothesis space) for a hypothesis that best fits the training data provided.
Each concept can be viewed as the description of some subset defined over a larger set (there is a general-to-specific ordering). E.g.:
Cat ⊂ Feline ⊂ Animal ⊂ Object
Equivalently, a concept is a boolean function over the larger set, e.g. the IsCat() function over the set of animals.
Concept learning = learning this function from training data.
Example
Learn the concept: "Good days when I like to swim".
We want to learn the function:
IsGoodDay(input) → true/false.
Our hypothesis is represented as a conjunction of constraints on attributes. In each line below, the name on the left is the attribute and the values on the right are its possible constraints:
Sky → Sunny/Rainy/Cloudy.
AirTemp → Warm/Cold.
Humidity → High/Normal.
Wind → Strong/Weak.
Water → Warm/Cool.
Forecast → Same/Change.
Other possible constraints (values) for an attribute:
? → I don't care, any value is acceptable.
∅ → no value is acceptable.
Example
Our hypothesis, then, is a vector of constraints on
these attributes:
<Sky, AirTemp, Humidity, Wind, Water, Forecast>

An example of a hypothesis is (only on warm days with normal humidity):
<?, Warm, Normal, ?, ?, ?>

The most general hypothesis is:
<?, ?, ?, ?, ?, ?>

The most specific hypothesis is:
<∅, ∅, ∅, ∅, ∅, ∅>
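The constraint vector above maps naturally onto a tuple. The following is a minimal Python sketch, not part of the original slides: the names `ANY`, `EMPTY` and `satisfies` are assumptions made for illustration, and `None` stands in for the ∅ constraint.

```python
ANY = "?"      # "don't care": any value is acceptable
EMPTY = None   # stand-in for the empty constraint: no value is acceptable

MOST_GENERAL = (ANY,) * 6
MOST_SPECIFIC = (EMPTY,) * 6

def satisfies(x, h):
    """Return True iff instance x satisfies hypothesis h, i.e. h(x) = 1."""
    # Each constraint must be "?" or equal the instance's value.
    # An EMPTY constraint matches nothing, so the conjunction fails.
    return all(c == ANY or c == v for c, v in zip(h, x))

# Attribute order: Sky, AirTemp, Humidity, Wind, Water, Forecast
h = (ANY, "Warm", "Normal", ANY, ANY, ANY)
day1 = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")
day2 = ("Sunny", "Warm", "High", "Strong", "Warm", "Same")

print(satisfies(day1, h))  # True: warm day with normal humidity
print(satisfies(day2, h))  # False: humidity is High
```

Because a hypothesis is a conjunction, a single failing constraint is enough to classify the instance as -ve.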
Example
Training Examples:

| Sky | AirTemp | Humidity | Wind | Water | Forecast | IsGoodDay |
|---|---|---|---|---|---|---|
| Sunny | Warm | Normal | Strong | Warm | Same | Yes |
| Sunny | Warm | High | Strong | Warm | Same | Yes |
| Rainy | Cold | High | Strong | Warm | Change | No |
| Sunny | Warm | High | Strong | Cool | Change | Yes |
Notation
The set of all items over which the concept is defined is called the set of instances, X.
E.g. the set of all days represented by the attributes AirTemp, Humidity, ...
E.g. the set of all animals, etc.
An instance in X is denoted by x ($x \in X$).
The concept to be learnt (e.g. cats over animals, good days to swim over all days) is called the target concept, denoted by c (note that c is the target function).
c is a Boolean-valued function defined over the instances X, i.e. the function c takes an instance $x \in X$, and in our example:
c(x) = 1 if IsGoodDay is Yes.
c(x) = 0 if IsGoodDay is No.
Notation
When learning the target concept c, the learner is given a training set that consists of:
A number of instances x from X.
For each instance, the value of the target concept c(x).
Instances where c(x) = 1 are called +ve training examples (members of the target concept).
Instances where c(x) = 0 are called -ve training examples (non-members).
A training example is usually denoted by:
<x, c(x)>
i.e. an instance and its target concept value.
Notation
Given the set of training examples, we want to learn (hypothesize) the target concept c.
H is the set of all hypotheses that we are considering (all the possible combinations of constraints on <Sky, AirTemp, Humidity, Wind, Water, Forecast>).
Each hypothesis in H is denoted by h and is usually a boolean-valued function $h : X \to \{0, 1\}$.
The goal of the learner is to find an h such that:
$h(x) = c(x)$ for all $x \in X$.
Concept Learning Task (Mitchell pg. 22)
The Inductive Learning Hypothesis
Our main goal is to find a hypothesis h that is identical to c for all $x \in X$ (for every possible instance).
The only information we have on c is the training data.
What about unseen instances (where we don't have training data)?
At best we can guarantee that our learner will learn a hypothesis that fits the training data exactly (not good enough).
We make an assumption, the inductive learning hypothesis:

Any hypothesis that approximates the target function well over a sufficiently large set of training examples will also approximate the target function well for other unobserved/unseen examples.
Size and Ordering of the Search Space
We must search the hypothesis space for the best hypothesis (the one that matches c).
The size of the hypothesis space is defined by the hypothesis representation.
Recall:
Sky → Sunny/Rainy/Cloudy (3 options).
AirTemp → Warm/Cold (2 options).
Humidity → High/Normal (2 options).
Wind → Strong/Weak (2 options).
Water → Warm/Cool (2 options).
Forecast → Same/Change (2 options).
Which means that we have $3 \times 2 \times 2 \times 2 \times 2 \times 2 = 96$ distinct instances.
With the addition of the symbols ? and ∅ (2 extra options per attribute) we have $5 \times 4 \times 4 \times 4 \times 4 \times 4 = 5120$ syntactically distinct hypotheses.
However, note that any hypothesis that contains one or more ∅s classifies every instance as -ve by definition. So the number of semantically distinct hypotheses is:
$1 + (4 \times 3 \times 3 \times 3 \times 3 \times 3) = 973$
Here the 1 represents all the definitely negative hypotheses (those containing ∅), and each factor is the number of distinct values of an attribute plus 1 for the ?.
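The three counts above can be checked with a few lines of arithmetic. This is a small illustrative sketch; the variable names are assumptions, not part of the slides.

```python
from math import prod

# Number of values per attribute: Sky, AirTemp, Humidity, Wind, Water, Forecast
values_per_attribute = [3, 2, 2, 2, 2, 2]

# Distinct instances: one concrete value per attribute.
instances = prod(values_per_attribute)

# Syntactically distinct hypotheses: each slot may also hold "?" or the empty constraint.
syntactic = prod(n + 2 for n in values_per_attribute)

# Semantically distinct hypotheses: each slot holds a value or "?",
# plus a single hypothesis standing for everything that contains the empty constraint.
semantic = 1 + prod(n + 1 for n in values_per_attribute)

print(instances, syntactic, semantic)  # 96 5120 973
```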
Size and Ordering of the Search Space
Consider the following hypotheses:
h1: <Sunny, ?, ?, Strong, ?, ?>
h2: <Sunny, ?, ?, ?, ?, ?>
Consider the sets of instances that are classified as +ve by h1 and h2.
Since h2 has fewer constraints, it will classify more instances as +ve than h1.
Clearly, anything that is classified as +ve by h1 will also be classified as +ve by h2.
We say that h2 is more general than h1.
This allows us to organize the search space (order the set) according to this relationship between the hypotheses in it.
This ordering is very important because there are concept learning algorithms that make use of it.
Size and Ordering of the Search Space
For any instance $x \in X$ and hypothesis $h \in H$, x satisfies h iff $h(x) = 1$.
The "is more general than or equal to" relationship is denoted by $\ge_g$.
Given two hypotheses $h_j$ and $h_k$, $h_j \ge_g h_k$ iff any instance that satisfies $h_k$ also satisfies $h_j$:

$h_j \ge_g h_k \iff (\forall x \in X)[(h_k(x) = 1) \rightarrow (h_j(x) = 1)]$
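The definition of the more-general-or-equal relation can be tested directly by enumerating the 96 instances. The sketch below is illustrative only: `DOMAINS`, `satisfies` and `more_general_or_equal` are assumed names, and `h3` is a hypothetical third hypothesis chosen to show two hypotheses that are not comparable.

```python
from itertools import product

# Attribute order: Sky, AirTemp, Humidity, Wind, Water, Forecast
DOMAINS = (
    ("Sunny", "Rainy", "Cloudy"),
    ("Warm", "Cold"),
    ("Normal", "High"),
    ("Strong", "Weak"),
    ("Warm", "Cool"),
    ("Same", "Change"),
)

def satisfies(x, h):
    """h(x) = 1 iff every constraint is "?" or equals the instance's value."""
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general_or_equal(hj, hk):
    """hj >=_g hk iff every instance that satisfies hk also satisfies hj."""
    return all(satisfies(x, hj) for x in product(*DOMAINS) if satisfies(x, hk))

h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?", "?", "?")
h3 = ("Sunny", "?", "?", "?", "Cool", "?")  # hypothetical, for illustration

print(more_general_or_equal(h2, h1))  # True: h2 only drops a constraint of h1
print(more_general_or_equal(h1, h2))  # False
print(more_general_or_equal(h1, h3), more_general_or_equal(h3, h1))  # False False
```

The last line shows why the ordering is only partial: neither of h1 and h3 is more general than the other.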
Size and Ordering of the Search Space
Just as we have defined $\ge_g$, it is useful to define: "strictly more general than" ($>_g$), "more specific than or equal to", ...
Size and Ordering of the Search Space
[Figure: the set of all instances X. Each hypothesis corresponds to a subset of X (arrows).]
h2 contains h1 and h3 because it is more general.
h1 and h3 are neither more general nor more specific than each other.
Size and Ordering of the Search Space
Notes:
$\ge_g$ is a partial order over the hypothesis space H (it is reflexive, antisymmetric, and transitive).

Reflexivity, or Reflexive Relation
A reflexive relation is a binary relation R over a set S where every element in S is related to itself.
That is, $\forall x \in S$, $xRx$ holds true.
For example, the relation $\le$ over $\mathbb{Z}^+$ is reflexive because $\forall x \in \mathbb{Z}^+$, $x \le x$.
Transitivity, or Transitive Relation
A transitive relation is a binary relation R over a set S where $\forall a, b, c \in S$: $aRb \land bRc \rightarrow aRc$.
For example, the relation $\le$ over $\mathbb{Z}^+$ is transitive because $\forall a, b, c \in \mathbb{Z}^+$, if $a \le b$ and $b \le c$ then $a \le c$.
Antisymmetry, or Antisymmetric Relation
An antisymmetric relation is a binary relation R over a set S where $\forall a, b \in S$: $aRb \land bRa \rightarrow a = b$.
An equivalent way of stating this is that $\forall a, b \in S$: $aRb \land a \ne b \rightarrow \neg bRa$.
For example, the relation $\le$ over $\mathbb{Z}^+$ is antisymmetric because $\forall a, b \in \mathbb{Z}^+$, if $a \le b$ and $b \le a$ then $a = b$.
The Find-S Algorithm
Initialise h to the most specific hypothesis in H.

For each +ve training instance x
    For each attribute constraint a_i in h
        If constraint a_i is satisfied by x Then
            Do nothing
        Else
            Replace a_i in h by the next more general constraint that is satisfied by x

Return h
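The pseudocode above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `find_s` and `TRAINING` are names introduced here, `None` stands in for the ∅ constraint, and the data is the four-example swimming table.

```python
ANY = "?"

# (instance, IsGoodDay) pairs; attribute order:
# Sky, AirTemp, Humidity, Wind, Water, Forecast
TRAINING = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

def find_s(examples, n_attributes=6):
    """Return the most specific conjunctive hypothesis covering all +ve examples."""
    h = [None] * n_attributes  # None: no value acceptable (most specific)
    for x, positive in examples:
        if not positive:
            continue  # Find-S ignores -ve examples
        for i, value in enumerate(x):
            if h[i] is None:
                h[i] = value  # empty constraint: next more general is the value itself
            elif h[i] != ANY and h[i] != value:
                h[i] = ANY    # two different values seen: generalise to "?"
    return tuple(h)

print(find_s(TRAINING))
# ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```

For conjunctive hypotheses the "next more general constraint" is always unique, which is why the loop never has to branch.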
The Find-S Algorithm
Initialise h to the most specific hypothesis:
h = <∅, ∅, ∅, ∅, ∅, ∅>
Start with the first +ve training example:
x = <Sunny, Warm, Normal, Strong, Warm, Same>
Consider the first attribute, Sky. Our hypothesis says ∅ (most specific) but our training example says Sunny (more general), so the constraint in the hypothesis is not satisfied by x. Pick the next more general constraint after ∅ that is satisfied by x, which is Sunny.
Repeat for all attributes (AirTemp, Humidity, ...) until we get:
h = <Sunny, Warm, Normal, Strong, Warm, Same>
After having covered the 1st training example, h is more general than what we started with. However, it is still very specific: it will classify all possible instances as -ve except the one +ve training example it has seen.
Continue with the next +ve training example.
The Find-S Algorithm
The next example is:
x = <Sunny, Warm, High, Strong, Warm, Same>
Recall that so far:
h = <Sunny, Warm, Normal, Strong, Warm, Same>
Loop through all the attributes: 1st satisfied, do nothing; 2nd satisfied, do nothing; 3rd not satisfied, pick the next more general constraint; etc.
x = <Sunny, Warm, High, Strong, Warm, Same>
h = <Sunny, Warm, Normal, Strong, Warm, Same>

new h = <Sunny, Warm, ?, Strong, Warm, Same>

The 3rd training example is -ve, so it is skipped. The 4th generalises Water and Forecast, and we complete the algorithm to get the hypothesis:
h = <Sunny, Warm, ?, Strong, ?, ?>
The Find-S Algorithm - Observation
Remember that the algorithm skipped the 3rd training example because it was -ve.
However, we observe that the hypothesis we had generated so far was already consistent with this -ve training example.
After considering the 2nd training example, our hypothesis was:
h = <Sunny, Warm, ?, Strong, Warm, Same>
The 3rd training example (that was skipped) was:
x = <Rainy, Cold, High, Strong, Warm, Change>
Note that h is already consistent with this training example: it classifies it as -ve.
The Find-S Algorithm - Observation
As long as the hypothesis space H contains the hypothesis that describes the target concept c and there are no errors in the training data, the current hypothesis can never be inconsistent with a -ve training example.

To see why:
h is the most specific hypothesis in H that is consistent with the currently observed +ve training examples.
Since we assume the target concept c is in H and is (obviously) consistent with the positive training examples, c must be $\ge_g$ h.
But c will never cover a -ve example, so neither will h (by definition of $\ge_g$).
Clearly, if a more general hypothesis does not misclassify a -ve example, the more specific one cannot misclassify it either.

Consider Animal $\ge_g$ Cat:
If cat then animal.
If animal then maybe cat.
If not cat then maybe not animal.
If not animal then not cat (if a -ve example is correctly classified by the general hypothesis then it is correctly classified by the specific one).
The Find-S Algorithm - Issues Raised
Find-S is guaranteed to find the most specific $h \in H$ that is consistent with the +ve and -ve training examples, assuming that the training examples are correct (no noise) and that H contains the target concept.
Issues:
We don't know whether we found the only hypothesis that is consistent. There might be more hypotheses that are consistent.
Find-S will find the most specific hypothesis consistent with the training data. Why not the most general consistent hypothesis? Why not something in between?
How do we know if the training data is consistent? In real-life cases, training data may contain noise or errors.
There might be more than one maximally specific hypothesis: which one do we pick?
The Candidate Elimination (CE) Algorithm
Note that although the hypothesis output by Find-S is consistent with the training data, it is only one of the possibly many hypotheses that are consistent.
CE will output (a description of) all the hypotheses consistent with the training data.
Interestingly, it does so without enumerating the whole space.

CE finds all describable hypotheses that are consistent with the observed training examples.

Defn: h is consistent with a set of training examples D iff h(x) = c(x) for each example <x, c(x)> in D.

$Consistent(h, D) \equiv (\forall \langle x, c(x) \rangle \in D)\; h(x) = c(x)$
The Candidate Elimination (CE) Algorithm
The subset of all the hypotheses that are consistent with the training data (what CE finds) is called the version space with respect to the hypothesis space H and the training data D: it contains all the possible, consistent versions of the target concept.

Defn: the version space, denoted $VS_{H,D}$, with respect to the hypothesis space H and the training data D is the subset of hypotheses from H consistent with the training data in D.

$VS_{H,D} \equiv \{h \in H \mid Consistent(h, D)\}$
The List-Then-Eliminate (LTE) Algorithm
A possible representation of a version space is a listing of all the elements (hypotheses) in it.

List-Then-Eliminate:

VersionSpace = generate all hypotheses in H
For each training example <x, c(x)>:
    remove from VersionSpace any hypothesis h where h(x) ≠ c(x)

Return VersionSpace
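List-Then-Eliminate is small enough to run on the swimming example. The sketch below is an illustration with assumed names (`DOMAINS`, `TRAINING`, `list_then_eliminate`); it enumerates all 5120 syntactically distinct hypotheses, using `None` for the ∅ constraint.

```python
from itertools import product

# Attribute order: Sky, AirTemp, Humidity, Wind, Water, Forecast
DOMAINS = (
    ("Sunny", "Rainy", "Cloudy"),
    ("Warm", "Cold"),
    ("Normal", "High"),
    ("Strong", "Weak"),
    ("Warm", "Cool"),
    ("Same", "Change"),
)

TRAINING = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

def satisfies(x, h):
    return all(c == "?" or c == v for c, v in zip(h, x))

def list_then_eliminate(domains, examples):
    # Every slot may hold a concrete value, "?" or None (no value acceptable).
    version_space = list(product(*[d + ("?", None) for d in domains]))
    for x, label in examples:
        # Keep only the hypotheses with h(x) = c(x).
        version_space = [h for h in version_space if satisfies(x, h) == label]
    return version_space

for h in list_then_eliminate(DOMAINS, TRAINING):
    print(h)
```

Running it leaves 6 hypotheses. Enumeration works here only because H is tiny; the cost grows with the product of the attribute domain sizes.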
The List-Then-Eliminate (LTE) Algorithm
LTE can be applied whenever the hypothesis space is finite (not always the case).
It has the advantage of simplicity and the fact that it will always work (it is guaranteed to output all the hypotheses consistent with the training data).

However, enumerating all the hypotheses in H is unrealistic for all but the most trivial cases.
We need a more compact representation.
Compact Representation of a Version Space
Recall that in our previous example, Find-S found the hypothesis:
h = <Sunny, Warm, ?, Strong, ?, ?>
This is only one of 6 possible hypotheses that are consistent with the training examples.
We can illustrate the 6 possible hypotheses in the next diagram.
Compact Representation of a Version Space
[Figure: the 6 hypotheses of the version space, numbered 1 to 6, from the most specific to the most general. Each arrow represents the $\ge_g$ relation.]
Given only the 2 sets S and G, we can generate all the hypotheses in between.

Try it!
Compact Representation of a Version Space
Intuitively we see that by having these general and specific boundaries we can generate the whole version space (check the sketch proof in Mitchell).
A few definitions, where $>_g$ means "strictly more general than".
Defn: the general boundary G (remember that G is a set) with respect to a hypothesis space H and training data D is the set of maximally general members of H consistent with D.

$G \equiv \{g \in H \mid Consistent(g, D) \land (\neg\exists g' \in H)[(g' >_g g) \land Consistent(g', D)]\}$

Defn: the specific boundary S with respect to a hypothesis space H and training data D is the set of minimally general (maximally specific) members of H consistent with D.

$S \equiv \{s \in H \mid Consistent(s, D) \land (\neg\exists s' \in H)[(s >_g s') \land Consistent(s', D)]\}$
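The two boundary definitions can be applied directly to an explicitly listed version space. The following sketch assumes the 6-hypothesis version space of the swimming example and uses a syntactic generality test that is valid only for hypotheses without the ∅ constraint; all names are illustrative.

```python
VERSION_SPACE = [
    ("Sunny", "Warm", "?", "Strong", "?", "?"),
    ("Sunny", "?", "?", "Strong", "?", "?"),
    ("Sunny", "Warm", "?", "?", "?", "?"),
    ("?", "Warm", "?", "Strong", "?", "?"),
    ("Sunny", "?", "?", "?", "?", "?"),
    ("?", "Warm", "?", "?", "?", "?"),
]

def more_general_or_equal(hj, hk):
    # Slot by slot, hj must be "?" or carry the same constraint as hk.
    return all(a == "?" or a == b for a, b in zip(hj, hk))

def strictly_more_general(hj, hk):
    return more_general_or_equal(hj, hk) and not more_general_or_equal(hk, hj)

# G: members with no strictly more general member in the version space.
G = [g for g in VERSION_SPACE
     if not any(strictly_more_general(other, g) for other in VERSION_SPACE)]

# S: members that are not strictly more general than any other member.
S = [s for s in VERSION_SPACE
     if not any(strictly_more_general(s, other) for other in VERSION_SPACE)]

print("S =", S)
print("G =", G)
```

Every member of the version space then lies between some member of S and some member of G, which is what makes the pair (S, G) a compact representation.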
(Back to) The Candidate Elimination (CE) Algorithm
CE computes the version space containing all the hypotheses in H that are consistent with the observed D.
First we initialise G (remember it is a set) to contain the most general hypothesis possible.
$G_0$ = {<?, ?, ?, ?, ?, ?>}
Then we initialise S to contain the most specific hypothesis possible.
$S_0$ = {<∅, ∅, ∅, ∅, ∅, ∅>}
The Candidate Elimination (CE) Algorithm
So far the two boundaries delimit the whole hypothesis space (every h in H is between $G_0$ and $S_0$).
As each training example is considered, the boundary sets S and G are generalised and specialised respectively, to eliminate from the version space any hypothesis in H that is inconsistent.
At the end we'll end up with the correct boundary sets.
The Candidate Elimination (CE) Algorithm
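A compact Python sketch of candidate elimination for conjunctive hypotheses is given below. It follows the general description in Mitchell but is an illustration only: all function names are assumptions, `None` stands in for ∅, and the generality test is syntactic.

```python
DOMAINS = (("Sunny", "Rainy", "Cloudy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change"))

TRAINING = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

def satisfies(x, h):
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general_or_equal(hj, hk):
    if None in hk:          # hk covers nothing, so anything is at least as general
        return True
    return all(a == "?" or a == b for a, b in zip(hj, hk))

def min_generalisation(s, x):
    # The unique minimal generalisation of conjunctive s that covers x.
    return tuple(v if c is None else (c if c == v else "?") for c, v in zip(s, x))

def min_specialisations(g, x):
    # Replace one "?" by a value that differs from x, so that x is excluded.
    return [g[:i] + (v,) + g[i + 1:]
            for i, c in enumerate(g) if c == "?"
            for v in DOMAINS[i] if v != x[i]]

def candidate_elimination(examples):
    S = [(None,) * len(DOMAINS)]
    G = [("?",) * len(DOMAINS)]
    for x, positive in examples:
        if positive:
            G = [g for g in G if satisfies(x, g)]
            new_S = []
            for s in S:
                h = s if satisfies(x, s) else min_generalisation(s, x)
                if any(more_general_or_equal(g, h) for g in G):
                    new_S.append(h)
            # Drop any member that is more general than another member.
            S = [s for s in new_S
                 if not any(s != t and more_general_or_equal(s, t) for t in new_S)]
        else:
            S = [s for s in S if not satisfies(x, s)]
            new_G = []
            for g in G:
                candidates = min_specialisations(g, x) if satisfies(x, g) else [g]
                new_G += [h for h in candidates
                          if any(more_general_or_equal(h, s) for s in S)]
            # Drop any member that is more specific than another member.
            G = [g for g in new_G
                 if not any(g != t and more_general_or_equal(t, g) for t in new_G)]
    return S, G

S, G = candidate_elimination(TRAINING)
print("S =", S)
print("G =", G)
```

On the four training examples this yields one hypothesis in S and two in G, without ever listing the whole hypothesis space.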
The Candidate Elimination (CE) Algorithm
After init:
$S_0$ = {<∅, ∅, ∅, ∅, ∅, ∅>}
$G_0$ = {<?, ?, ?, ?, ?, ?>}
Consider the first training example:

| Sky | AirTemp | Humidity | Wind | Water | Forecast | IsGoodDay |
|---|---|---|---|---|---|---|
| Sunny | Warm | Normal | Strong | Warm | Same | Yes |

It is positive so:
The Candidate Elimination (CE) Algorithm
Training example: <Sunny, Warm, Normal, Strong, Warm, Same>, Yes.
Part 1: all hypotheses in G are consistent with the training example, so we don't remove anything.
Part 2:
There is only one s in S (<∅, ∅, ∅, ∅, ∅, ∅>), which is inconsistent:
Remove <∅, ∅, ∅, ∅, ∅, ∅> from S, leaving S = {}.
Add to S all minimal generalisations h of s, i.e. we add <Sunny, Warm, Normal, Strong, Warm, Same> to S.
Remove from S any hypothesis that is more general than any other hypothesis in S.
There is only 1 hypothesis in S so we do nothing.
So far we got:
$S_1$ = {<Sunny, Warm, Normal, Strong, Warm, Same>}
$G_1$ = {<?, ?, ?, ?, ?, ?>}
The Candidate Elimination (CE) Algorithm
Read the second training example. It is positive as well:
<Sunny, Warm, High, Strong, Warm, Same>, Yes.
Part 1: all hypotheses in G are consistent with the training example, so we don't remove anything.
Part 2:
There is only one s in S (<Sunny, Warm, Normal, Strong, Warm, Same>), which is inconsistent:
Remove it from S, leaving S = {}.
Add to S all minimal generalisations h of s, i.e. we add <Sunny, Warm, ?, Strong, Warm, Same> to S.
Remove from S any hypothesis that is more general than any other hypothesis in S.
There is only 1 hypothesis in S so we do nothing.
So far we got:
$S_2$ = {<Sunny, Warm, ?, Strong, Warm, Same>}
$G_2$ = {<?, ?, ?, ?, ?, ?>}
The Candidate Elimination (CE) Algorithm
We notice that the role of +ve training examples is to make the S boundary more general and the role of the -ve training examples is to make the G boundary more specific.
Consider the 3rd training example, which is -ve:

| Sky | AirTemp | Humidity | Wind | Water | Forecast | IsGoodDay |
|---|---|---|---|---|---|---|
| Rainy | Cold | High | Strong | Warm | Change | No |

Step 1: S contains <Sunny, Warm, ?, Strong, Warm, Same>, which is consistent because it labels the training example as NO, so we do nothing.
Step 2:
The only hypothesis in G that is not consistent with d is <?, ?, ?, ?, ?, ?> because it labels it as YES. Remove it, leaving G = {}.
Add to G all minimal specialisations of g.
Continued...
The Candidate Elimination (CE) Algorithm
Training example: <Rainy, Cold, High, Strong, Warm, Change>, No.
The g we removed is <?, ?, ?, ?, ?, ?>. All the minimal specialisations of it would be (remember we want to label the training example as NO):
Sky: <Sunny, ?, ?, ?, ?, ?>, <Cloudy, ?, ?, ?, ?, ?>
AirTemp: <?, Warm, ?, ?, ?, ?>
Humidity: <?, ?, Normal, ?, ?, ?>
Wind: <?, ?, ?, Weak, ?, ?>
Water: <?, ?, ?, ?, Cool, ?>
Forecast: <?, ?, ?, ?, ?, Same>
However, not all these minimal specialisations go into the new G.
Training example: <Rainy, Cold, High, Strong, Warm, Change>, No.
Only <Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?> and <?, ?, ?, ?, ?, Same> go into the new G.
<Cloudy, ?, ?, ?, ?, ?>, <?, ?, Normal, ?, ?, ?>, <?, ?, ?, Weak, ?, ?> and <?, ?, ?, ?, Cool, ?> are not part of the new G.
Why?
Because they are inconsistent with the previously encountered training examples (so far we saw training items 1 and 2):
<Cloudy, ?, ?, ?, ?, ?> is inconsistent with training items 1 and 2.
<?, ?, Normal, ?, ?, ?> is inconsistent with training item 2.
<?, ?, ?, Weak, ?, ?> is inconsistent with training items 1 and 2.
<?, ?, ?, ?, Cool, ?> is inconsistent with training items 1 and 2.
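The Candidate Elimination (CE) Algorithm

| Sky | AirTemp | Humidity | Wind | Water | Forecast | IsGoodDay |
|---|---|---|---|---|---|---|
| Sunny | Warm | Normal | Strong | Warm | Same | Yes |
| Sunny | Warm | High | Strong | Warm | Same | Yes |
| Rainy | Cold | High | Strong | Warm | Change | No |
| Sunny | Warm | High | Strong | Cool | Change | Yes |

So far (after the first 3 training items) we got:
$S_3$ = {<Sunny, Warm, ?, Strong, Warm, Same>}
$G_3$ = {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}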
The Candidate Elimination (CE) Algorithm
After processing the 4th training item, we get:
$S_4$ = {<Sunny, Warm, ?, Strong, ?, ?>}
$G_4$ = {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}
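The Candidate Elimination (CE) Algorithm
The entire version space derived from the boundaries is:
S: <Sunny, Warm, ?, Strong, ?, ?>
In between: <Sunny, ?, ?, Strong, ?, ?>, <Sunny, Warm, ?, ?, ?, ?>, <?, Warm, ?, Strong, ?, ?>
G: <Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>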
Converging
CE will converge towards the target concept if:
There are no errors in the training data.
The target concept is in H.
The target concept is exactly learnt when S and G converge to a single, identical hypothesis.
If the training data contains errors, e.g. a +ve example is incorrectly labelled as -ve:
The algorithm will remove the correct target concept from the version space.
Eventually, if we are presented with enough training data, we will detect an inconsistency because the G and S boundaries will converge to an empty version space (i.e. there is no hypothesis in H that is consistent with all the training examples).
Requesting Training Examples
So far, our algorithm was given a set
containing labeled training data.
Suppose that our algorithm can come up with
an instance and ask (query) an external oracle
to label it.
What instance should the algorithm come up
with for an answer from the oracle?
Requesting Training Examples
Consider the version space we got from the 4 fixed training examples we had.
What training example would we like to have to further refine it?
We should come up with an instance that will be classified as +ve by some hypotheses and -ve by others, to further reduce the size of the version space.
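Requesting Training Examples
Suppose we request the training example:
<Sunny, Warm, Normal, Weak, Warm, Same>
3 hypotheses would classify it as +ve and 3 would classify it as -ve:
+ve: <Sunny, Warm, ?, ?, ?, ?>, <Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>
-ve: <Sunny, Warm, ?, Strong, ?, ?>, <Sunny, ?, ?, Strong, ?, ?>, <?, Warm, ?, Strong, ?, ?>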
Requesting Training Examples
So if we ask the oracle to classify:
<Sunny, Warm, Normal, Weak, Warm, Same>
we'd either generalise the S boundary or specialise the G boundary, and shrink the size of the version space (make it converge).
In general, the optimal instance we'd like the oracle to classify (the best training example to have next) is the one that would halve the size of the version space.
If we always have this option, we can converge to the target concept in about $\log_2 |VS|$ queries.
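Choosing a query that splits the version space as evenly as possible can be sketched as follows. This assumes the 6-hypothesis version space of the swimming example; `positive_votes` and `best_query` are illustrative names, and the search is a brute-force scan over the 96 instances.

```python
from itertools import product

DOMAINS = (("Sunny", "Rainy", "Cloudy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change"))

VERSION_SPACE = [
    ("Sunny", "Warm", "?", "Strong", "?", "?"),
    ("Sunny", "?", "?", "Strong", "?", "?"),
    ("Sunny", "Warm", "?", "?", "?", "?"),
    ("?", "Warm", "?", "Strong", "?", "?"),
    ("Sunny", "?", "?", "?", "?", "?"),
    ("?", "Warm", "?", "?", "?", "?"),
]

def satisfies(x, h):
    return all(c == "?" or c == v for c, v in zip(h, x))

def positive_votes(x):
    """Number of hypotheses in the version space that classify x as +ve."""
    return sum(satisfies(x, h) for h in VERSION_SPACE)

def best_query():
    """An instance whose +ve vote count is closest to half the version space."""
    half = len(VERSION_SPACE) / 2
    return min(product(*DOMAINS), key=lambda x: abs(positive_votes(x) - half))

query = ("Sunny", "Warm", "Normal", "Weak", "Warm", "Same")
print(positive_votes(query))  # 3 of the 6 hypotheses say +ve
print(best_query())           # some instance that also splits the space 3/3
```

Whatever label the oracle returns for such an instance, half of the version space is eliminated.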
Partially Learned Concepts
Partially learned = we didn't converge to the target concept (S and G are not the same).
Our previous example is a partially learned concept:
Partially Learned Concepts
It is possible to classify unseen examples with a degree of certainty.
Suppose we want to classify the instance (not in the training data)

<Sunny, Warm, Normal, Strong, Cool, Change>

using our partially learned concept.
Notice that every hypothesis in the version space classifies the unseen instance as +ve.

So the instance is classified as +ve with the same confidence as if only the target concept had remained (i.e. as if we had converged).
Partially Learned Concepts
| Sky | AirTemp | Humidity | Wind | Water | Forecast | Classification |
|---|---|---|---|---|---|---|
| Sunny | Warm | Normal | Strong | Cool | Change | All hypotheses in the version space classify as +ve. |
| Rainy | Cold | Normal | Weak | Warm | Same | All hypotheses in the version space classify as -ve. |
| Sunny | Warm | Normal | Weak | Warm | Same | 50/50. Need more training examples. Note: this is an optimal query to request from an oracle. |
| Sunny | Cold | Normal | Strong | Warm | Same | 2 +ve, 4 -ve. Possibly take a majority vote and output a confidence level. |
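Classification by voting over a partially learned concept can be sketched like this. The version space is the 6-hypothesis one from the swimming example; the `classify` function and its return convention (a label or `None`, plus the +ve vote count) are assumptions made for illustration.

```python
VERSION_SPACE = [
    ("Sunny", "Warm", "?", "Strong", "?", "?"),
    ("Sunny", "?", "?", "Strong", "?", "?"),
    ("Sunny", "Warm", "?", "?", "?", "?"),
    ("?", "Warm", "?", "Strong", "?", "?"),
    ("Sunny", "?", "?", "?", "?", "?"),
    ("?", "Warm", "?", "?", "?", "?"),
]

def satisfies(x, h):
    return all(c == "?" or c == v for c, v in zip(h, x))

def classify(x):
    """Return (label, positive_votes); label is None unless the vote is unanimous."""
    votes = sum(satisfies(x, h) for h in VERSION_SPACE)
    if votes == len(VERSION_SPACE):
        return "+ve", votes
    if votes == 0:
        return "-ve", votes
    return None, votes  # hypotheses disagree: cannot tell from the training data

unseen = [
    ("Sunny", "Warm", "Normal", "Strong", "Cool", "Change"),
    ("Rainy", "Cold", "Normal", "Weak", "Warm", "Same"),
    ("Sunny", "Warm", "Normal", "Weak", "Warm", "Same"),
    ("Sunny", "Cold", "Normal", "Strong", "Warm", "Same"),
]
for x in unseen:
    print(x, classify(x))
```

The vote count returned alongside the label is what a majority-vote variant would turn into a confidence level.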
Inductive Bias
Recall that our system, so far, works assuming the target concept exists in our hypothesis space.
Also recall that our hypothesis space allowed only for conjunctions (AND) of attribute values.
There is no way to allow for a disjunction of values: we cannot say "Sky=Cloudy OR Sky=Sunny".
Consider what would happen if, in fact, I like swimming if it is cloudy or sunny. I'd get something like:
Inductive Bias

| Sky | AirTemp | Humidity | Wind | Water | Forecast | IsGoodDay |
|---|---|---|---|---|---|---|
| Sunny | Warm | Normal | Strong | Cool | Change | Yes |
| Cloudy | Warm | Normal | Strong | Cool | Change | Yes |
| Rainy | Warm | Normal | Strong | Cool | Change | No |

CE will converge to an empty version space: the target concept is not in the hypothesis space. To see why:
The most specific hypothesis that classifies the first two examples as +ve is:

<?, Warm, Normal, Strong, Cool, Change>

Although it is maximally specific for the first 2 examples, it is already too general: it will classify the 3rd example as +ve too.
The problem is that we biased our learner to consider only hypotheses that are conjunctions.
Unbiased Learning
Let's see what happens if, to make sure that the target concept definitely exists in the hypothesis space, we define the hypothesis space to contain every possible concept.
This means that it is possible to represent every possible subset of X.
In our previous example (containing 6 attributes), the size of the instance space is 96.
How many possible concepts can be defined over this set of instances?
The power set!
Recall that the size of a power set (in general) is $2^{|X|}$.
So there are $2^{96}$ (ouch, about $7.9 \times 10^{28}$) possible concepts that can be learnt from our instance space.
We had seen that by introducing ? and ∅, we allowed for 973 possible concepts, which is vastly fewer than $2^{96}$ (we had a very strong bias).
Unbiased Learning
Let's define a new hypothesis space that can represent every subset of instances, i.e.

$H = \mathcal{P}(X)$ (the power set of X).

To do this we let H allow for any combination of disjunctions, conjunctions and negations. E.g. the target concept "Sky=Sunny OR Sky=Cloudy" would be:
<Sunny, ?, ?, ?, ?, ?> ∨ <Cloudy, ?, ?, ?, ?, ?>

So we can use CE knowing that our target concept will definitely exist in the hypothesis space. But...
we create a new problem: our learner will learn how to classify exactly the instances presented as training examples, and will not generalise beyond them!
Unbiased Learning
To see why, suppose I have 5 training examples $d_1, d_2, d_3, d_4, d_5$, where $d_1, d_2, d_3$ are +ve examples and $d_4, d_5$ are -ve examples.
The S boundary will become a disjunction of the +ve examples (since it is the most specific possible hypothesis that covers the examples):
$S = \{(d_1 \lor d_2 \lor d_3)\}$
The G boundary will become a negation (rule out) of the negative training examples:
$G = \{\neg(d_4 \lor d_5)\}$
So the only unambiguously classifiable instances are those that were provided as training examples.
Unbiased Learning
What would happen if we use the partially learned concept and take a vote?
Instances that were originally in the training data will be classified unambiguously (obviously).
Any other instance not in the training data will be classified as +ve by half of the hypotheses in the version space and as -ve by the other half.
To see why:
Note that H is the power set of X.
Let x be some unobserved instance (not in the training data).
Then there is some h in the version space that covers x.
But there is also a corresponding h' in the version space that is identical to h except for its classification of x (h' does not cover x).
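The 50/50 split can be observed on a toy problem. The sketch below is an illustration under assumed names: it uses a hypothetical instance space of 5 abstract instances, represents each hypothesis in the power set as the set of instances it classifies as +ve, and counts votes in the resulting version space.

```python
from itertools import combinations

def powerset(items):
    """All subsets of items, each as a frozenset."""
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

# Hypothetical toy instance space; x4 and x5 are never seen in training.
X = ["x1", "x2", "x3", "x4", "x5"]
positives = {"x1", "x2"}
negatives = {"x3"}

H = powerset(X)  # unbiased hypothesis space: every subset of X

# A hypothesis is consistent iff it contains all +ve and no -ve examples.
version_space = [h for h in H if positives <= h and not (negatives & h)]

# For each instance, count the hypotheses that classify it as +ve.
votes = {x: sum(x in h for h in version_space) for x in X}

print(len(H), len(version_space))  # 32 4
print(votes)  # {'x1': 4, 'x2': 4, 'x3': 0, 'x4': 2, 'x5': 2}
```

The training instances get unanimous votes, while each unseen instance is covered by exactly half of the version space, so the vote carries no information.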
More on Bias
Straight from Mitchell:
"A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances."
(In fact, CE worked because we biased it with the assumption that the target concept can be represented by a conjunction of attribute values.)
More on Bias
Consider:
L = a learning algorithm.
c = some target concept.
$D_c = \{\langle x, c(x) \rangle\}$ = a set of training data for c.
$x_i$ = some instance we wish to classify.
$L(x_i, D_c)$ = the classification (+ve/-ve) that L assigns to $x_i$ after learning from the training data $D_c$.
The inductive inference step is:
$(D_c \land x_i) \succ L(x_i, D_c)$
where $a \succ b$ denotes that b is inductively inferred from a.
So the inductive inference step reads: "given the training data $D_c$ and the instance $x_i$ as inputs to L, we can inductively infer the classification of the instance".
More on Bias

| Sky | AirTemp | Humidity | Wind | Water | Forecast | IsGoodDay |
|---|---|---|---|---|---|---|
| Sunny | Warm | Normal | Strong | Warm | Same | Yes |
| Sunny | Warm | High | Strong | Warm | Same | Yes |
| Rainy | Cold | High | Strong | Warm | Change | No |
| Sunny | Warm | High | Strong | Cool | Change | Yes |

$(D_c \land x_i) \succ L(x_i, D_c)$
More on Bias
Because L is an inductive learning algorithm, in general we cannot prove that the result $L(x_i, D_c)$ is correct, i.e. the classification of the example does not necessarily follow deductively from the training data (it cannot be proven).
However, we can add a number of assumptions to our system so that the classification would follow deductively.
The inductive bias of L is defined as these assumptions.
More on Bias
Let B = these assumptions (e.g. the hypothesis space is made up only of conjunctions of attribute values).
Then the inductive bias of L is B, giving:
$(B \land D_c \land x_i) \vdash L(x_i, D_c)$

where the notation $a \vdash b$ denotes that b follows deductively from a (b is provable from a).
Defn. of Inductive Bias
Consider a concept learning algorithm L for the set of instances X.

Let c be an arbitrary concept over X and let $D_c = \{\langle x, c(x) \rangle\}$ be an arbitrary set of training examples of c.

Let $L(x_i, D_c)$ denote the classification assigned to the instance $x_i$ by L after training on the data $D_c$.

The inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples $D_c$:

$(\forall x_i \in X)[(B \land D_c \land x_i) \vdash L(x_i, D_c)]$
The Inductive Bias of CE
Let us specify what $L(x_i, D_c)$ means for CE (how classification works).
Given training data $D_c$, CE will compute the version space $VS_{H,D_c}$.
Then it will classify a new instance $x_i$ by taking a vote amongst the hypotheses in this version space.
A classification will be output (+ve or -ve) only if all the hypotheses in the version space unanimously agree. Otherwise no classification is output ("I can't tell from the training data").
The inductive bias of CE is that the target concept c is contained in the hypothesis space, i.e. $c \in H$.
Why?
The Inductive Bias of CE
1:
Notice that if we assume that $c \in H$, then it follows deductively (we can prove) that $c \in VS_{H,D_c}$.
2:
Recall that we defined the classification $L(x_i, D_c)$ to be a unanimous vote amongst all hypotheses in $VS_{H,D_c}$.
Thus, if L outputs the classification $L(x_i, D_c)$, then so does every hypothesis $h \in VS_{H,D_c}$, including the hypothesis $c \in VS_{H,D_c}$.
Therefore $c(x_i) = L(x_i, D_c)$.