
CS121

Introduction to Artificial Intelligence


Winter 2004

Homework #8
(Probabilistic Inference and Inductive Learning)
Out: 3/3/04 Due: 3/10/04
How to complete this HW: First copy this file; then insert your answers in the file
immediately below each question; finally submit the completed file to the submission webpage
https://www.stanford.edu/class/cs121/hwsub/hw8.fft before 3/10 at midnight.

Note on Honor Code: You must NOT look at previously published solutions of any of these
problems in preparing your answers. You may discuss these problems with other students in the
class (in fact, you are encouraged to do so) and/or look into other documents (books, web sites),
with the exception of published solutions, without taking any written or electronic notes. If you
have discussed any of the problems with other students, indicate their name(s) here:

Any intentional transgression of these rules will be considered an honor code violation.

General information: In the following, whenever an explanation is required (Why?), there will be no full credit without an explanation. Keep explanations short and to the point. Excessive verbosity will be penalized. If you have any doubt about how to interpret a question, either tell us in advance, so that we can help you understand the question, or tell us how you understand it in your returned solution.

Grading:

Problem #     I    II   III   IV   Total
Max. grade    5    5    5     5    20
Your grade

I. (5 points) Logic sentence A ⇒ B (read "A implies B") is defined as the equivalent of ¬A ∨ B (read "not A or B"). So, A ⇒ B is True if and only if either A is False or both A and B are True.

Consider the following three sentences: A ⇒ B, B ⇒ C, and A ⇒ C. Given a sentence S, Pr(S) denotes the probability that S is True. We are told that: Pr(A ⇒ B) = 0.7 and Pr(B ⇒ C) = 0.8. Given this information, we want to determine Pr(A ⇒ C). In fact, Pr(A ⇒ C) is not uniquely determined by the given information. Instead, it can only range over a sub-interval of [0,1]. The goal of this problem is to determine this interval.
1. We define abc, ab, ac, ..., as follows (each variable is the probability of the conjunction in which the missing letters appear negated):
abc = Pr(A ∧ B ∧ C)
ab = Pr(A ∧ B ∧ ¬C)
ac = Pr(A ∧ ¬B ∧ C)
...
a = Pr(A ∧ ¬B ∧ ¬C)
x = Pr(¬A ∧ ¬B ∧ ¬C)
Express Pr(A ⇒ B), Pr(B ⇒ C), and Pr(A ⇒ C) using abc, ab, ac, bc, a, b, c, and x.
2. Express all constraints on abc, ab, ac, bc, a, b, c, and x, using the axioms of probability and the given information.
3. Derive from these constraints the range of values of Pr(A ⇒ C).
1. We have:
Pr(A ⇒ B) = Pr(¬A ∨ B).
We also have: ¬A ∨ B ≡ (A∧B∧C) ∨ (A∧B∧¬C) ∨ (¬A∧B∧C) ∨ (¬A∧B∧¬C) ∨ (¬A∧¬B∧C) ∨ (¬A∧¬B∧¬C).
Hence:
Pr(A ⇒ B) = abc + ab + bc + b + c + x.
Similarly:
Pr(B ⇒ C) = abc + bc + ac + c + a + x.
Pr(A ⇒ C) = abc + ac + bc + c + b + x.

2. Each of abc, ab, ac, bc, a, b, c, and x is a real number that is greater than or equal to 0 and less than or equal to 1. Moreover:
abc + ab + ac + bc + a + b + c + x = 1.
abc + ab + bc + b + c + x = 0.7   (equ. 1)
ac + a = 0.3   (equ. 2)
abc + bc + ac + c + a + x = 0.8   (equ. 3)
ab + b = 0.2   (equ. 4)
Equ. 2 and 4 follow from the fact that all eight atoms sum to 1, combined with equ. 1 and equ. 3, respectively. They are used below.
3. Pr(A ⇒ C) = abc + ac + bc + c + b + x.
By subtracting equ. 4 from equ. 1, we get:
abc + bc + c + x = 0.5.
So:
Pr(A ⇒ C) = ac + b + 0.5.

Pr(A ⇒ C) takes its smallest value when ac + b is minimal, and its largest value when ac + b is maximal. Equ. 1-4 allow ac and b to both be 0 (their minimal values), and they constrain ac to be at most 0.3 (equ. 2) and b to be at most 0.2 (equ. 4); these two maxima can be attained simultaneously, so ac + b ranges from 0 to 0.5. Therefore, we have:
0.5 ≤ Pr(A ⇒ C) ≤ 1.
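
As an optional sanity check (not part of the required answer), the constraints above form a small linear program over the eight atoms, and Pr(A ⇒ C) can be minimized and maximized numerically. The Python sketch below assumes SciPy is available; the variable names mirror those of question 1.

from scipy.optimize import linprog

# Atom probabilities, in the order [abc, ab, ac, bc, a, b, c, x].
names = ["abc", "ab", "ac", "bc", "a", "b", "c", "x"]

def row(*included):
    # 0/1 indicator row selecting the atoms whose sum appears in a constraint.
    return [1.0 if n in included else 0.0 for n in names]

A_eq = [row(*names),                              # all eight atoms sum to 1
        row("abc", "ab", "bc", "b", "c", "x"),    # Pr(A => B) = 0.7 (equ. 1)
        row("abc", "bc", "ac", "c", "a", "x")]    # Pr(B => C) = 0.8 (equ. 3)
b_eq = [1.0, 0.7, 0.8]

objective = row("abc", "ac", "bc", "c", "b", "x")  # Pr(A => C)
bounds = [(0.0, 1.0)] * len(names)

low = linprog(objective, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
high = linprog([-v for v in objective], A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(low.fun, -high.fun)   # expected output: 0.5 and 1.0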

II. (5 points) Suppose we generate a training set from a decision tree and then apply decision-tree learning to that training set. Is it the case that the learning algorithm will eventually return the correct tree as the training set goes to infinity? Why or why not?
The algorithm may not return the same tree, but it will return a tree that is logically equivalent, assuming that the method for generating the training set eventually generates all possible combinations of attributes. For example, if the method picks the value of each attribute uniformly at random, the probability that it generates all possible combinations goes to one as the size of the training set goes to infinity. The form of the tree may not be the same, because there are multiple ways of representing the same logical function, as the sketch below illustrates.
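
Here is a minimal Python sketch (illustrative only, not part of the answer) of two structurally different decision trees that compute the same Boolean function, A xor B, and therefore agree on every possible example:

from itertools import product

def tree1(a, b):
    # Tests attribute A at the root, then B.
    if a:
        return not b
    return b

def tree2(a, b):
    # Tests attribute B at the root, then A: a different tree shape.
    if b:
        return not a
    return a

# The two trees disagree on no input, so they are logically equivalent.
assert all(tree1(a, b) == tree2(a, b) for a, b in product([False, True], repeat=2))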

III. (5 points) Most likely, at the beginning of an exam, you try to predict whether each problem is easy or difficult (D = + if it is difficult and − if it is easy). Let us assume that you use two observable problem attributes (predicates):
The text length L of the problem (L = 1 if it is long, 0 otherwise)
The amount M of math in the text (M = 1 if there is a lot of math, 0 otherwise)
For training data, assume that you have examined 12 previous problems from the homeworks, and have collected the following data:
L   M   D   #
0   0   −   4
0   0   +   1
0   1   −   0
0   1   +   3
1   0   −   1
1   0   +   2
1   1   −   1
1   1   +   0

The first line of this table reads as follows: 4 problems for which L = 0 and M = 0 were not difficult (D = −). The second line says: 1 problem for which L = 0 and M = 0 was difficult (D = +). Etc.
Note that you observed no easy problem (D = −) for which L = 0 and M = 1, and no difficult problem (D = +) for which L = 1 and M = 1.
Based on this training data, you want to compute a representation of a difficult problem (D) in the
form of a decision tree using the two binary attributes L and M.
1. Construct the best decision tree you can for the training data. [As presented in class, choose the best attribute to test next by minimizing the number of misclassified examples in the training set. In addition, use no stopping criterion or pruning, so that the tree is grown until you run out of attributes; then, if needed, use majority rule.]
Deciding which attribute to test first, between L and M, we note that L has a 0.5 error rate, since for both L=0 and L=1 the training examples are equally distributed between + and −. M has a (4/12)(1/4) + (8/12)(3/8) = 1/3 error rate, so we choose to split on M first.
On M's 1 branch, L=1 has one − and no +'s, and L=0 has no −'s and three +'s, so it is easy to see that we split on L and assign + to the L=0 branch and − to the L=1 branch.
On M's 0 branch, L=1 has one − and two +'s, while L=0 has four −'s and one +. So, to minimize error, we split on L and assign + to the L=1 branch and − to the L=0 branch.
The final tree is:
(M (0 L (0[−] 1[+]))
   (1 L (0[+] 1[−])))
2. Suppose we have 3 test examples, whose correct classification we know, as follows:

L   M   D
0   0   −
1   1   +
0   1   +

Does the decision tree agree with each of these?


The tree gets the second example wrong, but gets the first and third examples right.
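
Both the error-rate arithmetic used to pick the first split and the predictions on the three test examples can be checked with the following Python sketch (a plain restatement of the tables above, not an additional requirement):

# Training counts from the table: (L, M, D) -> number of examples.
counts = {(0, 0, "-"): 4, (0, 0, "+"): 1,
          (0, 1, "-"): 0, (0, 1, "+"): 3,
          (1, 0, "-"): 1, (1, 0, "+"): 2,
          (1, 1, "-"): 1, (1, 1, "+"): 0}

def split_error(attr):
    # Fraction of examples misclassified if we test only this attribute
    # (0 = L, 1 = M) and label each branch by majority vote.
    errors = 0
    for value in (0, 1):
        branch = {"+": 0, "-": 0}
        for (l, m, d), n in counts.items():
            if (l, m)[attr] == value:
                branch[d] += n
        errors += min(branch.values())   # the minority class is misclassified
    return errors / sum(counts.values())

print(split_error(0), split_error(1))    # 0.5 for L, 1/3 for M

def predict(l, m):
    # The final tree: test M first, then L.
    if m == 1:
        return "+" if l == 0 else "-"
    return "+" if l == 1 else "-"

# The three test examples from question 2 (the second one is misclassified).
for l, m, d in [(0, 0, "-"), (1, 1, "+"), (0, 1, "+")]:
    print((l, m), "predicted:", predict(l, m), "actual:", d)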

IV. (5 points) A program uses the version space method to determine the concept of a rewarded card. It is assumed that the concept to be learned is of the form:
REWARDED(x) ⇔ r(x) ∧ s(x)
where r is a rank predicate and s a suit predicate.
A rank predicate is one of the following: 1, 2, ..., 10, J, Q, K, NUM (1, 2, ..., 10), FACE (J, Q, K), ODD (1, 3, 5, ..., 9, J, K), EVEN (2, 4, ..., 10, Q), HONOR (1, J, Q, K), TEN-VALUE (10, J, Q, K), ANY-RANK (1, ..., K).
A suit predicate is one of the following: ♠, ♥, ♦, ♣, BLACK, RED, MAJOR (that is, ♠ or ♥), MINOR (that is, ♦ or ♣), ANY-SUIT.

For simplification we represent the concept REWARDED by such expressions as (FACE, BLACK), (TEN-VALUE, MAJOR), or (ACE, ANY-SUIT).
1. Draw the directed acyclic graphs (one for ranks and one for suits) that represent the hierarchy of generality of the predicates.
2. Give the specific and general boundaries (respectively, the S-boundary and the G-boundary) of the version space after each of the following three series of examples (each example is a card followed by + if the card is a rewarded card and − otherwise):
a. (K +)(10 -)(8 +)
b. (8 +)(J +)
c. (8 -)(3 -)(2 -)(9 -)(10 -)(1 -)(J +)(Q +)
1.
Suit hierarchy (DAG): ANY-SUIT is the most general node; BLACK, RED, MAJOR, and MINOR lie directly below it; each individual suit lies below the two groups that contain it (♠ below BLACK and MAJOR, ♥ below RED and MAJOR, ♦ below RED and MINOR, ♣ below BLACK and MINOR).

Rank hierarchy (DAG): ANY-RANK is the most general node; NUM, ODD, EVEN, HONOR, and TEN-VALUE lie directly below it; FACE lies below both HONOR and TEN-VALUE; each individual rank lies below the predicates that contain it (1 below NUM, ODD, and HONOR; 2, 4, 6, 8 below NUM and EVEN; 3, 5, 7, 9 below NUM and ODD; 10 below NUM, EVEN, and TEN-VALUE; J and K below FACE and ODD; Q below FACE and EVEN).
2.
a. G = (ANY-RANK, MINOR) and S = (ANY-RANK, CLUB)
b. G = (ANY-RANK, ANY-SUIT) and S = (ANY-RANK, ANY-SUIT)
c. G = (FACE, ANY-SUIT) and S = (FACE, MAJOR)
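
For intuition (not part of the required answer), here is a minimal Python sketch of how the S-boundary of this conjunctive language can be computed: each predicate is represented by its extension, and for each attribute we take the smallest predicate that covers every positive example. The two positive cards used at the end (king of clubs and 8 of clubs) are hypothetical, chosen only to illustrate an S-boundary of the form (ANY-RANK, CLUB); computing the G-boundary would additionally require specializing ANY-RANK and ANY-SUIT just enough to exclude the negative examples, which is omitted here.

# Extensions of the rank and suit predicates, written as Python sets.
RANK_PREDS = {
    "NUM": set(range(1, 11)),
    "FACE": {"J", "Q", "K"},
    "ODD": {1, 3, 5, 7, 9, "J", "K"},
    "EVEN": {2, 4, 6, 8, 10, "Q"},
    "HONOR": {1, "J", "Q", "K"},
    "TEN-VALUE": {10, "J", "Q", "K"},
    "ANY-RANK": set(range(1, 11)) | {"J", "Q", "K"},
}
RANK_PREDS.update({str(r): {r} for r in range(1, 11)})    # single ranks 1..10
RANK_PREDS.update({r: {r} for r in ("J", "Q", "K")})      # single ranks J, Q, K

SUIT_PREDS = {
    "SPADE": {"S"}, "HEART": {"H"}, "DIAMOND": {"D"}, "CLUB": {"C"},
    "BLACK": {"S", "C"}, "RED": {"H", "D"},
    "MAJOR": {"S", "H"}, "MINOR": {"D", "C"},
    "ANY-SUIT": {"S", "H", "D", "C"},
}

def most_specific(predicates, observed):
    # Smallest predicate whose extension covers every observed positive value:
    # the S-boundary element for one attribute of the conjunctive concept.
    covering = [p for p, ext in predicates.items() if observed <= ext]
    return min(covering, key=lambda p: len(predicates[p]))

# Hypothetical positive examples: king of clubs and 8 of clubs.
positive_ranks = {"K", 8}
positive_suits = {"C"}
print(most_specific(RANK_PREDS, positive_ranks),   # -> ANY-RANK
      most_specific(SUIT_PREDS, positive_suits))   # -> CLUB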
