
!"#$%& !

"()*$+,
Aarti Singh & Barnabas Poczos


Machine Learning 10-701/15-781
Apr 24, 2014

Slides Courtesy: Tom Mitchell
Logistic Regression
!""#$%" '(% )*++*,-./ )#.01*.2+ )*3$ )*3 456789:
Logistic
function
(or Sigmoid):
;*/-"10 )#.01*. 2<<+-%= '* 2 +-.%23
)#.01*. *) '(% =2'2
[Figure: the logistic (sigmoid) function, logit(z), plotted as a function of z]
Logistic Regression is a Linear Classifier!
!""#$%" '(% )*++*,-./ )#.01*.2+ )*3$ )*3 456789:




>%0-"-*. ?*#.=23@:
(Linear Decision Boundary)
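A short sketch of why the boundary is linear: because the sigmoid is monotone, predicting the more probable class reduces to thresholding the linear score (same hypothetical convention as the sketch above):

```python
import numpy as np

def predict_label(x, w, w0):
    """Predict the more probable class under the logistic model above.

    P(Y=1|x) >= P(Y=0|x)  iff  sigmoid(w0 + w.x) >= 0.5  iff  w0 + w.x >= 0,
    so the decision boundary {x : w0 + w.x = 0} is a hyperplane: a linear classifier.
    """
    return 1 if w0 + np.dot(w, x) >= 0 else 0
```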
Training Logistic Regression
How to learn the parameters w_0, w_1, ..., w_d?
Training Data

Maximum (Conditional) Likelihood Estimates



>-"03-$-.21G% <(-+*"*<(@ H >*.I' ,2"'% %J*3' +%23.-./ 4589K
)*0#" *. 456789 H '(2'I" 2++ '(2' $2L%3" )*3 0+2""-M021*.N

Optimizing convex function
Max Conditional log-likelihood = Min Negative Conditional log-likelihood

Negative Conditional log-likelihood is a convex function
Gradient Descent (convex)
Gradient:
Learning rate, η > 0
Update rule:
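As an illustration, a hedged sketch of one such gradient step for logistic regression; the design matrix X, labels y, and step size eta are placeholders, and the P(Y=1|x) convention matches the earlier sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(X, y, w, w0, eta=0.1):
    """One batch gradient-descent step on the negative conditional log-likelihood.

    X: (n, d) inputs, y: (n,) labels in {0, 1}.  With p_i = P(Y=1 | x_i),
    the gradient w.r.t. w is X^T (p - y), and the update is w <- w - eta * gradient.
    """
    p = sigmoid(w0 + X @ w)      # predicted P(Y=1 | x_i) for every training example
    grad_w = X.T @ (p - y)       # gradient w.r.t. the weight vector
    grad_w0 = np.sum(p - y)      # gradient w.r.t. the intercept
    return w - eta * grad_w, w0 - eta * grad_w0
```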
Logistic function as a Graph
Sigmoid Unit
!"#$%& !"()*$+, (* &"%$3 FH I ! J
) 02. ?% 2 3*3K&/3"%$ )#.01*.
8 5G%0'*3 *)9 0*.1.#*#" 2.=R*3 =-"03%'% G23-2?+%"
6 5D"1(*$ *)9 0*.1.#*#" 2.=R*3 =-"03%'% G23-2?+%"
Q%#32+ .%',*3E" O S%<3%"%.' ) ?@ !"#$%&' *) +*/-"10R"-/$*-=
#.-'":
Input layer, X
Output layer, Y
Hidden layer, H
Sigmoid Unit
Neural Network trained to distinguish vowel sounds using 2 formants (features)
Highly non-linear decision surface
Two layers of logistic units
Input layer
Hidden layer
Output layer
Neural Network trained to drive a car!
Weights of each pixel for one hidden unit
Weights to output units from the hidden unit
L$"@/10*3 #,/3. !"#$%& !"()*$+,
Prediction: Given a neural network (hidden units and weights), use it to predict
the label of a test point

Forward Propagation
Start from input layer
For each subsequent layer, compute output of sigmoid unit



Sigmoid unit:



1-Hidden layer, 1 output NN:


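A minimal sketch of forward propagation through a 1-hidden-layer, 1-output network of sigmoid units; the weight names V, c, w, b are hypothetical, not notation from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, V, c, w, b):
    """Forward propagation through a 1-hidden-layer, 1-output network.

    x: (d,) input;  V: (m, d) input-to-hidden weights, c: (m,) hidden biases;
    w: (m,) hidden-to-output weights, b: scalar output bias.
    Each layer applies a sigmoid unit to a linear function of the layer below.
    """
    h = sigmoid(V @ x + c)   # hidden-layer outputs
    o = sigmoid(w @ h + b)   # network output
    return o, h
```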
Consider regression problem f: X → Y, for scalar Y:
    y = f(x) + ε      where f(x) is deterministic, with noise ε ~ N(0, σ_ε), iid
M(C)LE Training for Neural Networks

Learned neural network

Let's maximize the conditional data likelihood
Train weights of all units to minimize sum of squared errors
of predicted network outputs
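Written out, the step from maximizing the conditional likelihood to minimizing squared error (a standard derivation under the Gaussian-noise assumption above; here \hat{f}(x; W) denotes the network output and y^l, x^l the l-th training example):

```latex
\begin{aligned}
W_{\text{MCLE}} &= \arg\max_W \prod_l P(y^l \mid x^l, W)
 = \arg\max_W \prod_l \exp\!\left(-\frac{\big(y^l - \hat{f}(x^l; W)\big)^2}{2\sigma_\varepsilon^2}\right)\\
 &= \arg\min_W \sum_l \big(y^l - \hat{f}(x^l; W)\big)^2
\end{aligned}
```

The Gaussian normalization constant drops out because it does not depend on W.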
Consider regression problem f: X → Y, for scalar Y:
    y = f(x) + ε      where f(x) is deterministic, with noise ε ~ N(0, σ_ε)
MAP Training for Neural Networks

Gaussian prior: P(W) = N(0, σI)
ln P(W) ∝ c Σ_i w_i²
Train weights of all units to minimize sum of squared errors
of predicted network outputs plus weight magnitudes
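Spelled out, the Gaussian prior turns into a weight-decay penalty (a standard derivation; λ absorbs the variance constants):

```latex
\begin{aligned}
W_{\text{MAP}} &= \arg\max_W \left[\ln P(W) + \sum_l \ln P(y^l \mid x^l, W)\right]\\
 &= \arg\min_W \sum_l \big(y^l - \hat{f}(x^l; W)\big)^2 + \lambda \sum_i w_i^2
\end{aligned}
```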
E = Mean Square Error

For Neural Networks,
E[w] no longer convex in w
Differentiable
7$%/3/3. !"#$%& !"()*$+,
P$$*$ G$%@/"3( F*$ % R/.;*/@ S3/(
[Equations on slides: error gradient for a sigmoid unit, derived using all training data D (MLE); unit outputs are obtained by forward propagation.]
y_k = target output (label)
o_k/h = unit output (obtained by forward propagation)
w_ij = weight from i to j

Note: if i is an input variable, o_i = x_i
Objective/Error no longer convex in weights
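For concreteness, a sketch of one pass of incremental (stochastic) gradient descent with back-propagation for the 1-hidden-layer network from the forward-propagation sketch; the parameter names V, c, w, b and the step size eta are the same hypothetical ones as before:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_pass(X, y, V, c, w, b, eta=0.1):
    """One pass of incremental gradient descent on E^l = 1/2 (y^l - o^l)^2.

    For each training example: forward-propagate, then back-propagate the
    error, scaling by the sigmoid derivative o(1 - o) at each unit.
    """
    for x_l, y_l in zip(X, y):
        h = sigmoid(V @ x_l + c)              # forward propagation: hidden outputs
        o = sigmoid(w @ h + b)                # forward propagation: network output
        delta_o = (o - y_l) * o * (1 - o)     # output-unit error term
        delta_h = delta_o * w * h * (1 - h)   # hidden-unit error terms (back-propagated)
        w = w - eta * delta_o * h             # hidden-to-output weight update
        b = b - eta * delta_o
        V = V - eta * np.outer(delta_h, x_l)  # input-to-hidden weight updates
        c = c - eta * delta_h
    return V, c, w, b
```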
Dealing with Overfitting

Our learning algorithm involves a parameter
    n = number of gradient descent iterations
How do we choose n to optimize future error?
e.g. the n that minimizes error rate of neural net over future data
(note: similar issue for logistic regression, decision trees, …)
Dealing with Overfitting

Our learning algorithm involves a parameter
    n = number of gradient descent iterations
How do we choose n to optimize future error?

Separate available data into training and validation set
Use training to perform gradient descent
n ← number of iterations that optimizes validation set error
K-Fold Cross-Validation

Idea: train multiple times, leaving out a disjoint subset of data each time
for test. Average the test set accuracies.
________________________________________________
Partition data into K disjoint subsets
For k=1 to K
    testData = kth subset
    h ← classifier trained* on all data except for testData
    accuracy(k) = accuracy of h on testData
end
FinalAccuracy = mean of the K recorded testset accuracies
* might withhold some of this to choose number of gradient descent steps
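The same procedure as a Python sketch; train and accuracy are placeholder routines, and the folds are formed by a simple index split:

```python
import numpy as np

def k_fold_cv(data, K, train, accuracy):
    """Estimate accuracy by K-fold cross-validation.

    data: list of labeled examples; train(examples) returns a classifier h;
    accuracy(h, examples) returns the fraction of examples h labels correctly.
    """
    folds = np.array_split(np.arange(len(data)), K)   # K disjoint index subsets
    accs = []
    for k in range(K):
        test_idx = set(folds[k].tolist())
        test_data = [data[i] for i in range(len(data)) if i in test_idx]
        train_data = [data[i] for i in range(len(data)) if i not in test_idx]
        h = train(train_data)                 # train on all data except the k-th subset
        accs.append(accuracy(h, test_data))   # accuracy(k) = accuracy of h on testData
    return float(np.mean(accs))               # mean of the K recorded test-set accuracies
```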
Leave-One-Out Cross-Validation

This is just K-fold cross-validation with one example per fold, leaving out one example each iteration
________________________________________________
Partition data into K disjoint subsets, each containing one example
For k=1 to K
    testData = kth subset
    h ← classifier trained* on all data except for testData
    accuracy(k) = accuracy of h on testData
end
FinalAccuracy = mean of the K recorded testset accuracies
* might withhold some of this to choose number of gradient descent steps
Dealing with Overfitting

Cross-validation
Regularization → small weights imply NN is linear (low VC dimension)
Control number of hidden units → low complexity
"w
i
x
i
L
o
g
i
s
t
i
c

o
u
t
p
u
t

w
0
left

strt

right

up

Semantic Memory Model Based on ANNs
[McClelland & Rogers, Nature 2003]
No hierarchy given.
Train with assertions,
e.g., Can(Canary,Fly)
Humans act as though they have a hierarchical memory
organization


1. Victims of Semantic Dementia progressively lose knowledge of objects
But they lose specific details first, general properties later, suggesting
hierarchical memory organization

[Hierarchy figure: Thing → {Living, NonLiving}; Living → {Animal, Plant}; Animal → {Bird, Fish}; Bird → Canary]
2. Children appear to learn general categories and properties first, following
the same hierarchy, top down*

* some debate remains on this.
Memory deterioration follows semantic hierarchy
[McClelland & Rogers, Nature 2003]
Artificial Neural Networks: Summary

• Actively used to model distributed computation in brain
• Highly non-linear regression/classification
• Vector-valued inputs and outputs
• Potentially millions of parameters to estimate - overfitting
• Hidden layers learn intermediate representations - how many to use?
• Prediction - forward propagation
• Gradient descent (back-propagation), local minima problems
• Coming back in new form as deep belief networks (probabilistic interpretation)
