
UNIT 5: LEARNING FROM OBSERVATION - INDUCTIVE LEARNING - DECISION TREES - EXPLANATION BASED LEARNING - STATISTICAL LEARNING METHODS - REINFORCEMENT LEARNING

LEARNING FROM OBSERVATION


The idea behind learning is that percepts should be used not only for acting, but also for improving the agent's ability to act in the future.

Learning takes place as a result of the interaction between the agent and the world, and from observation by the agent of its own decision-making processes.

Learning Agent - Four Components
1. Performance Element: the collection of knowledge and procedures used to decide on the next action (e.g. walking, turning, drawing).
2. Learning Element: takes in feedback from the critic and modifies the performance element accordingly.
3. Critic: provides the learning element with information on how well the agent is doing, based on a fixed performance standard (e.g. the audience).
4. Problem Generator: provides the performance element with suggestions on new actions to take.

Designing a Learning Element
The design of the learning element depends on the design of the performance element. Four major issues affect the design:
1. Which components of the performance element to improve
2. The representation of those components
3. The available feedback
4. Prior knowledge

Components of the Performance Element
A direct mapping from conditions on the current state to actions
Information about the way the world evolves
Information about the results of possible actions the agent can take
Utility information indicating the desirability of world states

Representation
A component may be represented using different representation schemes. Details of the learning algorithm will differ depending on the representation, but the general idea is the same: functions are used to describe a component.

Feedback & Prior Knowledge
Supervised learning: inputs and outputs are available.
Reinforcement learning: an evaluation of the action is available.
Unsupervised learning: no hint of the correct outcome is given.
Background knowledge is a tremendous help in learning.

INDUCTIVE LEARNING (8 MARKS)
A form of supervised learning: the learning element is given the correct (or approximately correct) value of the function for particular inputs, and changes its representation of the function to try to match the information provided by the feedback.

Key idea: use specific examples to reach general conclusions. Given a set of examples, the system tries to approximate the evaluation function. This is also called pure inductive inference.
Example: a pair (x, f(x)), where x is the input and f(x) is the output of the function applied to x.
Hypothesis: a function h that approximates f, given a set of examples.

The task of induction: given a set of examples, find a function h that approximates the true evaluation function f. The figure shows a familiar example: fitting a function of a single variable to some data points. The examples are (x, f(x)) pairs, where both x and f(x) are real numbers. The hypothesis space H is taken to be the set of polynomials of degree at most k, such as 3x^2 + 2, x^17 - 4x^3, and so on. Figure (a) shows some data with an exact fit by a straight line (a polynomial of degree 1). The line is a consistent hypothesis because it agrees with all the data. Figure (b) shows a high-degree polynomial that is also consistent with the same data. ISSUE: how to choose from multiple consistent hypotheses? ANSWER: OCKHAM'S RAZOR - prefer the simplest hypothesis consistent with the data. Figure (c) shows a second data set.

There is no consistent straight line for this data set; an exact fit requires a degree-6 polynomial. Similarly, Figure (d) shows data that cannot be fit exactly by any expression from the given hypothesis space. So the simplest hypothesis that fits the data exactly, as in Figure (a), is selected as the solution.
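The following is a minimal Python sketch of this model-selection idea, assuming NumPy is available; the function name, the tolerance, and the sample data are illustrative and not from the notes:

```python
# A sketch of Ockham's razor applied to polynomial fitting: try hypotheses of
# increasing degree and keep the simplest one consistent with the data.
import numpy as np

def simplest_consistent_polynomial(x, y, max_degree=10, tol=1e-6):
    """Return (degree, coefficients) of the lowest-degree polynomial fitting (x, y)."""
    for degree in range(max_degree + 1):
        coeffs = np.polyfit(x, y, degree)                 # least-squares fit
        residual = np.max(np.abs(np.polyval(coeffs, x) - y))
        if residual < tol:                                # consistent hypothesis found
            return degree, coeffs
    return None                                           # no consistent hypothesis in H

# Points on a straight line are explained by degree 1, even though
# higher-degree polynomials would also fit them exactly.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
print(simplest_consistent_polynomial(x, y))               # -> (1, array([2., 1.]))
```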

LEARNING DECISION TREES (16 MARKS)
Decision tree induction is one of the simplest, and yet most successful, learning algorithms. It serves as a good introduction to inductive learning.

Decision trees as performance elements
A decision tree takes as input an object or situation described by a set of attributes and returns a decision: the predicted output value for the input. The input attributes can be discrete or continuous variables; here we assume discrete variables are taken as input. The output can also be discrete or continuous.

CLASSIFICATION LEARNING (2 marks): learning a discrete-valued function is called classification learning.
REGRESSION LEARNING (2 marks): learning a continuous function is called regression learning.
BOOLEAN CLASSIFICATION can be used, wherein each example is classified as true (positive) or false (negative).

A decision tree reaches its decision by performing a sequence of tests. Each internal node in the tree corresponds to a test of the value of one of the properties, and the branches from the node are labeled with the possible values of the test.

Each leaf node in the tree specifies the value to be returned if that leaf is reached.

Example Problem: decide whether to wait for a table at a restaurant, based on the following attributes. AIM: WillWait
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60 minutes)

Expressiveness of Decision Trees
Logically speaking, any particular decision tree hypothesis for the WillWait goal predicate can be seen as an assertion of the form

∀S WillWait(S) ⇔ (P1(S) V P2(S) V ... V Pn(S)),


where each condition Pi(S) is a conjunction of tests corresponding to a path from the root of the tree to a leaf with a positive outcome.

Although this looks like a first-order sentence, it is, in a sense, propositional. Decision trees are fully expressive within the class of propositional languages; that is, any Boolean function can be written as a decision tree. This can be done trivially by having each row in the truth table for the function correspond to a path in the tree. This would yield an exponentially large decision tree representation because the truth table has exponentially many rows. Decision trees can represent many (but not all) functions with much smaller trees.

Inducing decision trees from examples I An example for a Boolean decision tree consists of a vector of input attributes, X, and a single Boolean output value y. The complete set of examples is called the training set. In order to find a decision tree that agrees with the training set, we could simply construct a decision tree that has one path to a leaf for each example, where the path tests each attribute in turn and follows the value for the example and the leaf has the classification of the example. When given the same example again, the decision tree will come up with the right classification. Unfortunately, it will not have much to say about any other cases! Some Examples

Decision-Tree Learning Algorithm - I
The basic idea behind the Decision-Tree Learning Algorithm is to test the most important attribute first.

By "most important," we mean the one that makes the most difference to the classification of an example. That way, we hope to get to the correct classification with a small number of tests, meaning that all paths in the tree will be short and the tree as a whole will be small.

Decision-Tree Learning Algorithm - II

Decision-Tree Learning Algorithm - III

Decision-Tree Learning Algorithm - IV
In general, after the first attribute test splits up the examples, each outcome is a new decision tree learning problem in itself, with fewer examples and one fewer attribute. There are four cases to consider for these recursive problems:

1. If there are some positive and some negative examples, then choose the best attribute to split them. Figure 3(b) shows Hungry being used to split the remaining examples.
2. If all the remaining examples are positive (or all negative), then we are done: we can answer Yes or No. Figure 3(b) shows examples of this in the None and Some cases.
3. If there are no examples left, it means that no such example has been observed, and we return a default value calculated from the majority classification at the node's parent.
4. If there are no attributes left, but both positive and negative examples remain, we have a problem. It means that these examples have exactly the same description, but different classifications. This happens when some of the data are incorrect; we say there is noise in the data. It also happens either when the attributes do not give enough information to describe the situation fully, or when the domain is truly nondeterministic. One simple way out of the problem is to use a majority vote.

Decision-Tree Learning Algorithm
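The pseudocode figure from the original slide is not reproduced here; the following Python sketch captures the four-case recursion just described. The dictionary-based example representation, the hard-coded WillWait goal attribute, and the choose_attribute helper (defined in the information-gain sketch further below) are illustrative assumptions.

```python
from collections import Counter

def majority_value(examples):
    # Majority vote over the Boolean WillWait classifications of the examples.
    return Counter(e["WillWait"] for e in examples).most_common(1)[0][0]

def decision_tree_learning(examples, attributes, default):
    if not examples:                      # case 3: no examples left
        return default
    classes = {e["WillWait"] for e in examples}
    if len(classes) == 1:                 # case 2: all positive or all negative
        return classes.pop()
    if not attributes:                    # case 4: noise or insufficient attributes
        return majority_value(examples)
    best = choose_attribute(attributes, examples)   # case 1: split on the best attribute
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = decision_tree_learning(subset, remaining,
                                                   majority_value(examples))
    return tree
```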

A Decision Tree Induced From Examples

Choosing attribute tests
The scheme used in decision tree learning for selecting attributes is designed to minimize the depth of the final tree. The idea is to pick the attribute that goes as far as possible toward providing an exact classification of the examples. A perfect attribute divides the examples into sets that are all positive or all negative. The Patrons attribute is not perfect, but it is fairly good. A really useless attribute, such as Type, leaves the example sets with roughly the same proportion of positive and negative examples as the original set. All we need, then, is a formal measure of "fairly good" and "really useless". The measure should have its maximum value when the attribute is perfect and its minimum value when the attribute is of no use at all.

Amount of Information - I
One suitable measure is the expected amount of information provided by the attribute, where we use the term in the mathematical sense first defined by Shannon and Weaver (1949). Information theory measures information content in bits. One bit of information is enough to answer a yes/no question about which one has no prior idea, such as the flip of a fair coin.

Amount of Information - II
In general, if the possible answers vi have probabilities P(vi), then the information content I of the actual answer is given by:

I(P(v1), ..., P(vn)) = Σi -P(vi) log2 P(vi)

Amount of Information - III
For decision tree learning, the question that needs answering is: for a given example, what is the correct classification?

An estimate of the probabilities of the possible answers, before any of the attributes have been tested, is given by the proportions of positive and negative examples in the training set. Suppose the training set contains p positive examples and n negative examples. Then an estimate of the information contained in a correct answer is:

I(p/(p+n), n/(p+n)) = -(p/(p+n)) log2 (p/(p+n)) - (n/(p+n)) log2 (n/(p+n))

Amount of Information - IV
A test on a single attribute A will not usually tell us all the information, but it will give us some of it. We can measure exactly how much by looking at how much information we still need after the attribute test. Any attribute A divides the training set E into subsets E1, ..., Ev according to their values for A, where A can have v distinct values. Each subset Ei has pi positive examples and ni negative examples, so if we go along that branch, we will need an additional I(pi/(pi + ni), ni/(pi + ni)) bits of information to answer the question.

Amount of Information - V
A randomly chosen example from the training set has the ith value for the attribute with probability (pi + ni)/(p + n), so on average, after testing attribute A, we will need the following amount of information to classify the example:

Remainder(A) = Σi=1..v (pi + ni)/(p + n) · I(pi/(pi + ni), ni/(pi + ni))

Amount of Information - VI
The information gain from the attribute test is the difference between the original information requirement and the new requirement:

Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A)

Amount of Information - VII
The heuristic used in the CHOOSE-ATTRIBUTE function is just to choose the attribute with the largest gain. Returning to the attributes considered in Figure 18.4, we have: Gain(Patrons) ≈ 0.541 bits.
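As a concrete check of these formulas, here is a small Python sketch. The split counts used for Patrons (None: 0 positive / 2 negative, Some: 4/0, Full: 2/4, out of 6 positive and 6 negative examples) are those of the standard 12-example restaurant training set assumed by these notes; choose_attribute completes the decision-tree sketch given earlier.

```python
import math

def information(*probs):
    # I(P(v1), ..., P(vn)) = sum of -P(vi) * log2 P(vi), ignoring zero terms.
    return sum(-p * math.log2(p) for p in probs if p > 0)

def gain(p, n, subsets):
    """Information gain of an attribute splitting (p, n) into subsets [(pi, ni), ...]."""
    remainder = sum((pi + ni) / (p + n) * information(pi / (pi + ni), ni / (pi + ni))
                    for pi, ni in subsets)
    return information(p / (p + n), n / (p + n)) - remainder

# Patrons splits the 12 examples into None (0+, 2-), Some (4+, 0-), Full (2+, 4-).
print(round(gain(6, 6, [(0, 2), (4, 0), (2, 4)]), 3))   # -> 0.541

def choose_attribute(attributes, examples):
    # Pick the attribute with the largest information gain (used by the earlier sketch).
    def split_counts(attr):
        values = {e[attr] for e in examples}
        return [(sum(1 for e in examples if e[attr] == v and e["WillWait"]),
                 sum(1 for e in examples if e[attr] == v and not e["WillWait"]))
                for v in values]
    p = sum(1 for e in examples if e["WillWait"])
    n = len(examples) - p
    return max(attributes, key=lambda a: gain(p, n, split_counts(a)))
```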

Assessing the Performance


1. Collect a large set of examples.
2. Divide it into two disjoint sets: a training set and a test set.
3. Apply the learning algorithm to the training set to generate a hypothesis H.
4. Measure the percentage of examples in the test set that are correctly classified by H.
5. Repeat steps 1 to 4 for different sets.

The learning curve shows how the quality of the prediction increases as the training set grows.
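A minimal Python sketch of this evaluation procedure; learn and classify stand for whatever learning algorithm and prediction function are being assessed (both are assumptions here), examples is a list of (input, label) pairs, and each size in sizes should be smaller than the number of examples:

```python
import random

def accuracy(hypothesis, test_set, classify):
    correct = sum(1 for x, y in test_set if classify(hypothesis, x) == y)
    return correct / len(test_set)

def learning_curve(examples, learn, classify, sizes, trials=20):
    """For each training-set size, average test accuracy over several random splits."""
    curve = []
    for m in sizes:
        scores = []
        for _ in range(trials):
            random.shuffle(examples)
            training, test = examples[:m], examples[m:]
            h = learn(training)                          # step 3: generate hypothesis H
            scores.append(accuracy(h, test, classify))   # step 4: measure test accuracy
        curve.append((m, sum(scores) / len(scores)))
    return curve
```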

Noise and Overfitting

Overfitting is the problem of finding meaningless "regularity" in the data. For example, if die rolls are characterised by attributes such as the hour, day and month of the roll, the learner can produce a decision tree that fits the training data perfectly whenever no two examples have identical descriptions, even though the attributes are irrelevant. One remedy is decision tree pruning, which detects and removes irrelevant attributes; an attribute is irrelevant if it would give no information gain on an infinitely large sample. The pruning test starts from the null hypothesis, which assumes that there is no underlying pattern, and an attribute is kept only if the observed deviation from this hypothesis is statistically significant (e.g. at the 5% level). An alternative is cross-validation: use only part of the data for learning and the rest for testing the prediction performance, repeat with different subsets, and select the best tree. (This can be combined with pruning.)

EXPLANATION BASED LEARNING (EBL) (8 MARKS)
Learning a lot from one example: EBL is a method for extracting general rules from individual observations.
Example problem: if a student is asked to solve a problem in differential calculus or algebraic expressions for the first time, with no experience, it is very difficult to solve the problem. Applying the standard rules of differentiation yields a simplification of the expression.
Advantage: Explanation Based Learning requires only a single training example, whereas inductive learning requires many training examples.
MEMOIZATION
Memoization is a technique used in computer science to speed up programs by saving the results of computation.

Memo functions accumulate a database of input/output pairs; when called, the function first checks the database to see whether it can avoid solving the problem from scratch.
EBL produces a general rule such as:
ArithmeticUnknown(u) => Derivative(u^2, u) = 2u
i.e., for any arithmetic unknown u, the derivative of u^2 with respect to u is 2u. When such an expression is given to a logical agent to solve, it is expressed in the following form:
ASK(Derivative(u^2, u) = d, KB), with answer d = 2u.
However, an agent that solves the expression for one particular variable must derive it again when given another variable. So any expression should be solved not only with its constants as given, but also with generalized variables, so that the result can be reused for all variables.

Extracting general rules from examples
The main idea of EBL is to construct an explanation of the observation using prior knowledge, and then to establish a definition of the class of cases for which the same explanation structure can be used. The explanation can be a logical proof, but more generally it can be any problem-solving process whose steps are well defined.

EXAMPLE: simplify the expression 1 × (0 + X).
The knowledge base includes:
Rewrite(u, v) ^ Simplify(v, w) => Simplify(u, w)
Primitive(u) => Simplify(u, u)
ArithmeticUnknown(u) => Primitive(u)
Number(u) => Primitive(u)
Rewrite(1 × u, u)
Rewrite(0 + u, u)

Two proof trees are constructed using the EBL method, as shown below.

In the second proof tree, the constants from the original goal are replaced by variables, so that it can serve as a general template. The variabilized proof proceeds using the same rule applications as the original proof.

To recap, the basic EBL process works as follows:


1. Given an example, construct a proof that the goal predicate applies to the example using the available background knowledge.

2. In parallel, construct a generalized proof tree for the variabilized goal using the same inference steps as in the original proof.
3. Construct a new rule whose left-hand side consists of the leaves of the proof tree, and whose right-hand side is the variabilized goal (after applying the necessary bindings from the generalized proof).
4. Drop any conditions that are true regardless of the values of the variables in the goal.

IMPROVING EFFICIENCY There are three factors involved in the analysis of efficiency gains from EBL:

1. Adding large numbers of rules to a knowledge base can slow down the reasoning process, because the inference mechanism must still check those rules even in cases where they do not yield a solution. In other words, it increases the branching factor in the search space.
2. To compensate for this, the derived rules must offer significant increases in speed for the cases that they do cover. This mainly comes about because the derived rules avoid dead ends that would otherwise be taken, as well as shortening the proof itself.
3. Derived rules should also be as general as possible, so that they apply to the largest possible set of cases.

REINFORCEMENT LEARNING (8/16 marks)
Reinforcement learning addresses how an agent learns what to do when there is no one to teach it which action to take in each circumstance.
Example: an agent can learn to play chess by supervised learning if there is a teacher to show how to play; but if there is no teacher, the agent can try some random moves and build a predictive model of its environment. Without some feedback about what is good and what is bad, however, the agent will have no grounds for deciding which move to make. The agent needs to know that something good has happened when it wins and something bad has happened when it loses. This kind of feedback is called a reward, or reinforcement. In games like chess, the reinforcement is received only at the end of the game; in other domains, rewards come more frequently.
The main task of reinforcement learning is to use observed rewards to learn an optimal policy for the environment.

Agents
Utility-based agent: learns a utility function on states and uses it to select actions that maximize the expected outcome utility.
Q-learning agent: learns an action-value function, or Q-function, giving the expected utility of taking a given action in a given state, i.e. a function of (action, state) pairs.
Reflex agent: learns a function mapping states directly to actions.
PASSIVE LEARNING: the agent learns the utilities of states.
ACTIVE LEARNING: the agent also learns what to do.

Passive Reinforcement Learning We start with the case of a passive learning agent using a state-based representation in a known, accessible environment. In passive learning, the environment generates state transitions and the agent perceives them.

Consider an agent trying to learn the utilities of the states shown in Figure (a). We assume, for now, that it is provided with a model M giving the probability Mij of a transition from state i to state j, as in Figure (b). In each training sequence, the agent starts in state (1,1) and experiences a sequence of state transitions until it reaches one of the terminal states (3,2) or (3,3), where it receives a reward. The objective is to use the information about rewards to learn the expected utility U(i) associated with each nonterminal state i. We make one big simplifying assumption: the utility of a sequence is the sum of the rewards accumulated in the states of the sequence; that is, the utility function is additive. We define the reward-to-go of a state as the sum of the rewards from that state until a terminal state is reached.

Direct Utility Estimation
For each state the agent ever visits, and for each time the agent visits that state, keep track of the accumulated reward from that visit onwards; the utility estimate is the average of these samples. This is similar to inductive learning: learning a function on states using samples.
Weaknesses: it ignores correlations between the utilities of neighboring states, and it converges very slowly.

Adaptive Dynamic Programming (ADP)
ADP learns how states are connected. An ADP agent works by learning the transition model of the environment as it goes along and solving the corresponding Markov decision process using a dynamic programming method. For a passive learning agent, this means plugging the learned transition model T(s, π(s), s') and the observed rewards R(s) into the Bellman equations to calculate the utilities of the states. We can adopt the approach of modified policy iteration, using a simplified value iteration process to update the utility estimates after each change to the learned model. Because the model changes only slightly with each observation, the value iteration process can use the previous utility estimates as initial values and converges quickly. The process of learning the model itself is easy, because the environment is fully observable; we can represent the transition model as a table of probabilities.

Temporal Difference
Every time we make a transition from state s to state s':
Update the utility of s': U[s'] = current observed reward.

Update the utility of s: U[s] = (1 - a) U[s] + a (r + g U[s'])
a: learning rate
r: reward observed in s
g: discount factor

Temporal Difference Learning
Can we get the best of both worlds: use the constraints among utilities without solving the equations for all states? The idea is to use observed transitions to adjust the utilities locally, in line with the constraints:

U(i) <- U(i) + α (R(i) + U(j) - U(i))

where α is the learning rate. This is called the temporal difference (TD) equation, because it updates according to the difference in utilities between successive states. Note that, compared with the full equilibrium equation

U(i) = R(i) + Σj Mij U(j),

the TD update only involves the observed successor j rather than all successors. Nevertheless, the average value of U(i) converges to the correct value. Going a step further, if α is replaced by a function that decreases appropriately with the number of observations, then U(i) itself converges to the correct value (Dayan, 1992).

Properties of Temporal Difference
What happens when an unlikely transition occurs? U[s] temporarily becomes a bad approximation of the true utility. However, this happens rarely, so on average U[s] converges to the correct value; and if the learning rate a decreases over time, U[s] itself converges to the correct value.
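A minimal Python sketch of passive TD learning as described above; the list-of-(state, reward)-pairs trial format and the 1/N(s) learning-rate decay are illustrative choices, not part of the notes:

```python
from collections import defaultdict

def td_passive(trials, gamma=1.0):
    """Passive TD(0): update U[s] after each observed transition s -> s'."""
    U = defaultdict(float)       # utility estimates
    visits = defaultdict(int)    # visit counts, used to decay the learning rate
    for trial in trials:         # each trial is a list of (state, reward) pairs
        for (s, r), (s_next, _) in zip(trial, trial[1:]):
            visits[s] += 1
            alpha = 1.0 / visits[s]                          # decreasing learning rate
            U[s] += alpha * (r + gamma * U[s_next] - U[s])   # TD update
        s_last, r_last = trial[-1]
        U[s_last] = r_last       # terminal state's utility is its observed reward
    return dict(U)
```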

Hybrid Methods
ADP: more accurate, but slower and intractable for large numbers of states. TD: less accurate, but faster and tractable. An intermediate approach: generate pseudo-experiences, i.e. imagine transitions that have not actually happened, and update the utilities according to those transitions.

Hybrid Methods (continued)
Making ADP more efficient: do only a limited number of adjustments after each transition, and use the estimated transition probabilities to identify the most useful adjustments.

Active Reinforcement Learning
Using passive reinforcement learning, the utilities of states and the transition probabilities are learned. Those utilities and transitions can be plugged into the Bellman equations. Problem? The Bellman equations give optimal solutions only given correct utility and transition functions.

Passive reinforcement learning produces approximate estimates of those functions. Solutions?

Exploration/Exploitation
The goal is to maximize utility, but the utility function is only approximately known. Dilemma: should the agent maximize utility based on current knowledge (exploitation), or try to improve its current knowledge (exploration)? Answer: a little of both.

Exploration Function
U[s] = R[s] + g max_a f(Q(a, s), N(a, s))
R[s]: current reward.
g: discount factor.
Q(a, s): estimated utility of performing action a in state s.
N(a, s): number of times action a has been performed in state s.
f(u, n): preference according to the utility and the degree of exploration so far for (a, s).
Initialization: U[s] = an optimistically large value.

Q-learning
Learning the utility of state-action pairs: U[s] = max_a Q(a, s).
Learning can be done using TD:
Q(a, s) = (1 - b) Q(a, s) + b (R(s) + g max_a' Q(a', s'))
b: learning factor
g: discount factor
s': next state

a': possible action at the next state s'
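A minimal Python sketch of a Q-learning agent using this update together with a simple optimistic exploration function f; the constants r_plus (optimistic reward estimate) and n_e (exploration threshold) are illustrative assumptions:

```python
from collections import defaultdict

class QLearner:
    def __init__(self, actions, beta=0.1, gamma=0.9, r_plus=2.0, n_e=5):
        self.Q = defaultdict(float)   # Q(a, s) estimates
        self.N = defaultdict(int)     # N(a, s) visit counts
        self.actions = actions
        self.beta, self.gamma = beta, gamma
        self.r_plus, self.n_e = r_plus, n_e   # optimism constants for exploration

    def f(self, q, n):
        # Exploration function: be optimistic about actions tried fewer than n_e times.
        return self.r_plus if n < self.n_e else q

    def choose_action(self, s):
        return max(self.actions, key=lambda a: self.f(self.Q[(a, s)], self.N[(a, s)]))

    def update(self, s, a, reward, s_next):
        # Q(a,s) <- (1-b) Q(a,s) + b (R(s) + g max_a' Q(a', s_next))
        self.N[(a, s)] += 1
        best_next = max(self.Q[(a2, s_next)] for a2 in self.actions)
        self.Q[(a, s)] = ((1 - self.beta) * self.Q[(a, s)]
                          + self.beta * (reward + self.gamma * best_next))
```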

Generalization in Reinforcement Learning
How do we apply reinforcement learning to problems with huge numbers of states (chess, backgammon, ...)? The solution is similar to estimating the probabilities of a huge number of events: learn parametric utility functions, where the parameters are applied to features of each state. Example: chess, where about 20 features are adequate for describing the current board.

Learning Parametric Utility Functions for Backgammon
First approach: design a weighted linear function of 16 terms, collect a training set of board states, and ask human experts to evaluate the training states. Result: the program was not competitive with human experts, and collecting the training data was very tedious.
Second approach: design a weighted linear function of 16 terms, let the system play against itself, and provide the reward at the end of each game. Result (after 300,000 games and a few weeks of computation): a program competitive with the best players in the world.

Computational Learning Theory
Main principle: any hypothesis that is seriously wrong will almost certainly be "found out" with high probability after a small number of examples, because it will make an incorrect prediction. This assumes that the training and test sets are drawn randomly from the same distribution.

STATISTICAL LEARNING (8/16 marks)
GOALS: learn probabilistic theories of the world from experience. We focus on the learning of Bayesian networks: given input data (or evidence), learn probabilistic theories of the world (or hypotheses).
Topics:
Bayesian learning and approximate Bayesian learning
Maximum a posteriori learning (MAP)
Maximum likelihood learning (ML)
Parameter learning with complete data
ML parameter learning with complete data in discrete models
ML parameter learning with complete data in continuous models (linear regression)
Naive Bayes models
Bayesian parameter learning
Learning Bayes net structure with complete data
Learning with hidden variables or incomplete data (the EM algorithm)

Full Bayesian Learning
View learning as Bayesian updating of a probability distribution over the hypothesis space. H is the hypothesis variable, with values h1, h2, ..., and prior P(H). The jth observation dj gives the outcome of the random variable Dj; the training data are d = d1, ..., dN.
Given the data so far, each hypothesis has a posterior probability:
P(hi|d) = α P(d|hi) P(hi)
where P(d|hi) is called the likelihood (and α is a normalizing constant).
Predictions use a likelihood-weighted average over all hypotheses:
P(X|d) = Σi P(X|d, hi) P(hi|d) = Σi P(X|hi) P(hi|d)

EXAMPLE
Suppose there are five kinds of bags of candies:
10% are h1: 100% cherry candies
20% are h2: 75% cherry candies + 25% lime candies
40% are h3: 50% cherry candies + 50% lime candies
20% are h4: 25% cherry candies + 75% lime candies
10% are h5: 100% lime candies
Then we observe candies drawn from some bag. What kind of bag is it? What flavor will the next candy be?
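A small Python sketch of this updating process for the candy example: after observing n lime candies in a row, it prints the posterior over h1..h5 and the predicted probability that the next candy is lime. The code itself is illustrative; the priors and per-bag lime probabilities are the ones given above.

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]          # P(h1), ..., P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]        # P(lime | hi) for each bag type

def posterior(num_limes_observed):
    # P(hi | d) is proportional to P(d | hi) P(hi); here d is a run of lime candies.
    unnormalized = [p ** num_limes_observed * prior
                    for p, prior in zip(p_lime, priors)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def predict_next_lime(num_limes_observed):
    # P(next = lime | d) = sum_i P(lime | hi) P(hi | d)
    post = posterior(num_limes_observed)
    return sum(p * w for p, w in zip(p_lime, post))

for n in range(6):
    print(n, [round(p, 3) for p in posterior(n)], round(predict_next_lime(n), 3))
```

With no observations the prediction is 0.5; as more lime candies are observed, the posterior mass shifts toward h4 and h5 and the prediction approaches 1.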

Properties of full Bayesian learning
1. The true hypothesis eventually dominates the Bayesian prediction, given that the true hypothesis is in the prior.
2. The Bayesian prediction is optimal, whether the data set is small or large.
On the other hand, the hypothesis space is usually very large or infinite, so summing over the hypothesis space is often intractable (e.g., there are 18,446,744,073,709,551,616 = 2^64 Boolean functions of 6 attributes).

MAP approximation
Maximum a posteriori (MAP) learning: choose hMAP maximizing P(hi|d), instead of calculating P(hi|d) for all hypotheses hi. That is, maximize P(d|hi)P(hi), or equivalently log P(d|hi) + log P(hi).

Overfitting in MAP and Bayesian learning
Overfitting occurs when the hypothesis space is too expressive, so that some hypotheses fit the data set well by chance. One remedy is to use the prior to penalize complexity. The log terms above can be viewed as (the negative of) the number of bits needed to encode the data given the hypothesis, plus the bits needed to encode the hypothesis itself; this is the basic idea of minimum description length (MDL) learning. For deterministic hypotheses (the simplest case), P(d|hi) is 1 if the hypothesis is consistent with the data and 0 otherwise, so MAP learning chooses the simplest hypothesis that is consistent with the data.

ML approximation
For large data sets, the prior becomes irrelevant. Maximum likelihood (ML) learning: choose hML maximizing P(d|hi), i.e., simply get the best fit to the data. This is identical to MAP for a uniform prior (which is reasonable if all hypotheses are of the same complexity). ML is the "standard" (non-Bayesian) statistical learning method. Reasons for using it:
1. Researchers distrust the subjective nature of hypothesis priors.
2. Hypotheses are of the same complexity.
3. Hypothesis priors are less important when the data set is large.
4. The space of hypotheses is huge.

ML estimation seems sensible, but it causes problems with zero counts: if the data set is small enough that some events have not yet been observed, the ML hypothesis assigns zero probability to those events. Various tricks are used to deal with this, including initializing the counts for each event to 1 instead of 0.
The ML approach:
1. Write down an expression for the likelihood of the data as a function of the parameter(s);
2. Write down the derivative of the log likelihood with respect to each parameter;
3. Find the parameter values such that the derivatives are zero.
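As a worked instance of these three steps (standard, though not spelled out in the notes): suppose a bag has an unknown fraction θ of cherry candies, and we unwrap N candies of which c are cherry and l = N - c are lime. Then:

P(d | hθ) = θ^c (1 - θ)^l
L(θ) = log P(d | hθ) = c log θ + l log(1 - θ)
dL/dθ = c/θ - l/(1 - θ) = 0   =>   θ = c / (c + l)

So the ML estimate is simply the observed proportion of cherry candies, which is exactly why unobserved events (c = 0) get probability zero, as noted above.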

BETA DISTRIBUTION:

Hidden Variables
Learning with hidden variables: many real-world problems have hidden variables which are not observable in the data available for learning.
Question: if a variable (e.g. a disease) is not observed, why not construct a model without it?
Answer: hidden variables can dramatically reduce the number of parameters required to specify a Bayesian network, which in turn reduces the amount of data needed for learning.

EM: Learning Mixtures of Gaussians (EM ALGORITHM - SEPARATE 16 MARKS)
The unsupervised clustering problem (Fig 20.8, with 3 components):

P(x) = Σi=1..k P(C = i) P(x | C = i)

If we knew which component generated each xj, we could recover the parameters of each component; if we knew the parameters of each component, we would know which component each xj should belong to. However, we know neither, which is where EM (expectation and maximization) comes in.

Pretend we know the parameters of the model, then infer the probability that each xj belongs to each component; iterate until convergence. For the mixture of Gaussians, initialize the mixture model parameters arbitrarily and iterate the following two steps:
E-step: compute the responsibilities pij = P(C = i | xj) = α P(xj | C = i) P(C = i), and let pi = Σj pij.
M-step: compute the new parameters
  μi = Σj pij xj / pi
  Σi = Σj pij xj xjT / pi
  wi = pi (component weight)
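A minimal one-dimensional Python sketch of these two steps, assuming NumPy; the synthetic data, the initialization, and the fixed iteration count are illustrative choices:

```python
import numpy as np

def em_gaussian_mixture(x, k=3, iterations=50, seed=0):
    """Fit a 1-D mixture of k Gaussians to the data x using the EM steps above."""
    rng = np.random.default_rng(seed)
    n = len(x)
    w = np.full(k, 1.0 / k)                       # component weights
    mu = rng.choice(x, size=k, replace=False)     # arbitrary initial means
    var = np.full(k, np.var(x))                   # arbitrary initial variances
    for _ in range(iterations):
        # E-step: responsibilities p[i, j] = P(C = i | xj), normalized over components.
        diff = x[None, :] - mu[:, None]
        pdf = np.exp(-diff ** 2 / (2 * var[:, None])) / np.sqrt(2 * np.pi * var[:, None])
        p = w[:, None] * pdf
        p /= p.sum(axis=0, keepdims=True)
        # M-step: re-fit each component, weighting each point by its responsibility.
        pi = p.sum(axis=1)                        # effective number of points per component
        mu = (p @ x) / pi
        new_diff = x[None, :] - mu[:, None]
        var = (p * new_diff ** 2).sum(axis=1) / pi
        w = pi / n
    return w, mu, var

# Example: data drawn from two well-separated Gaussians.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-3, 1, 300), rng.normal(4, 0.5, 200)])
print(em_gaussian_mixture(data, k=2))
```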

Instance-based Learning Parametric vs. nonparametric learning

Parametric learning focuses on fitting the parameters of a restricted family of probability models to an unrestricted data set. Parametric learning methods are often simple and effective, but they can oversimplify what's really happening. Nonparametric learning allows the hypothesis complexity to grow with the data; IBL is nonparametric because it constructs hypotheses directly from the training data.

Nearest-Neighbor Models
The key idea: neighbors are similar. Density estimation example: estimate the probability density at a point x by the density of its neighbors. This connects with table lookup, naive Bayes classifiers, decision trees, and so on.
How do we define the neighborhood N? If it is too small, it will not contain any data points; if it is too big, the density estimate is the same everywhere. A solution is to define N to contain k points, where k is large enough to ensure a meaningful estimate. For a fixed k, the size of N varies (Fig 21.12). Figure 21.13 shows the effect of the size of k; for most low-dimensional data, k is usually between 5 and 10.
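A minimal Python sketch of k-nearest-neighbor density estimation in one dimension, using the usual estimate density ≈ k / (N × size of the smallest neighborhood containing k points); the function and the sample data are illustrative:

```python
import numpy as np

def knn_density(query, data, k=10):
    """Estimate the probability density at each query point from its k nearest neighbors."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    densities = []
    for x in np.atleast_1d(query):
        distances = np.sort(np.abs(data - x))
        radius = distances[k - 1]        # radius of the smallest neighborhood holding k points
        volume = 2 * radius              # length of the interval [x - r, x + r]
        densities.append(k / (n * volume))
    return np.array(densities)

# Example: density of standard-normal samples, evaluated at a few points.
rng = np.random.default_rng(0)
samples = rng.normal(0, 1, 1000)
print(knn_density([-2.0, 0.0, 2.0], samples, k=25))
```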
LEARNING WITH HIDDEN VARIABLES: THE EM ALGORITHM (16 MARKS)
Many real-world problems have hidden variables (sometimes called latent variables) which are not observable in the data that are available for learning. For example, medical records often include the observed symptoms, the treatment applied, and perhaps the outcome of the treatment, but they seldom contain a direct observation of the disease itself. One might ask, "If the disease is not observed, why not construct a model without it?"

The answer appears in Figure 20.7, which shows a small, fictitious diagnostic model for heart disease. There are three observable predisposing factors and three observable symptoms (which are too depressing to name). Assume that each variable has three possible values (e.g., none, moderate, and severe). Removing the hidden variable from the network in (a) yields the network in (b); the total number of parameters increases from 78 to 708. Thus, latent variables can dramatically reduce the number of parameters required to specify a Bayesian network. This, in turn, can dramatically reduce the amount of data needed to learn the parameters. Hidden variables are important, but they do complicate the learning problem. In Figure 20.7(a), for example, it is not obvious how to learn the conditional distribution for HeartDisease, given its parents, because we do not know the value of HeartDisease in each case; the same problem arises in learning the distributions for the symptoms.

An algorithm called expectation-maximization, or EM, solves this problem in a very general way. We will show three examples and then provide a general description. The algorithm seems like magic at first, but once the intuition has been developed, one can find applications for EM in a huge range of learning problems.

Unsupervised clustering: Learning mixtures of Gaussians
Unsupervised clustering is the problem of discerning multiple categories in a collection of objects. The problem is unsupervised because the category labels are not given. For example, suppose we record the spectra of a hundred thousand stars; are there different types of stars revealed by the spectra, and, if so, how many and what are their characteristics? We are all familiar with terms such as "red giant" and "white dwarf", but the stars do not carry these labels on their hats; astronomers had to perform unsupervised clustering to identify these categories. Other examples include the identification of species, genera, orders, and so on in the Linnaean taxonomy of organisms, and the creation of natural kinds to categorize ordinary objects.

Unsupervised clustering begins with data. Figure 20.8(a) shows 500 data points, each of which specifies the values of two continuous attributes. The data points might correspond to stars, and the attributes might correspond to spectral intensities at two particular frequencies. Next, we need to understand what kind of probability distribution might have generated the data. Clustering presumes that the data are generated from a mixture distribution P. Such a distribution has k components, each of which is a distribution in its own right. A data point is generated by first choosing a component and then generating a sample from that component. Let the random variable C denote the component, with values 1, ..., k; then the mixture

distribution is given by

P(x) = Σi=1..k P(C = i) P(x | C = i)

where x refers to the values of the attributes for a data point. For continuous data, a natural choice for the component distributions is the multivariate Gaussian, which gives the so-called mixture of Gaussians family of distributions. The parameters of a mixture of Gaussians are wi = P(C = i) (the weight of each component), μi (the mean of each component), and Σi (the covariance of each component). Figure 20.8(b) shows a mixture of three Gaussians; this mixture is in fact the source of the data in (a). The unsupervised clustering problem, then, is to recover a mixture model like the one in Figure 20.8(b) from raw data like that in Figure 20.8(a). Clearly, if we knew which component generated each data point, then it would be easy to recover the component Gaussians: we could just select all the data points from a given component and then apply (a multivariate version of) Equation (20.4) for fitting the parameters of a Gaussian to a set of data. On the other hand, if we knew the parameters of each component, then we could, at least in a probabilistic sense, assign each data point to a component. The problem is that we know neither the assignments nor the parameters. The basic idea of EM in this context is to pretend that we know the parameters of the model and then to infer the probability that each data point belongs to each component.

After that, we refit the components to the data, where each component is fitted to the entire data set with each point weighted by the probability that it belongs to that component. The process iterates until convergence. Essentially, we are "completing" the data by inferring probability distributions over the hidden variables (which component each data point belongs to) based on the current model. For the mixture of Gaussians, we initialize the mixture model parameters arbitrarily and then iterate the following two steps:

The E-step, or expectation step, can be viewed as computing the expected values pij of the hidden indicator variables Zij, where Zij is 1 if datum xj was generated by the ith component and 0 otherwise. The M-step, or maximization step, finds the new values of the parameters that maximize the log likelihood of the data, given the expected values of the hidden indicator variables. The final model that EM learns when it is applied to the data in Figure 20.8(a) is shown in Figure 20.8(c); it is virtually indistinguishable from the original model from which the data were generated. Figure 20.9(a) plots the log likelihood of the data according to the current model as EM progresses. There are two points to notice. First, the log likelihood for the final learned model slightly exceeds that of the original model from which the data were generated. This reflects the fact that the data were generated randomly and might not provide an exact reflection of the underlying model. The second point is that EM increases the log likelihood of the data at every iteration.

This fact can be proved in general. Furthermore, under certain conditions, EM can be proven to reach a local maximum in likelihood.

NEURAL NETWORKS

Neural networks
Perceptrons
Multilayer perceptrons
Applications of neural networks
