
What is AI?

Is it...
• A type of applied computer science? (But what about theoretical and applied AI?)
• A branch of cognitive science?
• A science at all?

What is intelligence?
• What distinguishes human behavior from the behavior of everything else?
• Whatever behaviors we don't understand?
• Whatever "works", that is, results in survival (or flourishing) in complex
environments

Possible goals
• To understand intelligence independent of any particular "implementation"
• To model human or animal behavior
• To model human thought
• To implement rational thought
• To implement rational action

How is the topic divided up?


• By domain
o Vision
o Natural language
o Diagnosis (medicine, etc.)
o Mathematics
o Robotics
o Education
• By domain-independent task type
o Search
o Representation
o Inference
o Learning
o Evolution
o Perception
o Planning and action
• By theoretical framework
o Symbolic AI
 Logicist AI
 Case-based reasoning
 Soar
 ACT-R
o "Biomorphic" AI, Non-symbolic AI, Subsymbolic AI
 Neural networks, connectionist AI, dynamical systems
 Evolutionary computation

What are some basic differences in approach?


• The role of learning
o If you want an intelligent machine, program in the intelligence.
o If you want an intelligent machine, make it a good learner and send it out
into the world.
• The role of the hardware
o Intelligence is a software problem. An intelligent program can be run on a
brain or a computer.
o The hardware is relevant. We should look for intelligence that is based on
the properties of nervous systems because the smartest systems we know
about are smart because of their nervous systems. [neural networks,
connectionism]
• The role of domain-independent methods
o Neat AI
 Theories should be elegant and parsimonious.
 We should understand precisely what our theories can do and how
they behave.
 Most (or all) of intelligence is governed by general principles.
o Scruffy AI
 The mind is a kludge. To make things efficient, inelegant shortcuts
are often appropriate.
 It may be impossible to come to a precise understanding of our
theories.
 There are only a few general principles that apply across domains.
Intelligence comes from domain knowledge.
• Two frameworks: symbolic and subsymbolic (connectionist) AI
o Symbolic models
 Physical Symbol Systems (Newell, Pylyshyn, Fodor; summarized
by Harnad)
A set of arbitrary physical tokens (scratches on paper, holes on a
tape, events in a digital computer, etc.) that are manipulated on the
basis of explicit rules that are likewise physical tokens and strings
of tokens. The rule-governed symbol-token manipulation is based purely
on the shape of the symbol tokens (not their "meaning"), i.e., it is
purely syntactic, and consists of rulefully combining and recombining
symbol tokens. There are primitive atomic symbol tokens and composite
symbol-token strings. The entire cognitive system and all its
parts--the atomic tokens, the composite tokens, the syntactic
manipulations (both actual and possible) and the rules--are all
semantically interpretable: the syntax can be systematically assigned
a meaning (e.g., as standing for objects, as describing states of
affairs).
 Processes happen sequentially.
 There is a central controller which coordinates the activities of
the modules of the cognitive system and selects among candidate
processes at each point in time.
 The cognitive system interacts with the world through interfaces to
perception and action, which operate very differently from the
internal (cognitive) system.
 Time is often mapped onto space; that is, the cognitive system has
simultaneous access to all of a pattern of some length (word,
sentence, etc.). Inputs may also be presented sequentially, but the
problem of temporal short-term memory is side-stepped because
the inputs are preprocessed.
 Knowledge is usually programmed into the cognitive system by
someone who has a theory of how knowledge is organized.
Learning is also possible, but models that learn usually start
intelligent.
o Subsymbolic (connectionist, dynamical) models
 Control is distributed. There is just the illusion of someone being
in charge because the behavior seems purposeful, and it seems to
be possible to write a centralized program to make it happen.
 The basic processes involve very simple interactions among
primitive elements arranged in a network. Usually the interaction
amounts to the spread of activation.
 Many of the processes happen in parallel.
 The cognitive system may interact with the world through
perception and action components which are similar to the
internal (cognitive) parts of the creature. In some models, the
environment and the creature itself constitute one large dynamical
system.
 Knowledge is distributed, usually in the form of patterns of
connectivity among the primitive elements. The knowledge in
such systems is implicit; it often cannot be simply read off.
 Knowledge gets into the cognitive system through learning as the
system discovers the statistical properties of the world around it or
through evolution as generations of creatures are forced to survive
in the world.
 The problem of temporal short-term memory is often addressed,
though the continuous interaction of components of the cognitive
system with each other and the world may not be.

Introduction to representation
Why representation?
• Tasks: going from inputs (stimuli) to outputs (responses)
• The most primitive solution: a lookup table which specifies an output for every
input
• The problem with lookup tables:
o There may be too many inputs to store (the world is continuous, after all).
o There is a need for the system to be able to respond appropriately to novel
inputs.
• The alternative solution: a function from inputs to outputs
o AI is about these functions: what they might look like for tasks requiring
"intelligence".
o The functions may be very complex, requiring one or more
transformations of the input on the way to the output: internal
representations.

What are representations like?


• Are they explicit (directly interpretable), or are they in a form that looks like
garbage to an outside observer (even though they serve their function for the
system)?
• Are they localized (in one place), or are they distributed throughout the system?
• Are they propositional (language-like), or are they in some other form, for
example, more like images?
• Are they static or dynamic? Do they just sit there or do they "happen"?
• Are there different kinds of representations for different domains that have little in
common with each other?

What do they need to have?


• Distinction between objects and relations
• Wholes consisting of parts, which are in turn wholes consisting of parts:
recursive structure
• (Maybe) slots (roles) and fillers (values)
o In an object, the SHAPE slot may have filler CYLINDER. In a sentence,
the VERB slot may have filler "PUT".
o In an event, the AGENT slot may have filler ROBOT and the PATIENT
slot may have filler STICK1.
• (Maybe) truth and falsehood
• Generality (abstraction)
o The generalization (concept) BLOCK is an abstraction over all blocks,
including, for example, BLOCK4.
o The generalization (concept) PUT is an abstraction over all instances of
putting, including, for example, PUT8, in which the Robot puts a block on
another block.

Two basic kinds of representations


• Symbolic
o The primitives are symbols, e.g., PUT, BLOCK4.
o More complex expressions are built up by concatenating symbols
together into symbol structures, e.g., PUT(ROBOT, BLOCK4, TABLE).
o Similarity is all-or-none: eq?.
• Connectionist (subsymbolic)
o The primitives are vectors of numbers.
o There are no more complex expressions; combinations are produced
through addition or some other form of superposition.
o Similarity is (potentially) continuous: the distance between the vectors.

Some other questions concerning representation


• What sort of evidence do we have for what the internal representations of
cognition look like?
• What about the input and output themselves? What form do they take?
• How do representations relate to the world, to perception, and to action?

Memory and learning (and representation)


• There is usually a basic distinction between long-term, general knowledge of a
particular type (in long-term memory) and the short-term characterization of the
current situation.
• The system needs to be able to take items in STM and use them to access
knowledge in LTM (a kind of search). In symbolic systems this is usually
accomplished through some form of pattern matching.
• LTM normally consists of categories (concepts) and rules specifying what sort of
action to take or inference to make in a particular sort of situation. Application of
a rule can result in external behavior or a change in the contents of STM.
• Knowledge may get into LTM through hard-wiring by the programmer, through
learning, or through evolution.
• Learning is usually induction: in response to a set of examples, the system creates
new categories or rules (a kind of search). Ease of learning depends on the quality
of the feedback (if any) provided by the environment.

Kinds of processing: what we need to do with representations
• Categorization: given a representation of an object or situation, assign it to one
of a finite set of labels.

Categorizing an input image as a BLOCK.


Categorizing an input word as a NOUN.

• Parsing: given a pattern, assign some structure to it.

Segmenting an input image into constituent objects.


Segmenting an input sentence into constituent phrases.

• Compression: given a pattern, represent it in a more compact way, taking
advantage of the regularity in it.

Representing an input image in terms of a small number of dimensions.

• Deduction, inference: given some facts, infer one or more other facts that follow
from them.

Given that BLOCK1 is ON BLOCK2, infer that BLOCK2 is UNDER BLOCK1.


Given that X is a BLOCK, infer that it has flat faces.

• Induction: given some examples, create a general rule or category that covers
them all.

Given multiple instances of scenes, create the rule relating ON and UNDER.
Given multiple instances of blocks, create the category BLOCK and use it to
categorize new instances.

• Action: given a situation (or some facts), take an appropriate action.

Given a command to put a block in a box, perform the sequence of actions
involved in doing it.

Predicate calculus
What should a good representational format do for us?
• It should be correct. It should represent what we think it represents, permitting the
inferences that we would like the system to make and failing to make inferences
that we would not like (together with an inference mechanism).
• It should be expressive; it should allow us to distinguish all situations that we
need to distinguish (it should be unambiguous). It should allow us to point to
entities that need to be pointed to.
• It should treat situations which we believe are similar as similar.
• It should be flexible, allowing us to represent the same situation in different ways,
reflecting different construals.

Some spatial examples


• Some "facts" about objects
o BLOCKS
 They have square corners.
 They have straight edges.
 They have six faces.
 They don't roll.
 They are a kind of prism.
 The thing I'm looking at now is one of them.
o BALLS
 They're round.
 They have no edges or faces.
 They roll.
• Some facts about relations
o IN
 The contained object is surrounded by the container at some point.
The contained object is smaller than the container in some
dimension.
 The container has some empty space within it.
 In order to be freely moved, the contained object needs to be taken
out of the container.
 It's a kind of spatial relation.
 The scene I'm looking at now has one in it. It relates a cup (the
container) and a stick (the contained thing).
o BEHIND (viewer-centered)
 At least part of the object that is behind is obscured by the object
that is in front.
 The object that is behind is further from the viewer than the object
that is in front.
 To touch the object that is behind, the viewer has to reach over or
around the object that is in front.
o ON
 The bottom of the supported object is in contact with the top of the
supporter.
• Some facts about functions
o TOP-OF
 The top of an object is a surface or a corner.
 We can find the top of an object by looking for the part of it that is
furthest from the earth (assuming we're on the earth).
 When an object is turned over, whatever was the top of it is now
the bottom of it.
o INSIDE-OF
 The inside of an object is a region in space.
 If an object is solid, the inside of it is part of it.
o SUPPORTER
 Given two objects, one on the other, the supporter of the two (or of
the situation), is the one that if taken away, would cause the other
to fall.

Relations, objects, predicates, functions


• Unlike objects, relations take arguments.
• A relation predicated of one or more objects is either true or not.
• In predicate calculus
o Objects and relations are represented by explicit constant symbols:
block23, in
o Predicates are represented by expressions consisting of relation constants
followed by object constants in a fixed order:
(in block23 cup4)
o An alternate way of representing predicates: the arguments are paired with
role constants, and their order is unspecified:
(in (container cup4) (contained block23))
o Note that nouns on the one hand and verbs, pre/postpositions, adjectives
on the other hand do not map neatly onto object and relation constants.
o Functions are represented by function constants, and function
expressions, like predicates, by a function constant followed by one or
more arguments:
(top-of block8)
o Function expressions return objects so can replace object symbols in
predicates:
(above (bottom-of block8) (top-of block4))

Categories and categorization


• Categorization is about going from lower-level to higher-level representations, for
example, from a specific object instance to an object category such as BLOCK.
• We need to be able to distinguish instances from categories and to represent their
relationship.
• We need to be able to distinguish categories at different levels of abstractness
(different taxonomic levels) and represent their relationship.
• Predicate calculus
o Represents object instances and object categories with explicit symbols and
their relationship as a predicate: (block obj23).
o Represents the relationship between categories at different levels using
variables, universal quantification, and implication:
(forall (?x) (if (block ?x) (prism ?x)))
o An alternative way of representing the relationship between instances and
categories and between categories at different taxonomic levels: AKO (a
kind of) and ISA relations:
(isa obj23 block)
(ako block prism)
• Representing relation instances
o Not explicit in ordinary predicate calculus; no way to directly express that
a particular event belongs to the event category SING.
o Alternative: relation instance (and relation category) symbols.
(sing event23)
(= (singer-of event23) terry)

Miscellaneous considerations
• Connectives: conjunction (and), disjunction (or), implication (if), equivalence
(equiv), negation (not): truth of each defined in terms of the truth of its
operand(s)
• Existential quantification

(exists (?x) (and (b551-student ?x) (not (know ?x scheme))))

(not (exists (?x) (and (b551-student ?x) (not (know ?x arithmetic)))))

• Sentences, well-formed formulas


• Equivalences of particular expressions (examples)

(equiv (if p q)
(if (not q) (not p)))

(equiv (if p q)
(or (not p) q))

(equiv (or p (and q r))
       (and (or p q) (or p r)))

(equiv (not (exists (?x) (p ?x)))
       (forall (?x) (not (p ?x))))

(equiv (forall (?x) (and (p ?x) (q ?x)))
       (and (forall (?x) (p ?x)) (forall (?x) (q ?x))))
• Primitives
o If we opt for a symbolic representation such as predicate calculus, we
need an alphabet of basic symbols.
o How would we decide on a set of primitives? This is usually based on
practical, rather than theoretical, considerations.
o We must make decisions about representational granularity.
• Limitations of first-order predicate calculus
o Representing belief and knowledge

(believes al (loves mary al))

(believes al (exists-life mars))

Representing predicate calculus in Scheme


• One possibility: relation constants are procedures, predicates are procedure calls,
returning either #t or #f
• A better option: all predicate calculus expressions take the form of lists, and
procedures are ways of manipulating and searching through them, such as
infer and true?.
• The second option requires that we build in the definitions of conjunction,
negation, etc.

What will we do with our representations?


• assert takes a sentence and adds it to a database, which is a conjunction of facts
taken as true.

(assert '(block b1) database) → modified database

• infer takes a database and returns a list of sentences that can be inferred from the
database

(infer '((block b1)
         (forall (?b) (if (block ?b) (nfaces ?b six)))))
→ ((nfaces b1 six))

• true? takes a database and a sentence and tells whether the sentence is true, given
the database. dont-know is a possibility.

(true? '((block b1)
         (forall (?b) (if (block ?b) (nfaces ?b six))))
       '(nfaces b1 six))
→ yes
• fill-in takes a database and a sentence containing one or more variables and
returns bindings for the variables.

(fill-in '((block b1)
           (forall (?b) (if (block ?b) (nfaces ?b six))))
         '(nfaces b1 ?nf))
→ ((?nf six))

• categorize takes a conjunction of facts about an individual and returns a
category for the individual.

(categorize '((nfaces b1 six)
              (height b1 5cm)
              (square-corners b1)
              (forall (?b) (if (and (nfaces ?b six)
                                    (square-corners ?b)
                                    (less-than (height ?b 20cm))
                                    (greater-than (height ?b 2cm)))
                               (block ?b))))
            'b1)
→ (block b1)

• learn takes a conjunction of facts about instances of a category and a category
symbol and returns a generalization about the category.

(learn '((block b1) (nfaces b1 six) (square-corners b1) (color b1 red)
         (block b2) (square-corners b2) (color b2 black)...)
       'block)
→ (forall (?b) (if (and (nfaces ?b six) (square-corners ?b))
                   (block ?b)))

Predicate calculus: practice


Using predicate calculus representation (Scheme format), show how to represent the
knowledge embodied in the following English sentences. As far as possible, show the
relationships among the different elements.

When a moving ball strikes a stationary ball, the moving ball is deflected in a direction
which is roughly opposite to its original direction, and the originally stationary ball
starts moving roughly in the direction of the originally moving ball. Ball 3 struck Ball 2,
which was stationary. Ball 3 was moving south when this happened.

(Ignore details like velocity, the effect of the mass of the balls, and the angle at which the
moving ball strikes the stationary ball (unless you want to be really ambitious :-). This is
very naive physics.)
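
One possible (partial) representation, in the Scheme-list format used above. The
predicate and function names here (strike, moving, stationary, direction-of,
roughly-same, roughly-opposite, after) and the explicit time arguments are
illustrative choices, not the only reasonable construal.

The specific facts:

(stationary ball2 t1)
(moving ball3 t1)
(= (direction-of ball3 t1) south)
(strike ball3 ball2 t1)

The general rule (very naive physics):

(forall (?m ?s ?t)
  (if (and (strike ?m ?s ?t) (moving ?m ?t) (stationary ?s ?t))
      (exists (?t2)
        (and (after ?t2 ?t)
             (moving ?s ?t2)
             (roughly-same (direction-of ?s ?t2) (direction-of ?m ?t))
             (roughly-opposite (direction-of ?m ?t2) (direction-of ?m ?t))))))
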
Search 1
Problem solving as search
• A solution as a state in the space of possible solutions
• Problem solving as search through this state space

Some basic questions concerning problem solving as search
• How can solving the problem be treated as the execution of a sequence of steps?
• Do the steps result in more and more complete solutions, or are candidate
complete solutions available at every step?
• Is it easier (or cheaper) to start at the goal and work backwards?
• Is the consequence of a step predictable? Is there an adversary (another agent who
can constrain the possible steps)?
• Is the solution the way a goal is reached or the nature of the goal itself (what's
found at the end)?
• Is it important to know more than one way to reach the goal?
• Is it important that the goal is reached in the most efficient way, the way requiring
the least cost to the agent?
• Is it important that the solution is found quickly?
• Is there a way to estimate the "distance" to the goal?
• How easy is it to reconsider and try a completely different set of steps?
• Is the step size adjustable?
• Is it possible to consider a number of different options in parallel?
• Is the number of potential ways to the goal finite?

Formalizing search
• Problem state: a particular configuration of the objects that are relevant for a
problem
Figuring out how to represent the problem states is no simple matter.
• State space: the space of all possible problem states
Normally only a partial state space is considered.
• Initial state: where the search starts
• Goal states: where the search should end
There may be any number of these, and we may be interested in finding only one
or all of them.
• State space search: search through the space of problem states for a goal state
• Search trees: nodes are states (or paths), links are one-step transitions from one
state to a successor state
• Expanding a node (extending a state/path)
• Testing for goal states (nodes)
• The queue of untried states (paths); adding new states (paths) to the queue (stack)
• Branching factor of the search: average number of successor states each state has
• Depth of the search: how far down the tree is extended during the search

Basic schema for search


(Assuming that the path makes a difference, we maintain a queue of paths rather than just
states.)

The algorithm makes use of two procedures specific to the problem

1. A predicate goal?, which takes a state and returns #t if the state is a goal state
2. A procedure expand, which takes a state and returns all of the successor states to
the state

• Form a one-element queue consisting of a zero-length path that contains only the
root node.
• Until the first path in the queue terminates at a goal node (satisfies goal?) or the
queue is empty,
o Remove the first path from the queue; create new paths by expanding the
first path to all the neighbors of the terminal node.
o Reject all new paths with loops.
o ...
o Add the new paths, if any, to the queue.
o ...
• If the goal node is found, announce success and return the path to the goal state
found; otherwise, announce failure.
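
A minimal Scheme sketch of this schema, assuming the problem-specific goal? and
expand procedures described above and a merge-queue procedure (discussed below)
that determines the search order; paths are kept newest-state-first, and the
procedure names are illustrative.

;; A path is a list of states, most recent state first.
;; goal?       : state -> #t / #f
;; expand      : state -> list of successor states
;; merge-queue : new-paths queue -> queue   (determines the search order)
(define (search start goal? expand merge-queue)
  (let loop ((queue (list (list start))))
    (cond ((null? queue) 'failure)
          ((goal? (car (car queue))) (reverse (car queue)))
          (else
           (let* ((path (car queue))
                  ;; reject successors already on this path (loops)
                  (new-states (filter (lambda (s) (not (member s path)))
                                      (expand (car path))))
                  (new-paths (map (lambda (s) (cons s path)) new-states)))
             (loop (merge-queue new-paths (cdr queue))))))))
;; (filter is SRFI-1; it is built into most Schemes.)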

Depth-first, breadth-first, nondeterministic search


• Depth-first search
Add the new paths to the front of the queue. (That is, use a stack.)
o Backtracks when a dead-end or "futility" limit is reached
o Appropriate when there is a high branching factor
o May be advantageous when many solutions exist but only one needs to be
found
o May fail to find a solution
• Breadth-first search
Add the new paths to the back of the queue.
o Appropriate when there are long useless branches but not when there is a
high branching factor (time and space complexity is exponential)
o Guaranteed to find a solution (if there is one) and to find the one with the
fewest steps (though not necessarily the least cost) first
• Nondeterministic search
Add the new paths at random places in the queue.
o When unable to choose between depth-first and breadth-first

To implement the type of search, we can add a merge-queue argument to our search
procedure. There is a different one of these for each of the ways we will add new paths to
the queue.
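
For example, two merge-queue procedures consistent with the sketch above (the
names are illustrative):

;; Depth-first: new paths go on the front of the queue (a stack).
(define (depth-first-merge new-paths queue)
  (append new-paths queue))

;; Breadth-first: new paths go on the back of the queue.
(define (breadth-first-merge new-paths queue)
  (append queue new-paths))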

Tower of Hanoi
• To do best-first search using your basic search procedure, you only need to define
a new merge-queue procedure. best-first-merge takes the queue, the new
paths, and the problem-specific estimate procedure, adds the new paths to the
queue, and then sorts the new queue by the values returned by estimate when
applied to the first state on each path. You can use the Scheme procedure sort to
do this. sort takes a comparison predicate and a list and sorts the list using the
predicate to compare items.
• Figure out how you will represent states for your particular problem. Here's one
way for Tower of Hanoi.
o For the Tower of Hanoi Puzzle, states are lists consisting of a sublist for
the disks on each peg. Numbers represent the diameters of the disks, and
they are arranged from top to bottom. Thus this is the initial state for the 3-
disk puzzle:

((1 2 3) () ())

• Write the goal?, expand, estimate, and print-state procedures for your
particular problem.
o For the Tower of Hanoi Puzzle, these are one possibility (goal? and
expand are sketched after this list).
• Call the search procedure on the problem-specific states and procedures.
o Best-first search on the Tower of Hanoi Puzzle: a trace indicating states
which are searched and the queue at each point in the search
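
A minimal sketch of goal? and expand for the 3-disk puzzle with this state
representation; the procedure names are illustrative, and the goal state is
hard-wired for three disks.

;; A state is a list of three pegs; each peg is a list of disk sizes, top first.
(define (hanoi-goal? state)
  (equal? state '(() () (1 2 3))))

;; Move the top disk of peg `from` onto peg `to`, or return #f if illegal.
(define (move state from to)
  (let ((source (list-ref state from))
        (dest   (list-ref state to)))
    (if (or (null? source)                         ; nothing to move
            (and (not (null? dest))
                 (> (car source) (car dest))))     ; larger onto smaller: illegal
        #f
        (let ((disk (car source)))
          (map (lambda (i peg)
                 (cond ((= i from) (cdr peg))
                       ((= i to)   (cons disk peg))
                       (else peg)))
               '(0 1 2) state)))))

;; All states reachable in one legal move.
(define (hanoi-expand state)
  (filter (lambda (s) s)
          (map (lambda (pair) (move state (car pair) (cadr pair)))
               '((0 1) (0 2) (1 0) (1 2) (2 0) (2 1)))))

;; e.g. (search '((1 2 3) () ()) hanoi-goal? hanoi-expand breadth-first-merge)
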
Heuristic search: using estimated distance remaining
• Another procedure specific to the problem: estimate, which takes a state and
estimates the distance from it to a goal state
• Hill climbing
Sort the new paths by the estimated distance left to the goal (using estimate) and
add them to the front of the queue.
o Parameter-oriented hill climbing: each problem state is a setting for a set
of parameters
o Problems for hill-climbing: foothills (local maxima), plateaus, ridges
o Nondeterministic search as a way of escaping from local maxima
o Gradient ascent
For a parameter x and "goodness" g which is a smooth function of x, the
change in x should be proportional to the speed with which g changes as a
function of x, that is, ∂g/∂x.
• Best-first search
After adding new paths to the queue, sort all the paths in the queue by the
estimated distance left to the goal (using estimate).
Unlike hill climbing, can jump around in the search space.

Heuristic search: optimal search


• British Museum procedure: blindly find all paths, selecting best
• Branch-and-bound search
After adding new paths to the queue, sort all the paths in the queue by the current
path length.
Note: When a goal state is found, it is still necessary to extend partial paths which
are shorter than the complete one because they may end up shorter overall.
• Using underestimates of distance remaining
After adding new paths to the queue, sort all the paths in the queue by an
underestimate of total path length (using estimate).
Underestimates allow you to stop when partial path estimates are longer than the
shortest complete path.
• Eliminating redundant paths
After adding new paths to the queue, if there are two or more paths reaching a
common node, keep only the one with the shortest path length.
• A*: branch-and-bound with underestimates and redundant paths eliminated

Genetic search
• Parameter adjustment, function optimization problems
o Calculus methods
o Blind search and random methods
o Parallel search
• Evolutionary computation
o What's needed for evolutionary computation (abstract or real) to work
 "Creatures" which
1. Give birth to other creatures, passing on their traits to them
2. Die
 A way for traits to be passed on: an inherited genotype, in
interaction with the environment, results in a phenotype
 A way of evaluating the creatures' traits: some aid in survival or
reproduction, others don't ("survival of the fittest")
 A way of generating new traits: mutation
o How it works
 Each creature is born with some combination of traits; it may not
be possible to simply figure out what combination works best for
the environment of the creatures.
 Creatures live their lives. Some survive long enough to reproduce
and have offspring.
 If a particular trait or combination of traits helps creatures
reproduce or live longer, creatures with that trait (those traits) will
tend to have more offspring.
 Creatures pass on (at least some of) their traits to their offspring. In
sexual reproduction, they pass on a combination of the parents'
traits.
 The percentage of creatures with the good traits should increase on
each generation.
 There is a small probability that a new creature will end up with
some random traits which it did not inherit from its parent(s). In
this way new traits or combinations of traits can be tried out in the
world.
o Genetic algorithms
 What makes them special

Work from a coding of the parameter set, not the parameters themselves
Search from a population of points, not a single point
Use fitness information only, no auxiliary knowledge
Use probabilistic transition rules
 Used for
Parameter-oriented search for problems in which partial
solutions are not evaluated (paths are not sought)
Modeling biological evolution
Designing a suitable initial architecture for cognition (say, a
neural network)
 The basic GA
Operators
 Selection: Select individuals in the population for
mating on the basis of the individuals' fitness. A
common choice is fitness-proportionate selection:
the probability of selecting an individual is its
relative fitness (implemented through "roulette-wheel
sampling"; a Scheme sketch appears at the end of this section)
 Crossover: Combine the genomes of two parents to
produce the genomes of the offspring. The most
common choice: select a position and exchange the
substrings before and after that locus between two
genomes to create two offspring.
 Mutation: Make random changes in the genome of
an individual. The most common choice: with a
small probability, flip each bit in the genome.
Parameters
 n individuals in the population
 l loci in each genome
 Probability of crossover: pc (often something like
.7)
 Probability of mutation: pm (often something like
.001)
Environment
 Fitness function f which evaluates each individual
assigning a quantity to it
 Or (less commonly) a "world" which permits some
individuals to produce more offspring than others
Algorithm
 For each run
 Start with a population of n randomly
generated genomes
 For each generation
 (Realize the genomes as phenomes
(individuals).)
 Evaluate each of the individuals with
the fitness function.
 Until n offspring have been created,
 Using the selection operator,
choose a pair of parents from
the population.
 Produce two offspring. Use
the crossover operator with
probability pc. Otherwise
produce copies of the parents.
 For each locus in each new
offspring, apply the mutation
operator with probability pm.
 Place the resulting offspring
in the new population.
 Replace the old population with the
new.
 An example: maximizing the function f(x) = x², where x is the
integer encoded in binary by the genome

Generation  Genome       Fitness (relative)  Individuals selected for mating
                                             (| marks the crossover point)

1           0 1 0 1 0    100 (.06)           1 1 1|1 0
            1 1 1 1 0    900 (.57)           1 1 0|0 0
            1 1 0 0 0    576 (.36)           1|1 1 1 0
            0 0 1 0 0     16 (.01)           0|1 0 1 0

2           1 1 1 0 0    784 (.34)           1 1 1 0|0
            1 1 0 1 0    676 (.29)           1 1 0 1|0
            1 1 0 1 0    676 (.29)           1 1 1|0 0
            0 1 1 1 0    196 (.08)           1 1 0|1 0
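
A minimal Scheme sketch of fitness-proportionate ("roulette-wheel") selection,
assuming the population is a list of genomes with a parallel list of fitness
values, and that (random) returns a real number in [0, 1) as in Racket; all
names are illustrative.

;; Choose one individual; the probability of choosing it is its share of
;; the total fitness.
(define (select-individual population fitnesses)
  (let* ((total (apply + fitnesses))
         (spin  (* (random) total)))          ; a random point on the wheel
    (let loop ((pop population) (fit fitnesses) (so-far 0))
      (let ((so-far (+ so-far (car fit))))
        (if (or (null? (cdr pop)) (< spin so-far))
            (car pop)
            (loop (cdr pop) (cdr fit) so-far))))))
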
Forward and backward chaining
The basic elements
Given a set of assertions (facts), a set of rules, and a goal, prove the goal.

• Assertions
Each a predicate calculus sentence with no variables and no connectives other
than not.
• Goal
Either a sentence with no variables and no connectives, in which case the goal is
to prove that it is true (inferable from the facts and the rules), or an existentially
quantified sentence whose variables are to be assigned values (if possible) given
the facts and the rules.
• Rules
Each a universally quantified implication with a conjunction of sentences as
antecedent and a single sentence as consequent.

An example
• Assertions
o A ball is on a block
(on ball1 block1)
o A pyramid is above the ball.
(above pyramid1 ball1)
• Goals
o Is the pyramid above the block?
(above pyramid1 block1)
o What's above the block?
(above ?x block1)
• Rules

1. If something is on something, it's also above it.
(((on ?x ?y)) (above ?x ?y))
2. If a is above b, b is below a.
(((above ?a ?b)) (below ?b ?a))
3. If b is above c and a is above b, then a is above c.
(((above ?b ?c) (above ?a ?b)) (above ?a ?c))

(In Homework 2 you will be doing a restricted version of forward chaining. Because each
rule has only one conjunct in its antecedent, you can iterate through the assertions rather
than the rules to attempt to find new assertions (extend the current state).)

Forward chaining
Attempt to match the antecedents of rules with the assertions, adding new assertions
based on the consequents if this is possible, until an assertion matching the goal is added.

To prove: (above pyramid1 block1)

• The antecedent of the first rule matches (on ball1 block1), so we can assert
(above ball1 block1).
• The antecedent of the third rule matches (above ball1 block1) and (above
pyramid1 ball1), so we can assert (above pyramid1 block1). This matches
the goal.

Backward chaining
Attempt to match the consequents of rules with a goal, replacing the goal with new goals
based on the antecedent of the rule if this is possible, until all of these goals match
assertions.

To prove: (below block1 ?x)

• The consequent of the second rule matches the goal, so we can replace the goal
with (above ?x block1)
• The consequent of the first rule matches the goal, so we can replace the goal with
(on ?x block1). This goal matches the assertion (on ball1 block1) with the
variable binding ?x = ball1
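
Both directions of chaining depend on matching a pattern containing variables
against an assertion and returning variable bindings. A minimal Scheme sketch of
such a matcher, under the convention that variables are symbols beginning with ?
(the names pc-variable? and pc-match are illustrative):

;; Return an extended list of (variable value) bindings, or #f on failure.
(define (pc-variable? x)
  (and (symbol? x)
       (char=? (string-ref (symbol->string x) 0) #\?)))

(define (pc-match pattern assertion bindings)
  (cond ((eq? bindings #f) #f)                       ; an earlier failure propagates
        ((pc-variable? pattern)
         (let ((binding (assq pattern bindings)))
           (cond ((not binding) (cons (list pattern assertion) bindings))
                 ((equal? (cadr binding) assertion) bindings)
                 (else #f))))
        ((and (pair? pattern) (pair? assertion))
         (pc-match (cdr pattern) (cdr assertion)
                   (pc-match (car pattern) (car assertion) bindings)))
        ((equal? pattern assertion) bindings)
        (else #f)))

;; (pc-match '(above ?x block1) '(above ball1 block1) '())  =>  ((?x ball1))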

More on representation
Frames
Much of human knowledge seems to be organized in chunks representing types of objects
and events: frames or schemas.

Objects

Consider what we know about CABBAGE: what it looks like, how it tastes, how it's
prepared, how nutritious it is, how much it costs, what other vegetables and plants it's
related to. Within the CABBAGE frame, there is knowledge about what CABBAGE has.
Since CABBAGE is (probably) a basic-level category, there is a lot of knowledge in its
frame.

But some knowledge about CABBAGE is shared with other vegetables, and some
knowledge about vegetables is shared with other food items. Also there are subtypes of
CABBAGE such as RED-CABBAGE. Knowledge seems to be organized in an
inheritance (is-a) hierarchy, one sort of ontology. Categories also have default
properties that can be overridden by subcategories or instances.

Events

Consider what we know about instances of GOING and GIVING.

When something is given, there is a GIVER, a RECEIVER, and a GIVEN-OBJECT.


Before the giving, the GIVER controls the OBJECT, and the RECEIVER doesn't. After the
giving, the RECEIVER controls the OBJECT, and the GIVER doesn't. The giving is
consciously initiated by the GIVER, who wants the RECEIVER to control the OBJECT.

(forall (?g ?r ?o ?t0)
  (if (give ?g ?r ?o ?t0)
      (and (exists (?t1)
             (and (before ?t1 ?t0)
                  (control ?g ?o ?t1)
                  (not (control ?r ?o ?t1))
                  (exists (?t2)
                    (and (before ?t1 ?t2)
                         (goal ?g (control ?r ?o ?t2) ?t1)))))
           (exists (?t3)
             (and (before ?t0 ?t3)
                  (control ?r ?o ?t3)
                  (not (control ?g ?o ?t3)))))))

Though we don't have an English verb for it, there is a more abstract event category that
includes GIVE, STEAL, TAKE, and RECEIVE. And there are subtypes of GIVE such as
DONATE and LEND.

Using frames

• How is knowledge within a frame instantiated when an instance of the category is
created?
• How can we efficiently access inherited knowledge for an instance of a category?
• How can we answer questions about properties using an inheritance hierarchy?

Frame representation
(PHYS-OBJ
(is-a THING)
(color)
(weight)
(shape)
(edibility)
...)
(FOOD-ITEM
(is-a PHYS-OBJ)
(nutritional-value)
(fat-content)
(starch-content)
(vitamin-content)
(source)
(taste)
(preparation
(processing)
(cooking)
(accompanying-ingredients)
(serving))
(availability-form)
(edibility YES)
(english-lex
(neutral "FOOD")
(informal "GRUB"))
(japanese-lex "TABEMONO")
...)
(VEGETABLE
(is-a FOOD-ITEM)
(plant-part)
(plant-type)
(source PLANT)
(nutritional-value HIGH)
(fat-content LOW)
(vitamin-content HIGH)
(english-lex
(neutral "VEGETABLE")
(informal "VEGGIE"))
...)
(LEAF-VEGETABLE
(is-a VEGETABLE)
(plant-part LEAF)
(color GREEN)
(taste BITTER)
...)
(COLE-VEGETABLE
(is-a VEGETABLE)
...)
(CABBAGE
(is-a LEAF-VEGETABLE)
(is-a COLE-VEGETABLE)
(plant-type CABBAGE-PLANT)
(taste CABBAGE-TASTE)
(availability-form VEGETABLE-HEAD)
(shape SPHERICAL)
...)
(RED-CABBAGE
(is-a CABBAGE)
(color PURPLE)
(english-lex "RED CABBAGE")
...)
(RED-CABBAGE23
(is-a RED-CABBAGE)
(preparation
(accompanying-ingredients
{MAYONNAISE, MUSTARD, TARRAGON})
(cooking NIL)))
(ABSTRACT-TRANSFER
(source ?s)
(destination ?d)
(object ?o)
(precondition1
(is-a CONTROL)
(controller ?s)
(object ?o))
(effect1
(is-a CONTROL)
(controller ?d)
(object ?o)))
(GIVE
(is-a ABSTRACT-TRANSFER)
(source ?s)
(destination ?d)
(object ?o)
(effect1 ?e)
(precondition2
(is-a WANT)
(wanter ?s)
(wanted ?e)))
(GIVE23
(is-a GIVE)
(source AL)
(destination SUE)
(object
(is-a DVD)))

A more elaborate style of frame representation from the CHEF program (Hammond,
1989)

(defmop i-m-beef-and-green-beans (m-recipe)
(meat i-m-beef)
(vege i-m-green-beans)
(style i-m-stir-fried)
(steps m-recipe-steps
(bone-steps i-m-empty-group)
(chop-steps m-step-group
(1 m-chop-step (object i-m-beef)))
(let-stand-steps m-step-group
(1 m-let-stand-step
(object m-ingred-group
(1 i-m-beef)
(2 i-m-spices))))
(stir-fry-steps m-step-group
(1 m-stir-fry-step
(object m-ingred-group
(1 i-m-beef)
(2 i-m-spices)))
(2 m-stir-fry-step
(object m-ingred-group
(1 i-m-beef)
(2 i-m-green-beans)
(3 i-m-spices))))
(serve-steps m-step-group
(1 m-serve-step
(object m-ingred-group
(1 i-m-beef)
(2 i-m-green-beans)
(3 i-m-spices))))))
(define instantiate
  (lambda (frame bindings)
    ...))

(define inherit
  (lambda (instance role)
    ...))
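
A minimal sketch of how inherit might work, assuming (purely for illustration)
that frames are stored in a global association list mapping frame names to slot
lists and that each frame has at most one is-a link (the frames above allow more
than one):

;; A toy frame table (a tiny subset of the frames above).
(define *frames*
  '((CABBAGE        (is-a LEAF-VEGETABLE) (taste CABBAGE-TASTE))
    (LEAF-VEGETABLE (is-a VEGETABLE) (color GREEN) (taste BITTER))
    (VEGETABLE      (is-a FOOD-ITEM) (source PLANT))))

(define (frame-slots name)
  (let ((entry (assq name *frames*)))
    (if entry (cdr entry) '())))

;; Look for the role locally; if it is absent, climb the is-a link.
(define (inherit frame role)
  (let ((slots (frame-slots frame)))
    (cond ((assq role slots) => cadr)
          ((assq 'is-a slots) => (lambda (link) (inherit (cadr link) role)))
          (else #f))))

;; (inherit 'CABBAGE 'taste)  =>  CABBAGE-TASTE   (the local value wins)
;; (inherit 'CABBAGE 'color)  =>  GREEN           (an inherited default)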

Semantic networks
Knowledge in the form of a graph. A single node corresponds to a frame symbol. Two
"styles":

1. labeled links represent many relations and roles
2. all relations and roles are represented by nodes; there are only a small number of
very general link types

Examples in a particular formalism of type 2 (NETL: Fahlman, 1979)

Types, roles and value, is-a

Individuals, sets

Insects have six legs. Each leg is jointed. Ladybugs are insects. Sam is a ladybug. Sam's
right rear leg is broken.
Events
An example of a type 1 formalism
Inheritance in a semantic network

Distributed connectionist representation


Elements

A fixed network of nodes (units) that can be active or not (or active to different degrees).
The unlabeled, weighted, modifiable links (connections) between the units represent the
tendency for pairs of units to be co-activated. Any representation is a pattern of
activation across the network.

A representation is distributed if each element is involved in the representation of
multiple concepts and each concept is represented by multiple elements. Units may
represent primitive semantic features or (for example, in holographic representations)
may not be directly interpretable.

Instantiation, inheritance

Instantiation does not involve the creation of any new hardware (because the size of the
network is fixed) but rather the activation of certain units and the strengthening or
weakening of some weights. There is no distinction in the network between individuals
and types.

Different types are not represented by separate units in the network.


Abstraction/generality corresponds to uncertainty about the value of units.

Inheritance is in a sense automatic. When we activate a type, for example, on the basis of
an input word such as cabbage, the pattern represents features of the instance as well as
the type.

Representing commonsense knowledge


Primitives

• Acts (Schank, etc.)


• Semantic roles

Scripts, plans, goals

Knowledge of stereotypical situations helps in understanding language (Schank, etc.).

Mary wanted that camera really bad, so she went and bought a gun.

Phil went into a restaurant and sat down at a table. A waiter came over after a few
minutes, but he said he wasn't ready to order.

Big Projects (CYC, etc.)

The Grounding Problem

• Will the knowledge be in a usable form if it's not tied to perception and action?
• Is it possible to build in commonsense knowledge, or will it have to be learned?

Machine learning: overview


Introduction
• Internal changes in biological systems recorded in memory of one kind or
another. Change may be more or less permanent, resulting in different (usually
improved) behavior following the change.
Kinds of change and kinds of memory:
o Evolutionary change; genetic memory, genotype
o Development; phenotype
o Learning; long-term memory
o Cultural change; cultural memory
o Processing (temporary change); short-term (working) memory
• Why learn (and develop), rather than evolve? (Miller & Todd)
o Learning (development) allows an organism to build a more complex
phenotype than it could otherwise, given a genotype of a certain size.
Environmental regularities can do much of the work of wiring up adaptive
behavior-generators.
o Learning allows an organism to make use of the past as well as the here-
and-now. This sort of learning consists in the creation of episodic
memories and their retrieval later on.
o Learning allows an organism to adjust its behavior faster than natural
selection would allow. This is advantageous because there may be changes
in
 the organism's body
 the organism's family
 the organism's environment

This function dominates thinking about learning but may be less important
than the other two for most animals.

Some basic concepts


• Availability of feedback
o Available
 Supervised learning: there is access to the correct output
 Reinforcement learning: there is access to the goodness of the
actual output
 The credit assignment problem in reinforcement (and sometimes
supervised) learning: when the behavior of the system is wrong,
what aspect of the system's internals led to the error?
o Unavailable: unsupervised learning, no information about the correctness
of outputs
Example: A speech system is to be trained to recognize English words.
During an initial phase, the system is simply exposed to samples of
English speech without being told what the content of the speech is. The
hope is that the system will pick up on some of the systematic phonetic
properties that characterize English.
• Prior knowledge
o Learning from scratch
o Building on prior knowledge
o Martin's law: You cannot learn anything unless you almost know it
already.
• What is learned
o Stimulus-response behavior
o Concepts
o Regularities in the environment: cooccurrences, clustering, prediction
o Utility information concerning possible states of the world
o Results of possible actions
o Ways of organizing knowledge internally to maximize performance

Induction
(supervised or reinforcement)

• Learning the representation of a function f


• Given a collection of examples of f, find a function h (the hypothesis) which
approximates f
• Positive and negative examples of the function
• Generalization of the current hypothesis through positive examples
• Specialization of the current hypothesis through negative examples; value of near
misses
Example of a bad negative example: A robot is being trained to recognize a soda
can. It is shown a can from two different angles. Then in order that it doesn't
produce too general a concept, it is shown a chair and told that that is not an
example of a soda can. The robot does not seem to improve.
• In general induction is not sound: a hypothesis is not usually a logical conclusion
of the data; numerous hypotheses may be consistent with the data
• Incremental learning: hypothesis is updated whenever an example arrives
Example: A robot is being trained using reinforcement learning to find all of the
empty soda cans in the lab and throw them in the recycling bin. This proves too
difficult, so the robot is first trained only to recognize soda cans. Then it is trained
to approach soda cans. ...

Introduction to neural networks


Overview
• Elements

o Units: simple processing elements that respond to the behavior of other
units via input connections, produce an activation, and send it along output
connections to other units. The activations of all of the units in the
network represent the system's short-term memory.
o Connections: Weighted (unlabeled) links between units, multiplying the
activation from the source unit on its way to the destination unit. The
weights along all of the connections in the network represent the system's
long-term memory.
• Formalization
o State
 Vector of activations x(t)
 Matrix of weights W
o Task
 Set of input vectors I(t), possibly infinite
 (Sometimes) an associated set of target vectors T(t)
o Dynamics
 Discrete (difference equations) or continuous (differential
equations)
 Activation
x(t+1) = g(h(x(t), W(t), I(t)))
g the activation function, h the input function
 Weight
W(t+1) = f(x(t), W(t), I(t), T(t))
f the learning rule
• Some differences between models
o What sort of connectivity (reflected in where there are gaps in the weight
matrix)?
o Are there targets (a supervised network)?
o Is this a feedforward network, a partially recurrent network, or a
completely recurrent (settling, attractor, constraint satisfaction) network?
o Is the activation function threshold or continuous?
o Does the network handle sequences of inputs or just static patterns?
o Are there separate input and output units?
o Does the network have hidden units?
• Running a network
o Update units
 For attractor (feedback) networks, update units (usually randomly
selected) until the network has settled (no further changes in
activations occur)
 For feedforward (and simple recurrent networks), update each unit
once in a (mostly) fixed sequence
 Update a unit
 Calculate the input (h) to the unit, the weighted sum of the
activations of units feeding into the unit
 Calculate the activation (x) of the unit, a function (g) of the
input
 Example: activation of a unit with a threshold activation function:
if h_i > θ_i, g(h_i) = 1
else g(h_i) = 0
o Update connections
 Following the presentation of a single training pattern or the whole
set of training patterns
 Weight changes are usually small changes in a given direction,
determined by a learning rate (η)
 The direction and magnitude of the change is usually proportional
to the activation of the source unit and either the activation of the
destination unit or some error measure.

Supervised learning in neural networks


• Patterns
o Training set: pairs of inputs and targets for training the network
o Test set: pairs of inputs and targets for testing the network for
generalization
• Training
o Training phase: weights are adjusted in response to training set
o Test phase: weights are not adjusted as test set is presented

Feedforward networks
• Appropriate for problems with no interacting bottom-up and top-down effects;
pattern association
• Usually trained with error-driven learning, a form of supervised learning in
which the change in weights depends on the error, ultimately the difference
between the target and the actual output for each output unit
• Networks with no hidden units, for example, perceptrons
• Networks with hidden layers, usually trained with backpropagation

Perceptrons
Pattern association problems
• Given a representation of one kind of entity (the input), generate a representation
of another (the output).
• Both representations often take the form of patterns, that is, vectors of numbers.
In this case, the inputs and outputs are represented in a distributed way.
• Training (supervised): expose the learner to a number of input patterns and their
correct associated output patterns.
• Generalization: given an unfamiliar input pattern, respond with an appropriate
output pattern.
• Examples
o Perceptual input, category output (pattern classification)
o State (perceptual) input, action Q-value output
o Word input, meaning output
o Meaning input, word output
o English input, Spanish output
• Implementation in feedforward neural networks

Feedforward networks and pattern association


• In a feedforward neural network, the units can be thought of as arranged in
separate groups, or layers. The connections joining units in two layers all have the
same direction.
• Some of the units are designated input units. These are clamped to particular
activations when the network is presented an input pattern. The activation of a
clamped unit does not change.
• Some of the units are designated output units; their activations represent the
network's response to the current input, the pattern that the network "believes"
should be associated with the input pattern.
• Each output unit repeatedly updates its activation while the network is "running".
In a feedforward network, each output unit updates its activation once in response
to each input pattern.
• Also the possibility of one or more layers of hidden units between the input and
output layers

Perceptrons: how they work


• The simplest architecture for supervised pattern classification
• Architecture
o Input units + bias (threshold) unit
o Binary output unit; each output unit a separate perceptron
• Input and activation rules

y = δ(∑_{j=1..N} w_j x_j + b)

δ(x) = 1 if x > 0; 0 otherwise

• That is, the activation function is a simple linear threshold function.


• Learning
o For each input pattern p, there are three cases:
 The pattern is classified correctly. In this case, no changes are
made to the weights.
 The target is 1, but the network yielded 0. In this case, we need to
change each weight so that the output will be higher. We can
achieve this by adding the input vector (or a fraction η of the input
vector, where η is the learning rate) to the weight vector (the
superscript p represents the particular pattern):

Δw = η x^p

 The target is 0, but the network yielded 1. In this case, we need to
change each weight so that the output will be lower. We can
achieve this by subtracting the input vector (or a fraction of the
input vector) from the weight vector:

Δw = -η x^p

o The three cases can all be expressed with this general rule (a Scheme
sketch of this update appears at the end of this section):

Δw = η x^p (t^p - y^p)

• The Perceptron Convergence Theorem (Rosenblatt)


o If there is a set of weights that solves the problem, then there is a weight
vector w* that never yields a sum lying in a region around 0 of width 2ε.
That is, for inputs that are supposed to yield a positive output, w*
yields an output greater than ε, and for inputs that are supposed to give a
negative output, w* yields an output less than -ε.
o The theorem proves that the angle between the current weight vector and
w* is bounded for each training pattern by an envelope that decreases with
each presentation of the pattern.
o The theorem does not guarantee that the angle between the current and
final weight vectors will decrease monotonically, only that the envelope
within which this angle is found decreases with the number of updates.
That is, the error may sometimes rise for a given run through the training
patterns.
o The theorem only guarantees convergence if there is a set of weights that
solves the problem.
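
A minimal Scheme sketch of the perceptron output and the update rule above,
assuming the inputs and weights are lists of numbers (all names illustrative):

;; Output: 1 if the weighted sum plus the bias exceeds 0, else 0.
(define (perceptron-output weights bias inputs)
  (if (> (+ bias (apply + (map * weights inputs))) 0) 1 0))

;; One training step: returns the list (new-weights new-bias).
(define (perceptron-update weights bias inputs target eta)
  (let* ((output (perceptron-output weights bias inputs))
         (diff   (- target output)))          ; (t - y): -1, 0, or 1
    (list (map (lambda (w x) (+ w (* eta diff x))) weights inputs)
          (+ bias (* eta diff)))))            ; bias = weight on a constant 1 input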

Perceptrons: what they can and can't do


• One input
o The input patterns fall along a line; all points on one side of a given value
are in the category, those on the other side outside it. The trainable bias
establishes the threshold, the sign of the single weight the direction of
category membership on either side of the threshold.

o If the points in the category are broken up by points not in the category,
there is no way for the network to solve the problem.

• Two inputs
o The two weights and the bias define a line
w_1 x_1 + w_2 x_2 + b = 0

with slope -w_1/w_2 and y-intercept -b/w_2. Points on one side of the line turn
on the output unit; points on the other side turn it off. Because the
perceptron defines an inequality (two possibilities for each line), three
values are required rather than the two required to define a line.

o The line defined by the weights and bias divides the input space into two
regions. A perceptron in 2-space can only learn to separate two sets of
points that are on either side of a line.
• In general, a perceptron can only solve a pattern classification problem if the sets
of patterns are linearly separable; that is, if in N-space, there is a hyperplane of
N-1 dimensions which separates the two sets of points. N+1 values (N weights and
the trainable bias) are needed to specify the desired behavior because in addition
to the hyperplane, we need to say on which side of the hyperplane points are in
the category.
• Problems which perceptrons can't solve
o Examples of non-linearly separable sets of patterns
 Exclusive OR: (0, 0), (1, 1); (1, 0), (0, 1)
 Connectivity

o Solving the problems with additional input dimensions, for example, for
exclusive OR an input unit that codes for whether the other two inputs are
the same
o Solving the problems with hidden units
 Networks with hidden units with linear activation functions are
equivalent to networks without hidden units
 Hidden units must have non-linear activation functions, for
example, a simple threshold function (like perceptron output units)
or sigmoidal or gaussian functions.
 One possibility: provide "enough" hidden units connected by
random, hard-wired weights to the input units.
 Another possibility: train the input-to-hidden weights. But how?

The delta rule and backpropagation


Activation functions
• The simplest activation function is the identity function: the activation is just the
input.
• Another possibility is a threshold function which converts inputs above some
threshold to a maximum value (usually 1.0) and inputs below the threshold to a
minimum value (usually -1.0 or 0.0).
• A third possibility is a "soft" threshold function which smooths out the region
near the threshold. The following function, the sigmoid, has a minimum of 0.0
and a maximum of 1.0 (note that these are never reached):

g(h) = 1 / (1 + e^(-h))

(h is the input to the unit.)
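
As a sketch, the sigmoid and its derivative (which the delta rule and
backpropagation below will need) in Scheme; the fact that g′(h) = g(h)(1 - g(h))
follows from the definition:

(define (sigmoid h)
  (/ 1.0 (+ 1.0 (exp (- h)))))

(define (sigmoid-deriv h)
  (let ((g (sigmoid h)))
    (* g (- 1.0 g))))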

The delta rule


• Supervised learning (pattern association) in feedforward networks with multiple
output units and continuous activation functions
• The delta rule (least mean squares rule) for supervised learning (when the
activation function of the output unit is the identity function)

Δw_ji = η (t_j - x_j) x_i

(The x's represent activations, t represents a target, and η is a learning rate between
0.0 and 1.0.)

• A formal way to derive the delta rule for the more general case
o Gradient descent learning: learning by moving in the direction which
looks locally to be the best
o For supervised neural network learning, the best direction to move in
"weight space": for each weight, how the global error changes with respect
to that weight
o A global error function: for each pattern the sum of the errors over all of
the output units

E = ∑_j ½ (t_j - x_j)²

o We want to move in "weight space" in a direction which is opposite that of
the slope of the error function with respect to each weight because this
will move us toward a region with a lower error. The size of the move
should be proportional to the magnitude of the slope.
1. To find the slope, we take the partial derivative of the error with
respect to the weight. But the only element in the sum of error
terms that depends on the weight is the one for the output unit
where that weight ends (j in what follows).

∂E/∂w_ji = ∂[½ (t_j - x_j)²] / ∂w_ji

2. Using the chain rule, we can decompose this derivative into two
that are easier to calculate:

(∂[½ (t_j - x_j)²]/∂x_j) (∂x_j/∂w_ji)

3. The first derivative is easy to figure; it's just

-(t_j - x_j)

4. The second derivative can be decomposed using the chain rule
again if we remember that the activation of unit j is a function of
the input to the unit, h_j, which is in turn a function of the weights
into the unit.

∂x_j/∂w_ji = (∂x_j/∂h_j) (∂h_j/∂w_ji)

5. Since the activation of an output unit is the activation function g
applied to the input h, the first derivative on the right-hand side of
(4) is just

g′(h_j)

that is, the derivative of whatever the activation function is at the
value of the current input to unit j.

6. The second derivative on the right-hand side of (4) can be derived
as follows:

∂h_j/∂w_ji = ∂(∑_k w_jk x_k)/∂w_ji = x_i

because none of the other weights or input activations depend on
w_ji.

7. Putting all of the parts together, we get

∂E/∂w_ji = -(t_j - x_j) g′(h_j) x_i

8. Remember that we want the weight change to be proportional to
the negative of the derivative with respect to the weight. So with a
learning rate to control the step size for weight changes, we get the
more general delta (least mean squares) learning rule

Δw_ji = η (t_j - x_j) g′(h_j) x_i

Backpropagation
• Problems that are not linearly separable can't be solved by a perceptron (or a
network learning with the delta rule).
• How hidden units can solve non-linearly separable problems.

• Backpropagation: a gradient descent algorithm for learning the weights into
hidden units as well as output units
• A network with hidden units with linear activation functions is equivalent to a
network with no hidden units, so at least the hidden units must have non-linear
activation functions, and these must be differentiable for backpropagation to
apply: usually the sigmoid function.
• The learning rule (a Scheme sketch of these computations appears at the end of
this section):

Δw_ji = η δ_j x_i

o For output units:

δ_j = (t_j - x_j) g′(h_j)

o For hidden units (k indexes the units in the next highest layer):

δ_j = [∑_k δ_k w_kj] g′(h_j)

• A famous example: NETTALK, the text-to-speech problem


• Some questions and concerns
o Does BP get stuck in local minima?
o Does it take forever to learn the weights?
 Faster as number of hidden units increases (assuming parallel
update)
 Faster with higher learning rate, within limits
o How does the network solve the problem? What sort of hidden-layer
representations does it build? Using statistical techniques to analyze
hidden-layer representations.
o Does it generalize? Does the network behave appropriately on patterns
which it has not been trained on?
 More local and more distributed (greater generalization) hidden-
layer patterns
 Effect of too many trainable connections: overfitting, the network
"memorizes" individual patterns rather than generalizing over them
• Optimization: setting the learning rate, other parameters
• Incremental training: learning a simpler task which enables the learning of a
more complex task
• Multiple tasks in a single network
o Catastrophic forgetting: does the network unlearn one set of patterns when
trained on a second?
o Does the network fail to learn two interfering tasks which it is trained on
simultaneously? Example: the what-where vision problem
o Modularity as a solution to problems of interference
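
A minimal sketch of one backpropagation step for a network with a single sigmoid
hidden layer (Python/NumPy; the sigmoid activations, the shapes, and the learning
rate are assumptions chosen for illustration, not part of the notes):

import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def backprop_step(W_hid, W_out, x, t, eta=0.5):
    # shapes: W_hid (n_hid, n_in), W_out (n_out, n_hid), x (n_in,), t (n_out,)
    # forward pass
    a_hid = sigmoid(W_hid @ x)
    a_out = sigmoid(W_out @ a_hid)
    # output deltas: (t_j - x_j) g'(h_j); for the sigmoid, g'(h) = a (1 - a)
    delta_out = (t - a_out) * a_out * (1.0 - a_out)
    # hidden deltas: [sum_k delta_k w_kj] g'(h_j)
    delta_hid = (W_out.T @ delta_out) * a_hid * (1.0 - a_hid)
    # weight changes: delta w_ji = eta * delta_j * x_i
    W_out += eta * np.outer(delta_out, a_hid)
    W_hid += eta * np.outer(delta_hid, x)
    return W_hid, W_out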

Sequential problems and simple recurrent networks
• Sequence processing: inputs consist of sequences of patterns; the network's output
depends on previous patterns as well as the current one
o Prediction: given a partial sequence, predict the next element (pattern)
o Sequence classification
o Parsing
• Sequence processing requires some form of short-term memory (in addition to
unit activations).
• Simple recurrent network (Elman net): recurrent connections on the hidden
layer with a time delay of one sequence event, usually implemented with a context
layer that maintains a copy of the hidden-unit activations on the last time step
• Training an SRN on prediction
o The input and output layers represent a single sequence event.
o During training, sequences of inputs are presented repeatedly.
o On a single training trial, an event is presented to the input layer, and the
network is run in the usual fashion, with the context layer treated as
another input layer.
o The target is the next event in the sequence. Error is back-propagated, and
weights are updated using the backpropagation rule, with context-to-
hidden weights treated exactly as input-to-hidden weights.
o Finally the activations on the hidden layer are copied to the context layer.
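
A minimal sketch of one SRN training trial on prediction (Python/NumPy; the
sigmoid activations and the learning rate are assumptions chosen for illustration):

import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def srn_prediction_trial(W_in, W_ctx, W_out, event, target, context, eta=0.1):
    # the context layer is treated exactly like a second input layer
    a_hid = sigmoid(W_in @ event + W_ctx @ context)
    a_out = sigmoid(W_out @ a_hid)             # prediction of the next event
    # backpropagate the prediction error
    delta_out = (target - a_out) * a_out * (1.0 - a_out)
    delta_hid = (W_out.T @ delta_out) * a_hid * (1.0 - a_hid)
    W_out += eta * np.outer(delta_out, a_hid)
    W_in  += eta * np.outer(delta_hid, event)
    W_ctx += eta * np.outer(delta_hid, context)
    context = a_hid.copy()                     # copy hidden activations to the context layer
    return W_in, W_ctx, W_out, context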

Unsupervised learning
Auto-association and content-addressable memories
• Auto-association: one form of unsupervised learning
o Patterns are associated with themselves
o Purposes: dimensionality reduction, data compression, pattern completion
o Implementation: Hopfield nets (pattern completion only), other constraint
satisfaction (settling) networks with hidden layers, feedforward nets (can
be trained with backpropagation)
o Content-addressable memories
Desired behavior:
 When part of a familiar pattern enters the memory system, the
system fills in the missing parts (recall).
 When a familiar pattern enters the memory system, the response is
a stronger version of the input (recognition).
 When an unfamiliar pattern enters the memory system, it is
dampened (unfamiliarity).
 When a pattern similar to a stored pattern enters the memory
system, the response is a version of the input distorted toward the
stored pattern (assimilation).
 When a number of similar patterns have been stored, the system
will respond to the central tendency of the stored patterns, even if
the central tendency itself never appeared (prototype effects).
o Discrete Hopfield networks
 Basic properties
 CAM
 Potentially completely recurrent
 Symmetric weights
 Activation rule (θ a threshold, sgn(): 1 if its argument is
positive, -1 otherwise):

xi(t + 1) = sgn(∑j wij xj(t) - θi)

 Settling: asynchronous, random update


 Training: single presentation of each pattern
 Each training pattern should yield a (fixed-point) attractor.
 Stability
 Lyapunov stability: if there is a function of the network
state which decreases or stays the same as the network is
updated, then the network is asymptotically stable.
 Energy of network (a Lyapunov function):

E = -½ ∑i ∑j wij xi xj

 For symmetric weights, this can be rewritten as

E = C - ∑(ij) wij xi xj

where (ij) refers to distinct pairs of indices, and C is a constant.

 Activation rule minimizes energy
Assuming no thresholds, for a given updated unit i, either
its activation is unchanged, in which case the energy is
unchanged, or it is negated, in which case xi and ∑j wij xj
have opposite signs, and x′i = -xi, where x′i is the activation
of unit i following the update.
Then the difference between the energy after and before the
update of unit i is

E′ - E = - ∑j≠i wij x′i xj + ∑j≠i wij xi xj
= 2 ∑j≠i wij xi xj
= 2 xi ∑j≠i wij xj
= 2 xi ∑j wij xj - 2 wii

The first of these terms is negative (xi and ∑j wij xj have
opposite signs) and the second is non-positive (self-weights
are non-negative, usually zero), so, for asynchronous updates,
the energy always either remains the same or decreases.

 Learning
 Storing Q memories in a Hopfield net:
 Hebbian learning: the weight on the connection joining two
units is proportional to the correlation between their
activations over the stored patterns:

wij = ∑p xip xjp

 For one pattern p, we get stability if, for all i :

xip = sgn(∑j wij xjp)

The expression in parentheses (the input to unit i) is

∑j wij xjp = N xip + ∑q≠p xiq ∑j xjq xjp

If the magnitude of the second term, the crosstalk term, is
less than N, then pattern p is stable. (A code sketch of
storage and settling appears below, after the notes on capacity.)

 Capacity of a network
 Crosstalk between patterns
 Number of random patterns storable is proportional to N (roughly 0.14N) if a
small percentage of errors is tolerated, but it is quite small relative to N.
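
A minimal sketch of Hebbian storage and asynchronous settling in a discrete Hopfield
net (Python/NumPy; the 1/N weight scaling, zero thresholds, and the toy patterns are
assumptions chosen for illustration):

import numpy as np

def hopfield_store(patterns):
    # Hebbian learning: w_ij proportional to the correlation of units i and j
    # over the stored +1/-1 patterns; no self-connections
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)
    return W

def hopfield_settle(W, x, steps=100, seed=0):
    # asynchronous, random updates with the sign activation rule (thresholds zero);
    # each update leaves the energy the same or lowers it
    rng = np.random.default_rng(seed)
    x = x.copy()
    for _ in range(steps):
        i = rng.integers(len(x))
        x[i] = 1.0 if W[i] @ x >= 0 else -1.0
    return x

# store two patterns, then complete the first pattern with its first bit flipped
pats = np.array([[1, -1, 1, -1, 1, -1], [1, 1, 1, -1, -1, -1]], dtype=float)
W = hopfield_store(pats)
probe = np.array([-1, -1, 1, -1, 1, -1], dtype=float)
print(hopfield_settle(W, probe))   # settles back onto the first stored pattern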

Competitive learning
• What it is
o Winner-take-all output units compete to classify input patterns, only one
(roughly) coming on at a time
o Clustering unlabelled data
o Categorization, vector quantization
• Simple, single-layer competitive learning
o Binary output units fully connected to (usually binary) input units by non-
negative weights
o Only one output unit fires at a time, the one whose input weight vector is
closest to the input vector (i* is the winning unit):

|wi* - x| ≤ |wi - x|
(for all i)

For normalized weights, the winner is always the one with highest input
(dot product of input pattern and weight vector):

wi* ⋅ x ≥ wi ⋅ x
(for all i)

o Winner-take-all process can be implemented by simply picking the unit
with the highest activation or through lateral inhibitory connections.
o Learning
 Weights initially random
 For each input pattern, update the weights into the winning unit
only
 The standard rule moves the winning weight vector directly
towards the input pattern:

Δwi*j = η (xj - wi*j)

 Because losers are not activated, the rule is equivalent to (yi is the
activation of the ith category unit)

Δwij = η yi (xj - wij)
 Geometric analogy
o Problem of "dead units", units which start out far away from input patterns
and never win
• Feature maps
o Networks in which location of output unit conveys information
o Output units have fixed positions in one-, two-, or three-dimensional grids
o Topology preserving map from the space of possible inputs to the line,
plane, or cube of the output units
 A mapping that preserves neighborhood relations.
 As two input patterns get closer in input space, the winning output
units get closer in output space.
o Self-organizing feature maps (Kohonen nets), one type of feature map
architecture
 The neighborhood relations in the output array are built into the
learning rule. Weights into many (sometimes all) units are changed
on each update (there may be no dead units).
 Winning output unit i*:

|wi* - x| ≤ |wi - x|
(for all i)
 Learning rule:

Δwij = ηΛ(i, i*) (xj - wij)

Λ(i, i*) = 1, for i = i*

The neighborhood function falls off with the distance |ri - ri*| in the
output array, where the r vectors are the coordinates of the units in
the output space.

 Network as an elastic net in which the weight vector of the winner
is dragged toward the input vector and the weight vectors of
neighboring units are pulled along with it. Nearby units respond to
nearby input patterns.
 Typical neighborhood function: a Gaussian,

Λ(i, i*) = exp(-|ri - ri*|2 / 2σ2)

(the sketch at the end of this section uses this function)
 Both σ and η start large and are decreased during training


 Result is sensitive to probability of inputs as well as their location
in input space: more output units are associated with regions of
higher probability
 1-to-1, 2-to-1, 2-to-2 mappings
 Convergence
 Usually in two stages: (1) untangling, (2) detailed adapting
 Kinds of tangles: twists (2 dimensions), kinks (1
dimension)
 Example applications
 Robot joint angles, rather than actual positions, as input
 Phoneme similarity
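
A minimal sketch of training a one-dimensional Kohonen map with a Gaussian
neighborhood (Python/NumPy; the linear decay schedules for η and σ and the toy
data are assumptions chosen for illustration):

import numpy as np

def train_som(data, n_units=10, epochs=50, eta0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.random((n_units, data.shape[1]))   # weights initially random
    positions = np.arange(n_units)             # unit coordinates in the output array
    for epoch in range(epochs):
        frac = epoch / epochs
        eta = eta0 * (1.0 - frac)              # eta starts large and decreases
        sigma = sigma0 * (1.0 - frac) + 0.1    # sigma starts large and decreases
        for x in rng.permutation(data):
            winner = np.argmin(np.linalg.norm(W - x, axis=1))   # |w_i* - x| <= |w_i - x|
            lam = np.exp(-((positions - winner) ** 2) / (2 * sigma ** 2))
            W += eta * lam[:, None] * (x - W)  # delta w_ij = eta * Lambda(i, i*) (x_j - w_ij)
    return W

# map 2-D points lying along a curve onto the 1-D array of output units
data = np.column_stack([np.linspace(0, 1, 200), np.linspace(0, 1, 200) ** 2])
print(train_som(data).round(2))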

Reinforcement learning
Markov decision processes
• The agent and the environment (the world)
• Discrete time
• States
At each time step, the agent's sensory/perceptual system returns a state, xt, a
representation of its current situation in the environment, which may be errorful
and normally misses many aspects of the "real" situation.
• Actions
At each time step, the agent has the option of executing one of a finite set of
possible actions, ut, each of which potentially puts it in a new state.
• Reinforcements: rewards and punishments
In response to the agent's action in a particular state, the world provides a
reinforcement.
• The reinforcement function in the world: r(x,u)
• The next-state function in the world: s(x,u)
• An example

Simple reinforcement learning
• The goal: to learn a value for each state-action pair
• One possibility

Q(xt, ut) = r(xt, ut)

• But this bases the value too much on a single instance of reinforcement. We need to
learn in smaller steps.

Qnew(xt, ut) = (1 - η)Qold(xt, ut) + η r(xt, ut)

where η is a learning rate between 0 and 1.

• But so far the algorithm can only learn in response to immediate reinforcement.
What about delayed reinforcement?

Q learning
• The real value of an action in a state (optimal Q) depends not only on immediate
reinforcement but also on reinforcements that can be received later as a result of
the next state the agent gets to.
• The value (estimate Q) that an agent stores for each state-action pair should
reflect how much reinforcement it will receive immediately and in the future if it
takes that action in that state.
• Policy: a way of using the stored Q values to select actions.
• More precisely, an optimal Q value for a given state and action is the sum of all
reinforcements received if that action is taken in the state, and then the agent
follows the optimal policy specified by the other Q values. A first definition:

Qopt(xt, ut) = r(xt, ut) + maxut+1[Qopt(xt+1, ut+1)]


• But this causes problems because there may be many, even an infinite number of,
future reinforcements. We need to weight the future by a discount rate (γ)
between 0 and 1.

Qopt(xt, ut) = r(xt, ut) + γ maxut+1[Qopt(xt+1, ut+1)]

• To approach optimal Q values, the learner starts with 0 or random values for each
state-action pair, then updates the values gradually using the reinforcement
received and what it thinks is the best Q value for the next state.

Qnew(xt, ut) = (1 - η)Qold(xt, ut) + η{r(xt, ut) + γ maxut+1[Qold(xt+1, ut+1)]}

An example
γ = 0.8, η = 0.5 and all Q values initialized at 0. In the chart, "new" means the
reinforcement received plus the discounted maximum value of the next state. The "new"
value is combined with the "old" using the learning rate to give the updated Q value
appearing in the next line of the chart. (Note: in this example, in order to illustrate how
the agent can learn to "look ahead", it is effectively picked up after it reaches the goal
state and dropped back in state 1. There is no "natural" way of reaching state 1 from state
4.)

x    Q(1,r)  Q(2,r)  Q(2,l)  Q(3,r)  Q(3,l)  Q(4,l)   new   u
1      0       0       0       0       0       0       0    r
2      0       0       0       0       0       0       0    r
3      0       0       0       0       0       0       1    r
4      0       0       0      .5       0       0       0    l
1      0       0       0      .5       0       0       0    r
2      0       0       0      .5       0       0      .4    r
3      0      .2       0      .5       0       0       1    r
4      0      .2       0     .75       0       0       0    l
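
A minimal sketch that reproduces the chart's values (Python; the corridor
environment, its reinforcement and next-state functions, and the fixed action
sequence are inferred from the chart and the note above, so they are partly
assumptions):

GAMMA, ETA = 0.8, 0.5

def reinforcement(x, u):
    return 1.0 if (x, u) == (3, "r") else 0.0     # reaching the goal state pays 1

def next_state(x, u):
    if x == 4:
        return 1                                  # the artificial reset described in the note
    return x + 1 if u == "r" else x - 1

actions = {1: ["r"], 2: ["r", "l"], 3: ["r", "l"], 4: ["l"]}
Q = {(x, u): 0.0 for x in actions for u in actions[x]}

x = 1
for u in ["r", "r", "r", "l", "r", "r", "r", "l"]:     # the action sequence in the chart
    x_next = next_state(x, u)
    new = reinforcement(x, u) + GAMMA * max(Q[(x_next, v)] for v in actions[x_next])
    Q[(x, u)] = (1 - ETA) * Q[(x, u)] + ETA * new
    print(x, u, round(new, 2), {k: round(v, 2) for k, v in Q.items()})
    x = x_next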

Making decisions
o How is the agent to pick an action? One possibility is exploitation, to pick
the action that has the highest Q-value for the current state.
o But the agent can only learn about the value of actions that it tries. Thus it
should try a variety of actions. This may involve ignoring what it thinks
is best some of the time: exploration.
o Exploration makes more sense early on in learning when the agent doesn't
know much.
o One possibility for selecting an action: pick the "best" action with
probability P = 1 - e^(-Ea), where a is the number of training samples (the
"age" of the agent) and E is a small constant. (Plots for E = 0.1 and for
E = 0.01 show how this probability grows with the agent's age.)

o A smarter possibility would be to have the probability of picking an action
depend on how high its value is relative to the values of all of the other
possible actions. Here is one way, a Boltzmann (softmax) distribution over
the Q values:

P(ut | xt) = e^Q(xt, ut) / ∑v e^Q(xt, v)

where the vs represent all of the possible actions in state xt.
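
A minimal sketch of this kind of action selection (Python/NumPy); the temperature
parameter is an added assumption that controls how strongly the probabilities favor
high-valued actions:

import numpy as np

def softmax_action(q_values, actions, temperature=1.0, rng=None):
    # pick an action with probability proportional to exp(Q / T); high-valued
    # actions are favored, but every action keeps some chance of being tried
    rng = rng or np.random.default_rng()
    prefs = np.array([q_values[u] / temperature for u in actions])
    prefs -= prefs.max()                          # for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return rng.choice(actions, p=probs)

# with Q(r) = 0.75 and Q(l) = 0.1, "r" is chosen most of the time but not always
print(softmax_action({"r": 0.75, "l": 0.1}, ["r", "l"]))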

Implementing Q learning
• A lookup table: a Q value for each state-action pair
• But in the real world the number of states may be very large, even infinite.
Distributed representations of states permit
o More efficient coding
o Generalization to novel states
• A neural network
o Inputs are distributed representations of states.
o Outputs are Q values for each action (represented locally).
o Weights represent associations of state features with actions.
o Error-driven learning: for the selected action, the target is the "new" Q
value from the Q learning rule.
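
A minimal sketch of building the training target for such a network (Python/NumPy);
copying the current outputs and overwriting only the selected action's entry is one
common way to drive error only for that action, and the numbers here are made up:

import numpy as np

def q_network_target(q_out, action_index, reinforcement, q_next, gamma=0.8):
    # target = current outputs, except the selected action gets the "new" value
    # r + gamma * max_a Q(next state, a) from the Q learning rule
    target = q_out.copy()
    target[action_index] = reinforcement + gamma * q_next.max()
    return target

q_out = np.array([0.2, 0.5])     # network outputs for the current state
q_next = np.array([0.1, 0.9])    # network outputs for the next state
print(q_network_target(q_out, 0, 0.0, q_next))   # -> [0.72, 0.5]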

Concept learning
Concepts (categories)
• Features
o Sufficient features and decision trees
o Necessary features
o Typical features
o Prohibited features
• Which features are relevant for concepts (of particular types)?
• Where does the set of features come from?
• One-shot learning
• Quine's problem

Positive and negative examples, generalization and specialization
• Evolving hypotheses
o Current best guess
o Current set consistent with examples
• Generalization
o In response to a positive example that the current hypothesis says is not an
example: a false negative
o Variabilization
o Assigning a more general type to an element
o Dropping features from a conjunction of features
o Making a disjunction of features
• Specialization
o In response to a negative example that the current hypothesis says is an
example: a false positive
o The value of near misses
o Adding features to a conjunction of features
o Assigning a more specific type to an element
o Prohibiting a feature
• Generalization (left) and specialization (right) illustrated
• An example of current-best-guess learning (that fails): SAME-COLOR
o A positive example
Object 1 is to the left of object 2.
Object 1 is a square.
Object 2 is a square.
Object 1 is red.
Object 2 is red.

o Another positive example
Object 1 is to the left of object 2.
Object 1 is a rectangle.

o A negative example
Object 1 is to the left of object 2.
Object 1 is a rectangle.
Object 2 must not be yellow.

o The learner fails because of the inadequacy of the representation; learning
relies in part on perception.

Version Space learning (Mitchell)
• Incremental learning vs. batch learning
• Maintaining the set of all hypotheses that are consistent with the set of positive
and negative examples so far
• Version Space learning: incremental, least-commitment algorithm (makes no
arbitrary choices)
• Version space: set of all hypotheses consistent with examples seen so far
• Version graph: directed acyclic graph in which nodes are elements of version
space and there is an arc from node p to node q iff p is less general than q and
there is no node r that is more general than p and less general than q.
• Positive examples lead to the elimination of hypotheses that are too specific.
• Negative examples lead to the elimination of hypotheses that are too general.
• Problem of size of version graph
• Solution: use of boundary sets, sets of hypotheses defining boundaries on which
hypotheses are consistent with examples
o Most general boundary set (G-set): every member consistent with
examples, and there are no more general consistent hypotheses
o Most specific boundary set (S-set): every member consistent with
examples, and there are no more specific consistent hypotheses
• Updating the boundary sets, given a new example (the rules for G-set and S-set
are symmetric)
o G-set
 If the example is positive, exclude any hypotheses that do not
cover the example.
 If the example is negative, find the most general set within or
below the old G-set and above the S-set which fails to cover the
example.
o S-set
 If the example is positive, find the most specific set within or
above the old S-set and below the G-set which covers the example.
 If the example is negative, exclude any hypotheses that cover the
example.
• Update until
o there is exactly one concept left in the version space or
o the version space collapses--either the S-set or G-set becomes empty or
o there are no more examples.
• Example
• Another detailed example, from a class by Julian Francis Miller when he was at
the University of Birmingham

ID3 (decision tree) learning (Quinlan)
• ID3: batch learning of concepts using decision trees
• A set of classified examples
       class   body covering   habitat   flies?   breathes air?
  1      m     hair            land      no       yes
  2      m     hair            land      yes      yes
  3      m     other           water     no       yes
  4      b     feathers        land      yes      yes
  5      b     feathers        land      no       yes
  6      f     scales          water     no       no
  7      f     scales          water     yes      no
  8      f     other           water     no       no
  9      f     scales          water     no       yes
 10      r     scales          land      no       yes
• Alternate decision trees that successfully classify the data, differing in number of
decisions

• Creating a decision tree
For each decision point,
o If the remaining examples all have the same classification, we're done.
o Else if there are examples with more than one classification left and attributes
left, pick the remaining attribute which is the "most important", the one which tends
to divide the remaining examples into homogeneous sets
o Else if there are no examples left, no such example has been observed;
return default
o Else if there are no attributes left, examples with the same description
have different classifications: noise or insufficient attributes or
nondeterministic domain
• Selection of "most important" attribute
o For possible answers v to a question with probabilities P(vi), the
information content of the answer is

I(P(v1), ... , P(vn)) = ∑i=1..n −P(vi) log2P(vi)

o In our example the information content of a tree that classifies the data is

I(P(m), P(b), P(f), P(r)) =
−P(m) log2P(m) − P(b) log2P(b) − P(f) log2P(f) − P(r) log2P(r) =
−0.3 log20.3 − 0.2 log20.2 − 0.4 log20.4 − 0.1 log20.1 =
(0.3)(1.73) + (0.2)(2.32) + (0.4)(1.32) + (0.1)(3.32) = 1.843

o Each decision point adds to the information content. To determine the
information gain, we subtract the information content left after the
decision from the total. The information content left after the decision is
the weighted sum of the information content in each of the groups
resulting from the decision. At each decision point, the algorithm picks the
attribute of those left that maximizes the information gain.
o For the first decision point, the four attributes classify the examples in this
way:
o body covering: hair(M: 2), feathers(B: 2), scales(F: 3, R:
1), other(M: 1, F: 1)
o habitat: land(M: 2, B: 2, R: 1), water(M:1, F: 4)
o flies?: yes(M:1, B: 1, F: 1), no(M: 2, B: 1, F: 3, R: 1)
o breathes air?: yes(M: 3, B: 2, F: 1, R: 1), no(F: 3)
o For each of these, the information content remaining after the decision is:
o body covering: (0.2)(0) + (0.2)(0) + (0.4)(0.811) +
0.2(1.0) = 0.524
o habitat: (0.5)(1.524) + (0.5)(0.722) = 1.123
o flies?: (0.3)(1.586) + (0.7)(1.845) = 1.767
o breathes air?: (0.7)(1.845) + (0.3)(0) = 1.291

Therefore the appropriate first choice is the "body covering" attribute.

o For the "scales" branch below this decision point, the three remaining
attributes classify the four examples as follows
o habitat: land(R: 1), water(F: 3)
o flies?: yes(F: 1), no(F: 2, R: 1)
o breathes air?: yes(F: 1, R: 1), no(F: 2)
o For each of these, the information content remaining after the decision is:
o habitat: (0.1)(0) + (0.3)(0) = 0
o flies?: (0.1)(0) + (0.3)(0.918) = 0.275
o breathes air?: (0.2)(1) + (0.2)(0) = 0.2

Therefore "habitat" is the right choice for this branch.

Some implications of AI
Philosophy and cognitive science
• Is intelligence something that can be defined and studied abstractly, independently
of the details of the intelligent agent? Does intelligence require a body?
• Are there different intelligences (logics?), each with its own advantages for a
given environment, that could collaborate in solving problems?
• Where does the mind stop and the "outside world" begin?
• Are there fundamental differences between human and "animal" intelligence, or
are all of the differences just a matter of degree?
• How much of human intelligence is cultural, as opposed to genetic?

Society
• How can AI give certain groups of people (more) power over other groups of
people?
• Who benefits from applied AI (medicine, law, design, education, information
retrieval, natural language processing, stock market, marketing, transportation,
entertainment, agriculture, military)?
• Who funds AI research?
• How might AI be used to further
o free, independent media; equal access to information; a better informed
public?
o a public capable of making rational economic and political decisions?
o international (intercultural) understanding (tolerance)?
o equal access to (or equitable distribution of) the world's resources?
o protection of the environment?

Artificial neural network


An artificial neural network (ANN) or commonly just neural network (NN) is an
interconnected group of artificial neurons that uses a mathematical model or
computational model for information processing based on a connectionist approach to
computation. In most cases an ANN is an adaptive system that changes its structure based
on external or internal information that flows through the network.
(The term "neural network" can also mean biological-type systems.)

In more practical terms neural networks are non-linear statistical data modeling tools.
They can be used to model complex relationships between inputs and outputs or to find
patterns in data.

A neural network is an interconnected group of nodes, akin to the vast network of
neurons in the human brain.

More complex neural networks are often used in Parallel Distributed Processing.

Background
There is no precise agreed definition among researchers as to what a neural network is,
but most would agree that it involves a network of simple processing elements (neurons)
which can exhibit complex global behavior, determined by the connections between the
processing elements and element parameters. The original inspiration for the technique
was from examination of the central nervous system and the neurons (and their axons,
dendrites and synapses) which constitute one of its most significant information
processing elements (see Neuroscience). In a neural network model, simple nodes (called
variously "neurons", "neurodes", "PEs" ("processing elements") or "units") are connected
together to form a network of nodes — hence the term "neural network." While a neural
network does not have to be adaptive per se, its practical use comes with algorithms
designed to alter the strength (weights) of the connections in the network to produce a
desired signal flow.

These networks are also similar to the biological neural networks in the sense that
functions are performed collectively and in parallel by the units, rather than there being a
clear delineation of subtasks to which various units are assigned (see also
connectionism). Currently, the term Artificial Neural Network (ANN) tends to refer
mostly to neural network models employed in statistics, cognitive psychology and
artificial intelligence. Neural network models designed with emulation of the central
nervous system (CNS) in mind are a subject of theoretical neuroscience.

In modern software implementations of artificial neural networks the approach inspired
by biology has more or less been abandoned for a more practical approach based on
statistics and signal processing. In some of these systems neural networks, or parts of
neural networks (such as artificial neurons) are used as components in larger systems that
combine both adaptive and non-adaptive elements. While the more general approach of
such adaptive systems is more suitable for real-world problem solving, it has far less to
do with the traditional artificial intelligence connectionist models. What they do however
have in common is the principle of non-linear, distributed, parallel and local processing
and adaptation.

Models

Neural network models in artificial intelligence are usually referred to as artificial neural
networks (ANNs); these are essentially simple mathematical models defining a function
f : X → Y. Each type of ANN model corresponds to a class of such functions.

The network in artificial neural network

The word network in the term 'artificial neural network' arises because the function f(x) is
defined as a composition of other functions gi(x), which can further be defined as a
composition of other functions. This can be conveniently represented as a network
structure, with arrows depicting the dependencies between variables. A widely used type
of composition is the nonlinear weighted sum, f(x) = K(∑i wi gi(x)),
where K is some predefined function, such as the hyperbolic tangent. It will be
convenient for the following to refer to a collection of functions gi as simply a vector
g = (g1, g2, ..., gn).
ANN dependency graph

This figure depicts such a decomposition of f, with dependencies between variables
indicated by arrows. These can be interpreted in two ways.

The first view is the functional view: the input x is transformed into a 3-dimensional
vector h, which is then transformed into a 2-dimensional vector g, which is finally
transformed into f. This view is most commonly encountered in the context of
optimization.

The second view is the probabilistic view: the random variable F = f(G) depends upon the
random variable G = g(H), which depends upon H = h(X), which depends upon the
random variable X. This view is most commonly encountered in the context of graphical
models.

The two views are largely equivalent. In either case, for this particular network
architecture, the components of individual layers are independent of each other (e.g., the
components of g are independent of each other given their input h). This naturally enables
a degree of parallelism in the implementation.

Recurrent ANN dependency graph

Networks such as the previous one are commonly called feedforward, because their graph
is a directed acyclic graph. Networks with cycles are commonly called recurrent. Such
networks are commonly depicted in the manner shown at the top of the figure, where f is
shown as being dependent upon itself. However, there is an implied temporal dependence
which is not shown. What this actually means in practice is that the value of f at some
point in time t depends upon the values of f at zero or at one or more other points in time.
The graphical model at the bottom of the figure illustrates the case: the value of f at time t
only depends upon its last value. Models such as these, which have no dependencies in
the future, are called causal models.
See also: graphical models

Learning

However interesting such functions may be in themselves, what has attracted the most
interest in neural networks is the possibility of learning, which in practice means the
following:

Given a specific task to solve, and a class of functions F, learning means using a set of
observations in order to find f* ∈ F which solves the task in an optimal sense.

This entails defining a cost function C : F → R such that, for the optimal solution f*,
C(f*) ≤ C(f) for all f ∈ F (no solution has a cost less than the cost of the optimal
solution).

The cost function C is an important concept in learning, as it is a measure of how far
away we are from an optimal solution to the problem that we want to solve. Learning
algorithms search through the solution space in order to find a function that has the
smallest possible cost.

For applications where the solution is dependent on some data, the cost must necessarily
be a function of the observations, otherwise we would not be modelling anything related
to the data. It is frequently defined as a statistic to which only approximations can be
made. As a simple example consider the problem of finding the model f which minimizes
C = E[(f(x) - y)2], for data pairs (x, y) drawn from some distribution D. In
practical situations we would only have N samples from D and thus, for the above
example, we would only minimize C = (1/N) ∑i=1..N (f(xi) - yi)2. Thus, the cost is
minimized over a sample of the data rather than the true data distribution.

When N → ∞ some form of online learning must be used, where the cost is partially
minimized as each new example is seen. While online learning is often used when D is
fixed, it is most useful in the case where the distribution changes slowly over time. In
neural network methods, some form of online learning is frequently also used for finite
datasets.

See also: Optimization (mathematics), Statistical Estimation, Machine Learning


Choosing a cost function

While it is possible to arbitrarily define some ad hoc cost function, frequently a particular
cost will be used either because it has desirable properties (such as convexity) or because
it arises naturally from a particular formulation of the problem (i.e., In a probabilistic
formulation the posterior probability of the model can be used as an inverse cost).
Ultimately, the cost function will depend on the task we wish to perform. The three main
categories of learning tasks are overviewed below.

Learning paradigms

There are three major learning paradigms, each corresponding to a particular abstract
learning task. These are supervised learning, unsupervised learning and reinforcement
learning. Usually any given type of network architecture can be employed in any of those
tasks.

Supervised learning

In supervised learning, we are given a set of example pairs (x, y), x ∈ X, y ∈ Y, and
the aim is to find a function f : X → Y in the allowed class of functions that matches the examples.
In other words, we wish to infer the mapping implied by the data; the cost function is
related to the mismatch between our mapping and the data and it implicitly contains prior
knowledge about the problem domain.

A commonly used cost is the mean-squared error which tries to minimise the average
error between the network's output, f(x), and the target value y over all the example pairs.
When one tries to minimise this cost using gradient descent for the class of neural
networks called Multi-Layer Perceptrons, one obtains the well-known backpropagation
algorithm for training neural networks.

Tasks that fall within the paradigm of supervised learning are pattern recognition (also
known as classification) and regression (also known as function approximation). The
supervised learning paradigm is also applicable to sequential data (e.g., for speech and
gesture recognition). This can be thought of as learning with a "teacher," in the form of a
function that provides continuous feedback on the quality of solutions obtained thus far.

Unsupervised learning

In unsupervised learning we are given some data x, and the cost function to be minimised
can be any function of the data x and the network's output, f.

The cost function is dependent on the task (what we are trying to model) and our a priori
assumptions (the implicit properties of our model, its parameters and the observed
variables).
As a trivial example, consider the model f(x) = a, where a is a constant and the cost C =
(E[x] − f(x))2. Minimising this cost will give us a value of a that is equal to the mean of
the data. The cost function can be much more complicated. Its form depends on the
application: For example in compression it could be related to the mutual information
between x and y. In statistical modelling, it could be related to the posterior probability of
the model given the data. (Note that in both of those examples those quantities would be
maximised rather than minimised)

Tasks that fall within the paradigm of unsupervised learning are in general estimation
problems; the applications include clustering, the estimation of statistical distributions,
compression and filtering.

Reinforcement learning

In reinforcement learning, data x is usually not given, but generated by an agent's
interactions with the environment. At each point in time t, the agent performs an action yt
and the environment generates an observation xt and an instantaneous cost ct, according to
some (usually unknown) dynamics. The aim is to discover a policy for selecting actions
that minimises some measure of a long-term cost, i.e. the expected cumulative cost. The
environment's dynamics and the long-term cost for each policy are usually unknown, but
can be estimated.

More formally, the environment is modeled as a Markov decision process (MDP) with
states s1, ..., sn ∈ S and actions a1, ..., am ∈ A with the following probability
distributions: the instantaneous cost distribution P(ct | st), the observation distribution P(xt
| st) and the transition P(st+1 | st, at), while a policy is defined as the conditional distribution
over actions given the observations. Taken together, the two define a Markov chain (MC).
The aim is to discover the policy that minimises the cost, i.e. the MC for which the cost is
minimal.

ANNs are frequently used in reinforcement learning as part of the overall algorithm.

Tasks that fall within the paradigm of reinforcement learning are control problems, games
and other sequential decision making tasks.

See also: dynamic programming, stochastic control

Learning algorithms

Training a neural network model essentially means selecting one model from the set of
allowed models (or, in a Bayesian framework, determining a distribution over the set of
allowed models) that minimises the cost criterion. There are numerous algorithms
available for training neural network models; most of them can be viewed as a
straightforward application of optimization theory and statistical estimation.
Most of the algorithms used in training artificial neural networks are employing some
form of gradient descent. This is done by simply taking the derivative of the cost function
with respect to the network parameters and then changing those parameters in a gradient-
related direction.

Evolutionary methods, simulated annealing, expectation-maximization, and non-
parametric methods are among other commonly used methods for training neural
networks. See also machine learning.

Employing artificial neural networks


Perhaps the greatest advantage of ANNs is their ability to be used as an arbitrary function
approximation mechanism which 'learns' from observed data. However, using them is not
so straightforward and a relatively good understanding of the underlying theory is
essential.

• Choice of model: This will depend on the data representation and the application.
Overly complex models tend to lead to problems with learning.
• Learning algorithm: There are numerous tradeoffs between learning algorithms.
Almost any algorithm will work well with the correct hyperparameters for
training on a particular fixed dataset. However selecting and tuning an algorithm
for training on unseen data requires a significant amount of experimentation.
• Robustness: If the model, cost function and learning algorithm are selected
appropriately the resulting ANN can be extremely robust.

With the correct implementation ANNs can be used naturally in online learning and large
dataset applications. Their simple implementation and the existence of mostly local
dependencies exhibited in the structure allows for fast, parallel implementations in
hardware.

Applications
The utility of artificial neural network models lies in the fact that they can be used to
infer a function from observations. This is particularly useful in applications where the
complexity of the data or task makes the design of such a function by hand impractical.

Real life applications

The tasks to which artificial neural networks are applied tend to fall within the following
broad categories:

• Function approximation, or regression analysis, including time series prediction
and modeling.
• Classification, including pattern and sequence recognition, novelty detection and
sequential decision making.
• Data processing, including filtering, clustering, blind source separation and
compression.

Application areas include system identification and control (vehicle control, process
control), game-playing and decision making (backgammon, chess, racing), pattern
recognition (radar systems, face identification, object recognition and more), sequence
recognition (gesture, speech, handwritten text recognition), medical diagnosis, financial
applications, data mining (or knowledge discovery in databases, "KDD"), visualization
and e-mail spam filtering.

Neural network software


Main article: Neural network software

Neural network software is used to simulate, research, develop and apply artificial
neural networks, biological neural networks and in some cases a wider array of adaptive
systems.

Types of neural networks

Feedforward neural network

The feedforward neural networks are the first and arguably simplest type of artificial
neural networks devised. In this network, the information moves in only one direction,
forward, from the input nodes, through the hidden nodes (if any) and to the output nodes.
There are no cycles or loops in the network.

Single-layer perceptron

The earliest kind of neural network is a single-layer perceptron network, which consists
of a single layer of output nodes; the inputs are fed directly to the outputs via a series of
weights. In this way it can be considered the simplest kind of feed-forward network. The
sum of the products of the weights and the inputs is calculated in each node, and if the
value is above some threshold (typically 0) the neuron fires and takes the activated value
(typically 1); otherwise it takes the deactivated value (typically -1). Neurons with this
kind of activation function are also called McCulloch-Pitts neurons or threshold neurons.
In the literature the term perceptron often refers to networks consisting of just one of
these units. They were described by Warren McCulloch and Walter Pitts in the 1940s.

A perceptron can be created using any values for the activated and deactivated states as
long as the threshold value lies between the two. Most perceptrons have outputs of 1 or -1
with a threshold of 0 and there is some evidence that such networks can be trained more
quickly than networks created from nodes with different activation and deactivation
values.
Perceptrons can be trained by a simple learning algorithm that is usually called the delta
rule. It calculates the errors between calculated output and sample output data, and uses
this to create an adjustment to the weights, thus implementing a form of gradient descent.

Single-unit perceptrons are only capable of learning linearly separable patterns; in 1969
in a famous monograph entitled Perceptrons Marvin Minsky and Seymour Papert showed
that it was impossible for a single-layer perceptron network to learn an XOR function.
They conjectured (incorrectly) that a similar result would hold for a multi-layer
perceptron network. Although a single threshold unit is quite limited in its computational
power, it has been shown that networks of parallel threshold units can approximate any
continuous function from a compact interval of the real numbers into the interval [-1,1].
This very recent result can be found in [Auer, Burgsteiner, Maass: The p-delta learning
rule for parallel perceptrons, 2001 (state Jan 2003: submitted for publication)].

A single-layer neural network can compute a continuous output instead of a step function.
A common choice is the so-called logistic function:

y = 1/(1 + e^(-x))

With this choice, the single-layer network is identical to the logistic regression model,
widely used in statistical modelling. The logistic function is also known as the sigmoid
function. It has a continuous derivative, which allows it to be used in backpropagation.
This function is also preferred because its derivative is easily calculated:

y' = y(1 − y)
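
A minimal sketch of the logistic output and its derivative (Python/NumPy; the
weights and input are made up):

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))     # y = 1 / (1 + e^(-z))

def logistic_derivative(z):
    y = logistic(z)
    return y * (1.0 - y)                # y' = y (1 - y)

# a single-layer network with continuous output: y = logistic(w . x + b)
w, b = np.array([0.8, -0.4]), 0.1
x = np.array([1.0, 2.0])
z = w @ x + b
print(logistic(z), logistic_derivative(z))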

Multi-layer perceptron

A two-layer neural network capable of calculating XOR. The numbers within the neurons
represent each neuron's explicit threshold (which can be factored out so that all neurons
have the same threshold, usually 1). The numbers that annotate arrows represent the
weight of the inputs. This net assumes that if the threshold is not reached, zero (not -1) is
output. Note that the bottom layer of inputs is not always considered a real neural
network layer.

This class of networks consists of multiple layers of computational units, usually
interconnected in a feed-forward way. Each neuron in one layer has directed connections
to the neurons of the subsequent layer. In many applications the units of these networks
apply a sigmoid function as an activation function.

The universal approximation theorem for neural networks states that every continuous
function that maps intervals of real numbers to some output interval of real numbers can
be approximated arbitrarily closely by a multi-layer perceptron with just one hidden
layer. This result holds only for restricted classes of activation functions, e.g. for the
sigmoidal functions.

Multi-layer networks use a variety of learning techniques, the most popular being back-
propagation. Here the output values are compared with the correct answer to compute the
value of some predefined error-function. By various techniques the error is then fed back
through the network. Using this information, the algorithm adjusts the weights of each
connection in order to reduce the value of the error function by some small amount. After
repeating this process for a sufficiently large number of training cycles the network will
usually converge to some state where the error of the calculations is small. In this case
one says that the network has learned a certain target function. To adjust weights properly
one applies a general method for non-linear optimization that is called gradient descent.
For this, the derivative of the error function with respect to the network weights is
calculated and the weights are then changed such that the error decreases (thus going
downhill on the surface of the error function). For this reason back-propagation can only
be applied on networks with differentiable activation functions.

In general the problem of teaching a network to perform well, even on samples that were
not used as training samples, is a quite subtle issue that requires additional techniques.
This is especially important for cases where only very limited numbers of training
samples are available. The danger is that the network overfits the training data and fails to
capture the true statistical process generating the data. Computational learning theory is
concerned with training classifiers on a limited amount of data. In the context of neural
networks a simple heuristic, called early stopping, often ensures that the network will
generalize well to examples not in the training set.

Other typical problems of the back-propagation algorithm are the speed of convergence
and the possibility of ending up in a local minimum of the error function. Today there are
practical solutions that make back-propagation in multi-layer perceptrons the solution of
choice for many machine learning tasks.

ADALINE

ADALINE stands for Adaptive Linear Neuron (later, Adaptive Linear Element). It was developed by
Professor Bernard Widrow and his graduate student Ted Hoff at Stanford University in
1960. It is based on the McCulloch-Pitts model. It consists of a weight, a bias and a
summation function.

Operation: yi = wxi + b

Its adaptation is defined through a cost function (error metric) of the residual e = di − (b +
wxi), where di is the desired output. With the MSE error metric, the weight and bias are
adapted so as to minimize the mean squared residual (an iterative LMS sketch appears
below).

While this makes the Adaline capable of simple linear regression, it has limited
practical use.

There is an extension of the Adaline, called the Multiple Adaline (MADALINE) that
consists of two or more adalines serially connected.
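
A minimal sketch of adapting an Adaline by iterative LMS (Python/NumPy); for this
noiseless made-up data the result matches the closed-form least-squares solution,
but the learning rate, number of passes, and data are assumptions:

import numpy as np

def adaline_lms(xs, ds, eta=0.02, epochs=1000):
    # nudge w and b along the residual e = d - (b + w*x) to reduce the MSE
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, d in zip(xs, ds):
            e = d - (b + w * x)
            w += eta * e * x
            b += eta * e
    return w, b

xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ds = 2 * xs + 1                      # targets lying exactly on d = 2x + 1
print(adaline_lms(xs, ds))           # approaches (2.0, 1.0)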

Radial basis function (RBF) network


Main article: Radial basis function network

Radial Basis Functions are powerful techniques for interpolation in multidimensional
space. An RBF is a function which has a built-in distance criterion with respect to a
centre. Radial basis functions have been applied in the area of neural networks where
they may be used as a replacement for the sigmoidal hidden layer transfer characteristic
in multi-layer perceptrons. RBF networks have 2 layers of processing: In the first, input is
mapped onto each RBF in the 'hidden' layer. The RBF chosen is usually a Gaussian. In
regression problems the output layer is then a linear combination of hidden layer values
representing mean predicted output. The interpretation of this output layer value is the
same as a regression model in statistics. In classification problems the output layer is
typically a sigmoid function of a linear combination of hidden layer values, representing
a posterior probability. Performance in both cases is often improved by shrinkage
techniques, known as ridge regression in classical statistics and known to correspond to a
prior belief in small parameter values (and therefore smooth output functions) in a
Bayesian framework.

RBF networks have the advantage of not suffering from local minima in the same way as
multi-layer perceptrons. This is because the only parameters that are adjusted in the
learning process are the linear mapping from hidden layer to output layer. Linearity
ensures that the error surface is quadratic and therefore has a single easily found
minimum. In regression problems this can be found in one matrix operation. In
classification problems the fixed non-linearity introduced by the sigmoid output function
is most efficiently dealt with using iterated reweighted least squares.
RBF networks have the disadvantage of requiring good coverage of the input space by
radial basis functions. RBF centres are determined with reference to the distribution of
the input data, but without reference to the prediction task. As a result, representational
resources may be wasted on areas of the input space that are irrelevant to the learning
task. A common solution is to associate each data point with its own centre, although this
can make the linear system to be solved in the final layer rather large, and requires
shrinkage techniques to avoid overfitting.

Associating each input datum with an RBF leads naturally to kernel methods such as
Support Vector Machines and Gaussian Processes (the RBF is the kernel function). All
three approaches use a non-linear kernel function to project the input data into a space
where the learning problem can be solved using a linear model. Like Gaussian Processes,
and unlike SVMs, RBF networks are typically trained in a Maximum Likelihood
framework by maximizing the probability (minimizing the error) of the data under the
model. SVMs take a different approach to avoiding overfitting by maximizing instead a
margin. RBF networks are outperformed in most classification applications by SVMs. In
regression applications they can be competitive when the dimensionality of the input
space is relatively small.

Kohonen self-organizing network

The self-organizing map (SOM) invented by Teuvo Kohonen uses a form of unsupervised
learning. A set of artificial neurons learn to map points in an input space to coordinates in
an output space. The input space can have different dimensions and topology from the
output space, and the SOM will attempt to preserve these.

Recurrent network

Contrary to feedforward networks, recurrent neural networks (RNs) are models with bi-
directional data flow. While a feedforward network propagates data linearly from input to
output, RNs also propagate data from later processing stages to earlier stages.

Simple recurrent network

A simple recurrent network (SRN) is a variation on the multi-layer perceptron, sometimes
called an "Elman network" due to its invention by Jeff Elman. A three-layer network is
used, with the addition of a set of "context units" in the input layer. There are connections
from the middle (hidden) layer to these context units fixed with a weight of one. At each
time step, the input is propagated in a standard feed-forward fashion, and then a learning
rule (usually back-propagation) is applied. The fixed back connections result in the
context units always maintaining a copy of the previous values of the hidden units (since
they propagate over the connections before the learning rule is applied). Thus the network
can maintain a sort of state, allowing it to perform such tasks as sequence-prediction that
are beyond the power of a standard multi-layer perceptron.
In a fully recurrent network, every neuron receives inputs from every other neuron in the
network. These networks are not arranged in layers. Usually only a subset of the neurons
receive external inputs in addition to the inputs from all the other neurons, and another
disjunct subset of neurons report their output externally as well as sending it to all the
neurons. These distinctive inputs and outputs perform the function of the input and output
layers of a feed-forward or simple recurrent network, and also join all the other neurons
in the recurrent processing.

Hopfield network

The Hopfield network is a recurrent neural network in which all connections are
symmetric. Invented by John Hopfield in 1982, this network guarantees that its dynamics
will converge. If the connections are trained using Hebbian learning then the Hopfield
network can perform as robust content-addressable memory, resistant to connection
alteration.

Echo State Network

The Echo State Network (ESN) is a recurrent neural network with a sparsely connected
random hidden layer. The weights of output neurons are the only part of the network that
can change and be learned. ESN are good to (re)produce temporal patterns.

Stochastic neural networks

A stochastic neural network differs from a regular neural network in the fact that it
introduces random variations into the network. In a probabilistic view of neural networks,
such random variations can be viewed as a form of statistical sampling, such as Monte
Carlo sampling.

Boltzmann machine

The Boltzmann machine can be thought of as a noisy Hopfield network. Invented by
Geoff Hinton and Terry Sejnowski in 1985, the Boltzmann machine is important because
it is one of the first neural networks to demonstrate learning of latent variables (hidden
units). Boltzmann machine learning was at first slow to simulate, but the contrastive
divergence algorithm of Geoff Hinton (circa 2000) allows models such as Boltzmann
machines and products of experts to be trained much faster.

Modular neural networks

Biological studies showed that the human brain functions not as a single massive
network, but as a collection of small networks. This realisation gave birth to the concept
of modular neural networks, in which several small networks cooperate or compete to
solve problems.
Committee of machines

A committee of machines (CoM) is a collection of different neural networks that together
"vote" on a given example. This generally gives a much better result compared to other
neural network models. In fact in many cases, starting with the same architecture and
training but using different initial random weights gives vastly different networks. A CoM
tends to stabilize the result.

The CoM is similar to the general machine learning bagging method, except that the
necessary variety of machines in the committee is obtained by training from different
random starting weights rather than training on different randomly selected subsets of the
training data.

Associative Neural Network (ASNN)

The ASNN is an extension of the committee of machines that goes beyond a
simple/weighted average of different models. ASNN represents a combination of an
ensemble of feed-forward neural networks and the k-nearest neighbour technique (kNN).
It uses the correlation between ensemble responses as a measure of distance amid the
analysed cases for the kNN. This corrects the bias of the neural network ensemble. An
associative neural network has a memory that can coincide with the training set. If new
data becomes available, the network instantly improves its predictive ability and provides
data approximation (self-learn the data) without a need to retrain the ensemble. Another
important feature of ASNN is the possibility to interpret neural network results by
analysis of correlations between data cases in the space of models. The method is
demonstrated at www.vcclab.org, where you can either use it online or download it.

Other types of networks

These special networks do not fit in any of the previous categories.

Holographic associative memory

Holographic associative memory represents a family of analog, correlation-based,
associative, stimulus-response memories, where information is mapped onto the phase
orientation of complex numbers operating.

Instantaneously trained networks

Instantaneously trained neural networks (ITNNs) were inspired by the phenomenon of
short-term learning that seems to occur instantaneously. In these networks the weights of
the hidden and the output layers are mapped directly from the training vector data.
Ordinarily, they work on binary data, but versions for continuous data that require small
additional processing are also available.
Spiking neural networks

Spiking neural networks (SNNs) are models which explicitly take into account the timing
of inputs. The network input and output are usually represented as series of spikes (delta
function or more complex shapes). SNNs have an advantage of being able to
continuously process information. They are often implemented as recurrent networks.

Networks of spiking neurons -- and the temporal correlations of neural assemblies in such
networks -- have been used to model figure/ground separation and region linking in the
visual system (see e.g. Reitboeck et al. in Haken and Stadler: Synergetics of the Brain.
Berlin, 1989).

Gerstner and Kistler have a freely-available online textbook on Spiking Neuron Models.

Spiking neural networks with axonal conduction delays exhibit polychronisation, and
hence could have a potentially unlimited memory capacity.

In June 2005 IBM announced construction of a Blue Gene supercomputer dedicated to
the simulation of a large recurrent spiking neural network [1].

Dynamic neural networks

Dynamic neural networks not only deal with nonlinear multivariate behaviour, but also
include (learning of) time-dependent behaviour such as various transient phenomena and
delay effects. Meijer has a Ph.D. thesis online where regular feedforward perceptron
networks are generalized with differential equations, using variable time step algorithms
for learning in the time domain and including algorithms for learning in the frequency
domain (in that case linearized around a set of static bias points).

Cascading neural networks

Cascade-Correlation is an architecture and supervised learning algorithm developed by
Scott Fahlman and Christian Lebiere. Instead of just adjusting the weights in a network of
fixed topology, Cascade-Correlation begins with a minimal network, then automatically
trains and adds new hidden units one by one, creating a multi-layer structure. Once a new
hidden unit has been added to the network, its input-side weights are frozen. This unit
then becomes a permanent feature-detector in the network, available for producing
outputs or for creating other, more complex feature detectors. The Cascade-Correlation
architecture has several advantages over existing algorithms: it learns very quickly, the
network determines its own size and topology, it retains the structures it has built even if
the training set changes, and it requires no back-propagation of error signals through the
connections of the network.
Neuro-fuzzy networks

A neuro-fuzzy network is a fuzzy inference system in the body of an artificial neural
network. Depending on the FIS type, there are several layers that simulate the processes
involved in a fuzzy inference like fuzzification, inference, aggregation and
defuzzification. Embedding an FIS in a general structure of an ANN has the benefit of
using available ANN training methods to find the parameters of a fuzzy system.

Theoretical properties


Capacity

Artificial neural network models have a property called 'capacity', which roughly
corresponds to their ability to model any given function. It is related to the amount of
information that can be stored in the network and to the notion of complexity.

Convergence

Nothing can be said in general about convergence since it depends on a number of
factors. Firstly, there may exist many local minima. This depends on the cost function and
the model. Secondly, the optimization method used might not be guaranteed to converge
when far away from a local minimum. Thirdly, for a very large amount of data or
parameters, some methods become impractical. In general, it has been found that
theoretical guarantees regarding convergence are not always a very reliable guide to
practical application.

Generalisation and statistics

In applications where the goal is to create a system that generalises well on unseen
examples, the problem of overtraining has emerged. This arises in overcomplex or
overspecified systems when the capacity of the network significantly exceeds the needed
free parameters. There are two schools of thought for avoiding this problem: The first is
to use cross-validation and similar techniques to check for the presence of overtraining
and optimally select hyperparameters such as to minimise the generalisation error. The
second is to use some form of regularisation. This is a concept that emerges naturally in a
probabilistic (Bayesian) framework, where the regularisation can be performed by putting
a larger prior probability over simpler models; but also in statistical learning theory,
where the goal is to minimise over two quantities: the 'empirical risk' and the 'structural
risk', which roughly correspond to the error over the training set and the predicted error in
unseen data due to overfitting.
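To make the regularisation idea concrete, here is a small sketch (my own; the data and the penalty strength are made up) that adds an L2 weight penalty to the empirical error and minimises the sum by gradient descent:

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(50)

lam = 0.1        # larger lam = stronger preference for "simpler" (smaller-weight) models
w = np.zeros(10)
lr = 0.01
for _ in range(500):
    # gradient of: empirical risk (MSE on the training set) + lam * ||w||^2
    grad = 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w
    w -= lr * grad
print(np.round(w, 2))

The penalty term plays the role of the "larger prior probability over simpler models" mentioned above: it trades a little training error for smaller weights.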
Confidence analysis of a neural network

Supervised neural networks that use an MSE cost function can use formal statistical
methods to determine the confidence of the trained model. The MSE on a validation set
can be used as an estimate for variance. This value can then be used to calculate the
confidence interval of the output of the network, assuming a normal distribution. A
confidence analysis made this way is statistically valid as long as the output probability
distribution stays the same and the network is not modified.
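A minimal sketch of that calculation in Python (my own illustration; the validation pairs and the 1.96 multiplier for a roughly 95% interval are assumptions, not values from the text):

import math

val_pairs = [(0.9, 1.0), (0.2, 0.0), (0.7, 0.8), (0.4, 0.5)]    # hypothetical (prediction, target) pairs

mse = sum((p - t) ** 2 for p, t in val_pairs) / len(val_pairs)  # validation MSE used as a variance estimate
sigma = math.sqrt(mse)

prediction = 0.65                 # some new network output
half_width = 1.96 * sigma         # ~95% interval under the normality assumption
print(f"{prediction:.2f} +/- {half_width:.2f}")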

By assigning a softmax activation function on the output layer of the neural network (or a
softmax component in a component-based neural network) for categorical target
variables, the outputs can be interpreted as posterior probabilities. This is very useful in
classification as it gives a certainty measure on classifications.

The softmax activation function: y_i = exp(x_i) / sum_j exp(x_j), where x_i is the network's raw output for class i and y_i is the corresponding probability.
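As a concrete illustration (mine, not part of the original text), a numerically stable version of this function in Python:

import math

def softmax(scores):
    m = max(scores)                        # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]       # outputs sum to 1 and can be read as posteriors

print(softmax([2.0, 1.0, 0.1]))   # roughly [0.66, 0.24, 0.10]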

Dynamical properties


Various techniques originally developed for studying disordered magnetic systems (spin
glasses) have been successfully applied to simple neural network architectures, such as
the Hopfield network. Influential work by E. Gardner and B. Derrida has revealed many
interesting properties about perceptrons with real-valued synaptic weights, while later
work by W. Krauth and M. Mezard has extended these principles to binary-valued
synapses.
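For concreteness, here is a minimal Hopfield-network sketch in Python (my own illustration; the patterns, the Hebbian weight rule, and the update schedule are standard textbook choices, not taken from the works cited above):

import numpy as np

patterns = np.array([[1, -1, 1, -1, 1, -1, 1, -1],
                     [1, 1, 1, 1, -1, -1, -1, -1]])
n = patterns.shape[1]
W = sum(np.outer(p, p) for p in patterns) / n      # Hebbian outer-product weights
np.fill_diagonal(W, 0)                             # no self-connections

state = np.array([1, -1, 1, -1, 1, -1, -1, -1])    # the first pattern with one bit flipped
for _ in range(5):                                  # asynchronous sign updates
    for i in np.random.permutation(n):
        state[i] = 1 if W[i] @ state >= 0 else -1
print(state)                                        # typically settles back onto the first stored pattern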

About Fuzzy Logic


What is fuzzy logic? What's the difference between fuzzy logic and
Boolean logic? What are connectives in fuzzy logic?
Boolean or "two-valued" logic is traditional logic with all statements
either being true or false. Symbolic logic is something that you can
master. The hardest thing about symbolic logic is learning how to work
with the symbols. Once you know what all the symbols stand for, the
logic should come more easily.
Philosophers and Logicians

First, I'd like to do a bit of philosophizing as a way to lead into the logic. Philosophers
and logicians have a lot of overlap in what they do. Many logicians are also philosophers,
and all philosophers are logicians to some extent (some much more so than others).
Given that there is such a connection between philosophers and logicians, I find it
striking just how radically different the fields are. Philosophers are interested in finding
deep truths about the world, be they epistemological, metaphysical, ethical, etc. Logicians
(qua logicians) are only interested in using a set of rules to manipulate arbitrary symbols
that have no relevance to the world.

The (sometimes difficult) marriage between philosophy and logic comes from the fact
that everyone in the world (except, I would argue, people who are commonly called
"crazy") accepts the truths proven by logic to be universally true and unquestionable.
Philosophy needs logic because in order to establish that a philosophical doctrine is true,
one needs to show that the doctrine is universally and unquestionably true. One needs to,
in other words, make a demonstration that everyone would accept as proof that the
proposition is true. To do that, the philosopher needs logic.

[TOP]

Logic Takes Small Steps

Logic accomplishes this magical universal acceptance because it makes only little tiny
steps. It does silly little things like:

ASSUMING: The dog is brown.

AND ASSUMING: The dog weighs 15 lbs.

I CONCLUDE: The dog is brown and the dog weighs 15 lbs.


which anyone who understands what the word "and" means would agree with.

[TOP]

Shorthand Sentences

Logicians, though, are very lazy people. They don't like to write long derivations in
English because English sentences can be fairly long. So what they do instead is a kind of
shorthand. If you give a logician a sentence like

The dog is brown.

he will pick a letter and assign it to that sentence. He now knows that the letter is just
shorthand for the sentence. The way I learned logic, capital letters are used for sentences
(with the exception of U, V, W, X, Y, and Z; I'll get to those later).
So let's just start at the beginning of the alphabet and use the letter "A" to represent the
sentence "The dog is brown." While we're at it, let's use the letter "B" to represent the
sentence "The dog weighs 15 lbs."

In addition to saving time and ink, this practice of using capital letters to represent whole
sentences has a couple of other advantages. The first is that to a logician, not every word
is as interesting as every other. Logicians are extremely interested in the following list of
words:

and

or

if...then

if and only if

not

They call these words "connectives." This is because you can use them to connect
sentences that you already have together to make new sentences.

When you write in English, those words don't stand out; they just get lost in the middle of
sentences. Logicians want to make sure the words look special, so they take the whole
rest of the sentence (the part they don't care about) and use a single letter to represent
that. Then their favorite words stand out. Let's rewrite our earlier example about the dog
using our logician's shorthand:

ASSUMING: A

AND ASSUMING: B

I CONCLUDE: A and B

The other advantage of using capital letters to represent sentences is that you ignore all
the information that isn't relevant to what you're trying to do. For the derivation I did
above, it didn't matter that the sentences were both about some dog. It didn't matter that
they were about weight or color. They could have just as easily been sentences about how
tall the dog is or about a cat or a person or a war or whatever. And if we can do the
derivation for A and B, then we can do the same exact derivation for C and D or E and N
or any other sentences we like.

[TOP]

Connectives
Now, as I said before, logicians are lazy. They really don't want to have anything to do
with English. So instead of using the English words:

and

or

if...then

if and only if

not

they make up their own symbols for these:

For these words Logicians use this symbol


and ^
or v
if ... then ->
if and only if <->
not ~

(Sometimes they also use a triple equals sign for '<->', but I can't type that.)

Here are some examples:

You and I would write this A logician writes this


The dog is brown and the dog weighs 15 lbs. (A ^ B)
The dog is brown or the dog weighs 15 lbs. (A v B)
if the dog is brown, then the dog weighs 15 lbs. (A -> B)
The dog is brown if and only if the dog weighs 15 lbs. (A <-> B)
The dog is not brown. ~A

There are a few things to notice here:

1. The symbols: ^, v, ->, and <-> are called "two-place connectives." This is because
they connect two sentences together into a more complicated sentence.
2. The symbol: ~ is called a "one-place connective" because you only add it to one
sentence. (You cannot join multiple sentences together with it.) To negate a
sentence, all you have to do is stick a ~ on the front.

3. When you join two sentences with a two-place connective, you ALWAYS put
parentheses around it. So it is NOT appropriate to write this:

A^ B

That makes as much sense in symbolic logic as writing:

Nn7&% mm)]mm (

[TOP]

Parentheses

I know that a lot of books and instructors claim that it is okay to drop the outermost
parentheses in a sentence. I've done it myself many times. And 95% of the time it won't
cause you trouble if you're careful. But let's say we started with this 'sentence'

A^ B

and decided to negate it. Well, the way to negate a sentence is to stick a ~ on the front, so
let's do that:

~A ^ B

But wait! What we did there was just negate the A. We wanted to negate the whole
sentence. If we were really sharp, then we might notice that somebody had given us an
illegitimate sentence that was missing parentheses, and so we would add the parentheses
before adding the ~, to get:

~(A ^ B)
which is what we wanted.

It seems silly to make such a big deal about parentheses when we're dealing with simple
sentences, but when you're doing a 30-line derivation and you're tired, it's easy to make a
mistake just like that on line 17 and get yourself into real trouble. It's better to just
remember the simple rule and always add parentheses when you have a two-place
connective.

. . .
Let's take a deep breath and then go quickly over what we have so far.

[TOP]

Using Connectives

Connectives are logical terms,

^ (and)
v (or)
-> (if...then)
<-> (if and only if)
~ (not)
which you can add to a sentence.

A simple sentence is one that has no connectives. For example: A (the dog is brown).

A complex sentence is a sentence which is made up of one or more simple sentences and
one or more connectives. Some examples are:

(A ^ B)

(A v B)

(A -> B)

(A <-> B)

~A

You can use connectives on complex sentences just as you can on simple sentences. Let's
introduce a new simple sentence "it is raining," and let's call our new sentence C. We now
have a lot more sentences that we can make. (Keep in mind, we have no idea yet which of
these sentences are true or false; we also don't yet know how these sentences relate.) For
example:

(C ^ B)

~C

(B v C)

(C v B)
(B -> C)

(A -> C)

(C -> A)

(B <-> C)

(~B <-> C)

~(B <-> C)

~~C

((A ^ B) v C)

(((A ^ ~B) v ~C) -> (~(A v B) <-> C))

These can get a little complicated. That last sentence is especially scary looking; we'll
come back to it in a little while. For now, here is a quick run-down of how to use
connectives to make complex sentences from simple ones.

To make this complex sentence   Do this                        We say

~C                              Stick a ~ on C.                ~C is the negation of C.
(B v C)                         Use a v to join B and C.       (B v C) is the disjunction of B and C.
(B ^ C)                         Use a ^ to join B and C.       (B ^ C) is the conjunction of B and C.
(B -> C)                        Use a -> to join B and C.      B implies C. (B -> C) is a conditional or implication.
(B <-> C)                       Use a <-> to join B and C.     B implies C and C implies B. (B <-> C) is a biconditional.

[TOP]

Subsentences

Some of our sentences had more than one connective:


(~B <-> C)

~(B <-> C)

~~C

((A ^ B) v C)

(((A ^ ~B) v ~C) -> (~(A v B) <-> C))


The sentence
(~B <-> C)
is made by joining the sentences
~B

C
with
<->
The complex sentence
~B
is called a "subsentence" of the larger sentence
(~B <-> C)
because it is a smaller sentence inside the large one.

The simple sentence


C
is also a subsentence of the larger sentence
(~B <-> C)
The simple sentence
B
is a subsentence of the subsentence
~B
and so
B
is also a subsentence of
(~B <-> C)
There are two connectives used in the larger sentence
<->

~
but they are not equally important. In this case the
<->
is much more important than the
~
Remember how the sentence was made by taking the two smaller sentences
~B

C
and connecting them with a
<->

The <-> is therefore called the "main connective" of the sentence. Main connectives are,
without a doubt, absolutely the most important idea in logic. The hardest skill to learn in
logic is to identify the main connective of a sentence. Make sure you understand what
main connectives are.

Compare the sentence we've been looking at,

(~B <-> C)
with one that looks similar,
~(B <-> C)
This new sentence is very different. It was made by negating
(B <-> C)
The main connective of
~(B <-> C)
is therefore
~
and
(B <-> C)
is just a subsentence of
~(B <-> C)

[TOP]

Complicated sentences

Now let's take a closer look at the most complicated sentence on our list and see if we can
make it more manageable. The way to analyze a complicated sentence is to start at the
outside and work your way in.

The outermost parentheses on this ugly sentence


(((A ^ ~B) v ~C) -> (~(A v B) <-> C))
are used to connect these two sentences
((A ^ ~B) v ~C)

(~(A v B) <-> C)
with a
->
So the way to build our ugly sentence is to start with these two less ugly sentences:
((A ^ ~B) v ~C)

(~(A v B) <-> C)
and connect them with the main connective
->
We can then analyze each subsentence if we like.

I told you before that simple sentences are represented by the capital letters A through T,
and that U, W, X, Y, and Z are saved for something else. (I rarely use V because it looks
too much like the symbol for 'or'.) U, W, X, Y, and Z are used as shorthand for other
sentences in logic (some books use italic letters and others use Greek letters, but since I
only have plain text to work with, I use the end of the alphabet). I call these "sentence
variables."

So, just as we can take the English sentence


There is nothing on TV.
and use the capital letter D to represent it, we can take the sentence in logic
((A ^ ~B) v ~D)
and use the capital letter U to represent it.

[Note: it is also legal to use sentence variables to stand for simple sentences. So you can
take the simple sentence
B
and use the letter Z to stand for it.]

This can be useful in analyzing complicated sentences. For example, if we have the scary
looking sentence
((((A ^ ~B) v (B <-> C)) -> (~(C v D) ^ ~(~A -> ~~D))) v A)
we can start using sentence variables to stand for subsentences. So if U stands for
(A ^ ~B)
Then we have
(((U v (B <-> C)) -> (~(C v D) ^ ~(~A -> ~~D))) v A)
and if V stands for
(U v (B <-> C))
then we have
((V -> (~(C v D) ^ ~(~A -> ~~D))) v A)
If W stands for
~(C v D)
we have
((V -> (W ^ ~(~A -> ~~D))) v A)
and if X stands for
~(~A -> ~~D)
we have
((V -> (W ^ X)) v A)
and if Y stands for
(V -> (W ^ X))
we have
(Y v A)

So we know where our main connective is. And by substituting back in for the sentence
variables, we can recreate our sentence in manageable chunks.

It is very important to keep track of what sentence variables stand for when you're doing
this kind of substitution. This can be a major source of error if you're not keeping close
track of what every letter stands for.

. . .

Now that we know all the details of the language of symbolic logic, it's time to actually
do symbolic logic.

The first step with every sentence is to identify the main connective. The reason is
simple:

In symbolic logic, the main connective of a sentence is the only thing that you can
work with.
Let's look at our complicated sentence from earlier:

(((A ^ ~B) v ~C) -> (~(A v B) <-> C))

This sentence is fundamentally an implication between these two subsentences:
((A ^ ~B) v ~C)

(~(A v B) <-> C)
There is no
->
in either subsentence, but the sentence as a whole is still first and foremost an implication
because of what its main connective is. So when you're trying to figure out how in the
heck you can work with this ugly sentence
(((A ^ ~B) v ~C) -> (~(A v B) <-> C))
you need to remember that it is an implication and treat it just as one.
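Since finding the main connective is the skill everything else rests on, here is a small Python sketch (my own illustration, not part of the original text) that finds it by scanning a well-formed sentence at parenthesis depth 1:

def main_connective(s):
    s = s.strip()
    if s.startswith("~"):
        return "~"                      # a leading ~ negates the whole sentence
    depth = 0
    i = 0
    while i < len(s):
        if s[i] == "(":
            depth += 1
        elif s[i] == ")":
            depth -= 1
        elif depth == 1:                # look only directly inside the outermost parentheses
            if s[i:i+3] == "<->":
                return "<->"
            if s[i:i+2] == "->":
                return "->"
            if s[i] in ("^", "v"):
                return s[i]
        i += 1
    return None                         # no connective: a simple sentence like "A"

print(main_connective("(((A ^ ~B) v ~C) -> (~(A v B) <-> C))"))   # ->
print(main_connective("(~B <-> C)"))                              # <->
print(main_connective("~(B <-> C)"))                              # ~

Note that this relies on the parenthesis rule from earlier: if the outermost parentheses are missing, there is no depth-1 level to scan.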

[TOP]

What the Connectives Mean

Here's a quick course on what the connectives mean. (I assume you have some familiarity
with them.)

A          is TRUE whenever A is true; FALSE whenever A is false.
~A         is TRUE whenever A is false; FALSE whenever A is true.
(A ^ B)    is TRUE whenever A is true and B is true; FALSE whenever A is false, or B is false, or both are false.
(A v B)    is TRUE whenever A is true, or B is true, or both are true; FALSE whenever A is false and B is false.
(A <-> B)  is TRUE whenever A and B are both true, or A and B are both false; FALSE whenever A is false and B is true, or A is true and B is false.
(A -> B)   is TRUE whenever A is false, or B is true, or both; FALSE whenever A is true and B is false.

This last one is a little weird, so let's think about it. If we translate it back into English,
we get

If the dog is brown then the dog weighs 15 lbs.

How would we go about proving that this sentence is false?

Let's say that the dog is brown and the dog weighs 15 lbs. Does that disprove the if...then
statement? Certainly not!

What if the dog is brown but the dog weighs 25 lbs.? That does disprove the statement.

What if the dog turns out to be white? Then we cannot disprove the inference because it
only makes a prediction about a brown dog. If the dog isn't brown, then we can't test the
prediction.

So the only way to make the sentence

(A -> B)
false is to make A true and B false at the same time. Given any other values of A and B,
the sentence comes out true.
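If you want to check the table above mechanically, here is a tiny Python script (my own, not from the original) that prints every combination; notice that the A -> B column is False only on the row where A is true and B is false:

from itertools import product

def implies(a, b):
    return (not a) or b        # false only when a is true and b is false

for a, b in product([True, False], repeat=2):
    print(a, b, "| A^B:", a and b, "| AvB:", a or b, "| A->B:", implies(a, b), "| A<->B:", a == b)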

[TOP]

The Rules of Logic

Now we're finally ready to learn the rules of logic. There are exactly 12 - no more, no
less. Each connective has two rules associated with it, and there are two special rules.

Let's start with one of the special rules first.

[TOP]
1. Assumptions

The first special rule is the rule of assumptions. It is deceptively easy. The rule is:

You are allowed to assume anything you want at any time.


But there is a catch:
You have to keep track of what assumptions you have made.

Well that makes sense. Let's say you and I are detectives trying to solve a mystery. I could
say something like "let's assume for the time being that the dog is brown." Once I said
that, we could discuss what that would mean. Anything we conclude from that
assumption is perfectly okay, as long as we remember that it was under the assumption
that the dog is brown. In other words, whatever we do prove under the assumption that
the dog is brown must be followed by a disclaimer "assuming that the dog is brown."

Eventually, we would want to prove something about the case that doesn't depend on the
dog being brown. Logicians call this "discharging" the assumption. Fortunately, some of
our other rules tell us how to discharge assumptions.

1. When I do derivations, I number each new line. I start new assumptions using curly
brackets
{

2. and then I indent everything after a new assumption;

3. When I discharge an assumption I close the curly brackets

}
4. And then I stop indenting.

One other very important thing to keep in mind:


Once you close off an assumption, you can no longer use any lines between the
curly brackets. So since I've closed the curly brackets above, I would no longer be
able to use either of the two lines between them: they are gone forever. So lines
(2) and (3) above are illegal.

However, line (1) is legal because it is outside the curly brackets, and so is line
(4).
This can get complicated if you have assumptions inside of assumptions.

And finally, and perhaps central to logic:


A logical truth is something that you can write with all your assumptions
discharged.
Before we can do some short derivations, we need to learn two other rules. Let's start
with one of the two rules that we get from the
->
connective.

[TOP]

2. -> Introduction

The rule is called "-> introduction." The way it works is:

If you assume
X
and then you derive
Y
then you are entitled to discharge the assumption and write
(X -> Y)
That makes sense. Let's just say that we assumed
A

[The dog is brown]


And then we did some logic and out of that we proved
E

[The killer is a man]


If we did that, we would be entitled to say to a jury
(A -> E)

[If the dog is brown then the killer is a man]


The sentence
(A -> E)
is true.

[TOP]

3. ^ Elimination

Let's learn one more rule for now. This one is called "^ elimination."

If you have
(X ^ Y)
then you are entitled to
X
and you are also entitled to
Y
That makes sense too. Lets say we knew for a fact that
(A ^ B)

[The dog is brown and the dog weighs 15 lbs]


Then we would certainly be entitled to conclude
A

[The dog is brown]


and we would also certainly be entitled to conclude
B

[The dog weighs 15 lbs]

A Derivation

Now let's take an example of a derivation. Suppose I wanted to prove that this is a logical
truth
((A ^ B) -> A)
I would start by identifying the main connective, which is a ->. I know how to introduce a
new ->: assume the left and then derive the right. Let's try it:
{

1) (A ^ B) [assumption]

2) A [^elim on 1]

3) ((A ^ B) -> A) [->intro on 1-2]


We just used our 3 rules to derive
((A ^ B) -> A)

[If (the dog is brown and the dog weighs 15 lbs) then the dog is brown]
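You can also confirm semantically that ((A ^ B) -> A) is a logical truth: this little Python check (my own illustration) tries every assignment of A and B and finds the sentence true on all of them:

from itertools import product

def implies(a, b):
    return (not a) or b

print(all(implies(a and b, a) for a, b in product([True, False], repeat=2)))   # True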

[TOP]

4. Repetition

There's one other special rule. It's called "repetition." The rule simply says that if you
have
X
Then you are entitled to write
X
Provided that it was not inside a closed curly bracket.

[TOP]

5. ^ Introduction

The other rule with ^ is called "^ introduction." It says, if you have
X
and you also have
Y
then you are entitled to
(X ^ Y)
That makes sense too. Let's say that I have already proven
E

[The killer is a man]


And I have also proven
F

[The killer is tall]


Then I am certainly allowed to say to the jury
(E ^ F)

[The killer is a man and the killer is tall]

Continuing the Derivation

Let's continue our derivation using our new rules.

1) (A ^ B) [assumption]

2) A [^elim on 1]

3) ((A ^ B) -> A) [->intro on 1-2]

4) C [assumption]

5) ((A ^ B) -> A) [repetition of 3]

6) (((A ^ B) -> A) ^ C) [^intro on 4 and 5]

7) (C -> (((A ^ B) -> A) ^ C)) [->intro on 4-6]

[TOP]

6. -> Elimination

The other rule for -> is called "-> elimination." It says that if you have
X
and you have
(X -> Y)
then you are entitled to
Y
That makes sense too. If I know
A

[The dog is brown]


and I know
(A -> E)

[If the dog is brown then the killer is a man]


then I am certainly entitled to conclude
E

[The killer is a man]

Adding to the Derivation

Let's add a little more to our derivation:

1) (A ^ B) [assumption]

2) A [^elim on 1]

3) ((A ^ B) -> A) [->intro on 1-2]

4) C [assumption]

5) ((A ^ B) -> A) [repetition of 3]

6) (((A ^ B) -> A) ^ C) [^intro on 4 and 5]

7) (C -> (((A ^ B) -> A) ^ C)) [->intro on 4-6]

8) (A ^ B) [assumption]

9) ((A ^ B) -> A) [repetition of 3]

[Note: Line (9) is NOT a repetition of (5) because


(5) is inside closed curly brackets. (3) is not,
so it is okay to repeat it here.]

10) A [->elim on 8 and 9]


[Note: I did not discharge the assumption I made on line (8). So
A
is not a logical truth; it is true only on the assumption that
(A ^ B)
is true.]

[TOP]

7. <-> Introduction

The next two rules have to do with <->. The first is called "<-> introduction." It states
that if you have
(X -> Y)
and you have
(Y -> X)
then you are entitled to
(X <-> Y)
This one is a little tricky to explain, and the best way (I'm sorry to say) is truth tables. So
you should try all the possible combinations for X and Y and convince yourself that if
(X -> Y)
and
(Y -> X)
are both true, then
(X <-> Y)
must be true too.

[TOP]

8. <-> Elimination

The next rule is called "<-> elimination." This one says that if you have
(X <-> Y)
And you have
X
Then you are entitled to
Y
OR

If you have
(X <-> Y)
And you have
Y
Then you are entitled to
X
This makes sense because if you know
(X <-> Y)
then you know that X and Y have the same truth value. So if you know one of them is
true, then the other must also be true.

A New Derivation
Let's start a new derivation.

1) (A ^ B) [assumption]

2) A [^elim on 1]

3) B [^elim on 1]

4) (B ^ A) [^intro on 2 and 3]

5) ((A ^ B) -> (B ^ A)) [->intro on 1-4]

6) (B ^ A) [assumption]

7) B [^elim on 6]

8) A [^elim on 6]

9) (A ^ B) [^intro on 7 and 8]

10) ((B ^ A) -> (A ^ B)) [->intro on 6-9]

11) ((A ^ B) -> (B ^ A)) [repetition of 5]

12) ((A ^ B) <-> (B ^ A)) [<->intro on 10 and 11]

[TOP]

9. ~ Introduction

Next we have "~ introduction." It says that if you assume


X
And then you derive a contradiction, you are entitled to discharge the assumption and
write
~X
A contradiction is any sentence
Y
followed on the next line by the negation of that sentence
~Y
This rule is the familiar "reductio ad absurdum." An easy way to think of it is this. If we
assume
~F

[The killer does not have red hair]


And we prove from that
A

[The dog is brown]


and
~A

[The dog is not brown]


then something is wrong with our assumption.

[TOP]

10. ~ Elimination

"~ elimination" is almost identical. It says that if you assume


~X
and derive a contradiction, then you are entitled to discharge the assumption and write
X

A Quick Derivation

1) (A ^ ~A) [assumption]

2) A [^elim on 1]

3) ~A [^elim on 1]

4) ~(A ^ ~A) [~intro on 1-3]

Lastly, let's look at the rules for v.

[TOP]

11. v Introduction

The first is "v introduction." It says that if you have


X
then you are entitled to write
(X v Y)
no matter what Y is.

That seems a little strange. Normally you wouldn't think you can just go throwing any old
sentence into a derivation. But remember
(X v Y)
is true as long as X is true OR Y is true OR both are true. So if you already know that X
is true, then the disjunction of X and anything else will be true.

A Short Derivation

1) A [assumption]

2) A [repetition of 1]

3) (A -> A) [->intro on 1-2]

4) ((A -> A) v B) [vintro on 3]

[TOP]

12. v Elimination

The last rule is a little tricky. It's "v elimination." It says if you have
(X v Y)
and you have
(X -> Z)
and you have
(Y -> Z)
Then you are entitled to
Z

[Most of the time this means that when you have a disjunction that you don't know what
to do with, you have to derive an implication for each side of the disjunction before you
can go on.]

The rule is hard to do with derivations, but it is actually not too hard to understand if you
take an example.

Let's say we know


(A v B)

[The dog is brown or the dog weighs 15 lbs]


And we know
(A -> E)

[If the dog is brown, then the killer is a man]


And we know
(B -> E)

[If the dog weighs 15 lbs, the killer is a man]


Then we don't have to bother figuring out whether A is true or B is true; either way we
are entitled to
E

[The killer is a man]

Continuing Our Last Derivation

Let's continue our last derivation to get a demonstration of "velim."

1) A [assumption]

2) A [repetition of 1]

3) (A -> A) [->intro on 1-2]

4) ((A -> A) v B) [vintro on 3]

5) (A -> A) [assumption]

6) C [assumption]

7) C [repetition of 6]

9) (C -> C) [->intro on 6-7]

10) ((A -> A) -> (C -> C)) [->intro on 5-9]

11) B [assumption]

12) C [assumption]

13) C [repetition of 12]

}
14) (C -> C) [->intro on 12-13]

15) (B -> (C -> C)) [->intro on 11-14]

16) ((A -> A) v B) [repetition of 4]

17) ((A -> A) -> (C -> C)) [repetition of 10]

18) (B -> (C -> C)) [repetition of 15]

19) (C -> C) [velim on 16, 17, 18]

[Not the most efficient way to prove (C -> C), but it is valid.]

There are a lot of other rules people try to tell you, but anything you can do with those,
you can do with these 12 rules.

[TOP]

Why These 12 Rules? A Review

The reason I like these rules is that with these rules you can do any derivation using the
same five steps:

Step 1: Find the main connective of the sentence you are trying to derive.

Step 2: Apply the rule for introducing that main connective.

Step 3: When you're in the middle of a derivation and you don't know what to do,
find the main connective of the sentence you have and eliminate it.

Step 4: Along the way you may have to derive subsentences using steps 1 through
3.

Step 5: If all else fails, you may have to do a "~ elimination" [I'll explain this step
a little later].
If you use those five steps, you should always know which rule to use. The reason is that
there are ONLY four things you are ever allowed to do in a derivation:

1. Eliminate the main connective of the sentence you are on.

2. Use the sentence you are on to eliminate the main connective of another sentence
(AS LONG AS THAT OTHER SENTENCE ISN'T CLOSED OFF IN CURLY
BRACKETS).

3. Repeat an earlier line that isn't closed off in curly brackets.


4. Make a new assumption.

[TOP]

Mundane Rules: What Do You Have?

Now that we have the steps for doing derivations, let me try to explain that confusing
business about discharging assumptions. I'm going to approach this from a slightly
different angle this time.

Of the 12 rules I gave you, 8 are pretty straightforward. They are what I would call the
"Mundane Rules." The way Mundane Rules work is: they say "if you have X and Y and
Z, then you are entitled to U."

The tricky thing with Mundane Rules is knowing what you "have."

You "have" any sentence that is written down on a line of the derivation except those
which are closed off in curly brackets (which are gone forever once the brackets close).

Being "entitled" to something just means that you can legally write it down as the next
line of the derivation.

The easiest Mundane Rule is repetition:

If you have
X
then you are entitled to
X

Another Mundane Rule is ^ introduction:

If you have
X
and you have
Y
then you are entitled to
(X ^ Y)

Simple enough. (I went into more detail on _why_ this is a sound rule earlier.)

Another pretty easy Mundane Rule is ^ elimination:

If you have
(X ^ Y)
then you are entitled to
X
Or, if you prefer, you are also entitled to
Y

So far so good.

Another Mundane Rule is -> elimination:

If you have
X
and you have
(X -> Y)
then you are entitled to
Y

This is actually the same thing as Modus Ponens, so you can call it that if you prefer.
Since I don't speak Latin, I prefer calling it "-> elimination" because that is more
descriptive of what the rule is doing.

Another Mundane Rule is <-> introduction:

If you have
(X -> Y)
and you have
(Y -> X)
then you are entitled to
(X <-> Y)
This one is a little tricky to explain. Let's assume somehow we have
(X -> Y)
Under what conditions could that be true? There are 3 possibilities:
X is true and Y is true

X is false and Y is true

X is false and Y is false


Also, we have
(Y -> X)
That can only be true under these conditions:
X is true and Y is true

X is true and Y is false

X is false and Y is false


Since we "have" both of these sentences, then they must both be true. So under what
conditions are they both true? Well, only these two:
X is true and Y is true

X is false and Y is false


Which are exactly the conditions for:
(X <-> Y)

Which means we are entitled to write that.

Another Mundane rule is <-> elimination:

If we have
(X <-> Y)
and we have
X
then we are entitled to
Y
OR:

If we have
(X <-> Y)
and we have
Y
then we are entitled to
X

Another Mundane Rule is v introduction:

If you have
X
then you are entitled to
(X v Y)
and you are also entitled to
(Y v X)

This is a little tricky too. We "have" X, which means X must be true. Now we can just,
out of the blue, pick any sentence we like and put it into a disjunction with X. Why can
we do that? Well, let's say we pick a FALSE sentence. Is that still okay?

Yes, it is! Even if Y is false, the disjunction with X is still true, so we haven't written a
false sentence, and we are still okay.

The last Mundane Rule is v elimination:

If you have
(X v Y)
and you have
(X -> Z)
and you have
(Y -> Z)
then you are entitled to
Z
So much for the Mundane Rules. Mundane Rules are useful in derivations because they
let you move from one step to the next. They tell you what you can do with the sentences
you have. They also can give you a hint as to what you need to do next. For example, if
you have
(X v Y)
and you want to eliminate the v, but you don't have
(X -> Z)

(Y -> Z)
yet, then you'd better go get those two sentences.

The problem with the Mundane Rules is that they only let you play around with sentences
you already HAVE. You can't get anything NEW out of them.

So far we've gone over the 8 Mundane Rules. There are 12 rules in total. Of the 4
remaining, one is a 'Special Rule' and three are 'Fun Rules'.

[TOP]

Special Rule

The Special Rule is the rule of assumptions:

You are free to assume anything you like at any time as long as you do these things:

1. Use curly brackets and indentation to keep track of what you have assumed.

2. Only discharge the assumption using one of the Fun Rules.

[Discharging an assumption just means you close the curly brackets and stop
indenting. So you can forget about the assumption.]

[TOP]

Three 'Fun' Rules

The three Fun Rules all have this form:

If you assume
X
and then, on that assumption, you derive
Y
You can discharge the assumption you made at X and then you are entitled to
Z

The first Fun Rule is -> introduction:

If you assume
X
And then, on that assumption, you derive
Y
You can discharge the assumption you made at X and then you are entitled to
(X -> Y)

This is, in my opinion, the most important and fundamental rule in logic. It is the
foundation of all logic. [It's also really important for derivations. If you look up at the
Mundane Rules, a lot of them require you to have sentences of the form (X -> Y) to apply
them.]

The justification is that if you assume (but don't prove)


A

[The dog is brown]


and then, on that assumption, you derive
P

[The floor is wet]


then you HAVE NOT proven that the floor is wet, but you have PROVEN (no
assumptions required) that
(A -> P)

[If the dog is brown, then the floor is wet.]

The last two Fun Rules are closely related. One is ~ introduction:

If you assume
X
And then, on that assumption, you derive
Y
and
~Y
you can discharge the assumption you made at X; then you are entitled to
~X

[You may have been taught this rule as a reductio ad absurdum. The idea is that if
assuming X leads you to a contradiction, then there must've been something contradictory
ABOUT X ITSELF. So X must be false. If X is false, then by definition ~X is true: no ifs,
ands, buts, or assumptions about it.]
The last Fun Rule is ~ elimination:

If you assume
~X
and then, on that assumption, you derive
Y
and
~Y
you can discharge the assumption you made at ~X and then you are entitled to:
X

The idea here is basically the same. ~X is contradictory and therefore false, so X is
proven true.

I have another nickname for ~ elimination. It is what I call the "Fallback Rule." With
every other rule, the way it works is by getting what you want by introducing the main
connective or by using what you have by eliminating the main connective. But take a
look back at what I just did with ~ elimination. I just proved X is true. There's no way to
tell by looking at X that you can prove it by eliminating a ~, but you can. [Actually, any
sentence that can be proven at all can be proven with ~ elimination, but it is sometimes hard to do.]

So this is where Step 5 of the derivations comes from. If you are trying to prove some
sentence X, the first thing to try is to try to introduce the main connective of X. But if you
run into a dead end doing that, then assume ~X and try to derive a contradiction.

Those are the rules reviewed and better organized so that they make sense, and the
difficult bit about discharging assumptions is (I hope) a little clearer.

One other thing to watch out for. Some logic problems ask you to prove that a certain
sentence is a logical truth. On those problems, you have to discharge all your assumptions
and prove that the sentence is true with no assumptions (that is, write it without indenting
and outside of all the curly brackets). I'll do an example of a derivation like that in a
minute.

[TOP]

Deriving a Conclusion

Other logic problems give you a list of "givens" or "hypotheses" and ask you to derive a
conclusion from them. In those problems, what they are saying is that you need to assume
the hypotheses, but not discharge those assumptions. Let me give you an example:

Given:
A

(A -> B)
(~B v C)
Prove:
C

Here we go:

1) A [assumption, given]

2) (A -> B) [assumption, given]

3) (~B v C) [assumption, given]

4) A [repetition of 1]

5) (A -> B) [repetition of 2]

6) B [->elim on 4&5]

Now what do we do? We have a disjunction on 3 that we don't know what to do with, so
we need to eliminate it. But in order to eliminate it, we need to get (~B -> something) and
(C -> something).

Let's first work on getting (~B -> something).

{ [Note: this assumption isn't given,
so we're going to have to discharge it]

7) ~B [assumption]

Let's see if we can derive C. If we derive (~B -> C) then we'll be most of the way to
finishing the problem. How can we derive C? Well, we should try to introduce the main
connective. But wait! There is no main connective. C is just a simple sentence. So what
can we do? I guess we have to do step 5, try ~elimination.

{ [another assumption we'll have to discharge]

8) ~C [assumption]

9) B [repetition of 6]

10) ~B [repetition of 7]

} [Closes off lines 8-10]

11) C [~elim on 8-10]


Up to this point we haven't closed off any assumptions. That means that all of the lines up
to this point were sentences that we "have" and can use. But now we just closed off lines
8-10 by discharging the assumption at 8. That means that lines 8-10 are gone; they are
off-limits and illegal forever. The good news, though, is that we derived C, so now we can
discharge the assumption we made at line 7.

} [Closes off 7-11]

12) (~B -> C) [->intro on 7-11]

This may seem like a bit of sleight of hand, like I'm trying to pull the wool over your
eyes. How can I use the assumption I made at line 7 as part of the contradiction? I just did
a ~elimination to prove C, but there was nothing contradictory about ~C itself; the
contradiction was that I had B and then I assumed ~B. This is the familiar refrain
"anything can be proven from a contradiction." Once I assumed ~B, I could've proven
(~B -> anything-I-want), I chose to prove (~B -> C) because I eventually want to get C.

Now we have made some progress on eliminating the disjunction we had on line 3: (~B v
C). We have (~B -> C), now we need (C -> C), so let's go get it.

13) C [assumption]

14) C [repetition of 13]

Notice that I can repeat 13 because I have not yet discharged that assumption. However, I
cannot repeat the C on line 11 because I closed off line 11 already, so it is gone forever.

} [Closes off 13-14]

15) (C -> C) [->intro on 13-14]

16) (~B v C) [repetition of 3]

17) (~B -> C) [repetition of 12]

18) (C -> C) [repetition of 15]

19) C [velim on 16,17,&18]

Not all of those repetitions were necessary, since we "had" those lines already (they
hadn't been closed off), but I added them for clarity.

You should go back and double-check the derivation to make sure that I never broke the
rules by using a line that was closed off and that I didn't break any other rules. Also, make
sure that I discharged all the assumptions except the three I was given at the start.
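You can also double-check the result semantically. This small Python sketch (my own) confirms that every assignment of A, B, and C that makes the three givens true also makes C true:

from itertools import product

def implies(p, q):
    return (not p) or q

# rows where A, (A -> B), and (~B v C) are all true must also make C true
print(all(c
          for a, b, c in product([True, False], repeat=3)
          if a and implies(a, b) and ((not b) or c)))   # True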

[TOP]
Deriving a Sentence

As I said before, the other type of problem is where you are handed a sentence and told to
derive it. In this problem, you can make any assumptions you need, but you have to
discharge all of them and end up with the sentence you're looking for at the end. This
usually involves finding the main connective of the sentence you're supposed to prove
and then introducing it (sometimes you have to find other sentences too: for example if
the main connective is ^, you need to prove each half of the sentence and then do an
^intro).

Sometimes trying to do this will get you to a dead end, and then you may try to assume
the negation of the sentence you're trying to get and see if you can find a contradiction.

Let's do an example. Let's try to prove

(((~A ^ ~C) v (~C <-> B)) -> (B -> ~C))

The main connective is a ->, so let's introduce it. To do that we need to assume the left
and derive the right.

1) ((~A ^ ~C) v (~C <-> B)) [assumption]

Here we have a disjunction, so we need to eliminate it. That means we need to find two
entailments. It would be great if we had ((~A ^ ~C) -> (B -> ~C)) and ((~C <-> B) -> (B
-> ~C)), so let's try to get those. First, let's work on ((~A ^ ~C) -> (B -> ~C)).

2) (~A ^ ~C) [assumption]

3) ~A [^elim on 2]

4) ~C [^elim on 2]

We actually don't need line 3, but it's good practice to get both sides of an ^ while you
can just in case you might need them later. Now we have ~C, but what we want is (B ->
~C), so let's work toward getting that.

5) B [assumption]

6) ~C [repetition of 4]

} [closes 5-6]

7) (B -> ~C) [->intro on 5-6]


} [closes 2-7]

8) ((~A ^ ~C) -> (B -> ~C)) [->intro on 2-7]

So now what we need to finish the velimination on line 1 is to derive ((~C <-> B) -> (B
-> ~C)).

9) (~C <-> B) [assumption]

So now we need (B -> ~C).

10) B [assumption]

11) ~C [<->elim on 9-10]

If you wanted to, you could make this a little clearer by repeating (~C <-> B) and
then doing the <->elim, but it's not necessary: since we have both (~C <-> B) and B, we
are entitled to ~C.

} [closes 10-11]

12) (B -> ~C) [->intro on 10-11]

} [closes 9-12]

13) ((~C <-> B) -> (B -> ~C)) [->intro on 9-12]

Now we can do our velimination on line 1 because we have ((~C <-> B) -> (B -> ~C))
and ((~A ^ ~C) -> (B -> ~C)). If you want, for clarity, you can repeat line 1 and line 8,
but it's not necessary.

14) (B -> ~C) [velim on 1,8,13]

} [closes 1-14]

15) (((~A ^ ~C) v (~C <-> B)) -> (B -> ~C)) [->intro on 1-14]

As always, you should go back and double-check the derivation once you're done.

The hardest thing about doing derivations is figuring out what to do next. When you have
a lot of random rules with Latin names to choose from, it's difficult. This set of rules
helps you to know what to do by either introducing what you're trying to get or
eliminating what you have.

My advice to you is to try to do some of the derivations in your book or that you had for
your class using these rules. Any derivation is possible with them. It takes a lot of time to
learn logic and have it sink in, but if you take it slowly enough and practice, it will
become easier. Get comfortable with these 12 basic rules and the 5-step method for doing
derivations.

[TOP]

Rules with Latin Names

In many (probably most) places, logic isn't taught with these 12 rules. Even though I
think these rules make the most sense and allow a straightforward approach to solving
any problem in symbolic logic, more advanced students may want to study rules such
as Modus Tollens and DeMorgan's Law.

The twelve rules I've presented here are systematic and straightforward, and all of them
move by baby steps. The way I think of these rules (in some cases this is not historically
accurate) is that logicians noticed that when doing derivations, they often repeated the
same steps over and over. Eventually, someone decided that rather than doing these same
five or ten steps, you can take shortcuts.

Modus Tollens serves as an instructive example. Let's say I have:

(A -> B)
And I have:
~B
And I am trying to get:
~A
Here's what I'd have to do. Since I'm trying to find ~A, I'll do a ~introduction on A:
{

1) (A -> B) [assumption, given]

2) ~B [assumption, given]

3) A [new assumption]

4) B [->elim on 1 and 3]

5) ~B [repetition of 2]

} [closes off 3-5]

6) ~A [~intro 3-5]

We have to do this so often that we just call these five steps "Modus Tollens." Modus
Tollens is a shortcut rule. There are several others too, some more involved.
The following is a list of the major rules, together with a justification of why each of
them is valid and a short example of how you might use some of the more challenging
ones.

[TOP]

1. Modus Ponens

This is one of the most straightforward laws in logic. It states that if you have

(X -> Y)

and you have

X

then you are entitled to

Y

This is just what we've been calling "-> elimination."

The reason it works is that we are given (X -> Y). Which means that X cannot be true at
the same time Y is false. So if X is true (which is the other given), then Y must be true as
well, so we are free to conclude Y is true.

Example: "If it is raining, then there are clouds" and "it is raining" together imply "there
are clouds."

[TOP]

2. Modus Tollens

This law is just the flip side of modus ponens. It states that if you have

(X -> Y)

and you have

~Y

then you are entitled to

~X
The reason this works is that we are again given (X -> Y). This means that X cannot be
true at the same time Y is false. So if Y is false (which is the other given), then X must be
false as well. So we are free to conclude X is false (or ~X is true).

Example: "If it is raining, then there are clouds" and "there are no clouds" together imply
"it is not raining."

[TOP]

3. DeMorgan's Law (I)

DeMorgan came up with a couple sets of equivalencies. The first is that if you have

~(X ^ Y)

then you can conclude

(~X v ~Y)

and if you have

(~X v ~Y)

then you can conclude

~(X ^ Y)

The reason this works is that our starting point is ~(X ^ Y), which is the negation of (X ^
Y). Now, (X ^ Y) can only be true if X is true and Y is also true. So (X ^ Y) will be false
if X is false or if Y is false. That is, (X ^ Y) will be false if (~X v ~Y) is true. So ~(X ^ Y)
is equivalent to (~X v ~Y).

Example: "My dog is fat, or my cat is fat" is equivalent to "It is not true that both my dog
and cat are thin."

[TOP]

4. DeMorgan's Law (II)

The second equivalence which bears DeMorgan's name is that

~(X v Y)

is interchangeable with

(~X ^ ~Y)
The only way in which ~(X v Y) can be true is if X and Y are both false. So the two
expressions can be interchanged just like in the first law.

Example: "My dog is fat and my cat is fat" is equivalent to "It is not true that my dog or
cat is thin."

[TOP]

5. Hypothetical Syllogism

The rule here is that if you have

(X -> Y)

and you have

(Y -> Z)

then you can conclude

(X -> Z)

Here's why:

We know that "if X is true, then Y is true." And we know that "if Y is true, then Z is true."
But we don't know anything about whether any of the letters are actually true or not.

Let's assume (or hypothesize) for a second that X is true. Then, by modus ponens, Y is
true. And then by modus ponens again, Z is true. So: If we assume X is true, then we
conclude Z is true. Since we didn't know X was true, we cannot take Z home with us, but
we can say that "If X was true, then Z would be true." This is equivalent to saying "If X,
then Z" or (X -> Z).

Example: "If it is raining, then there are clouds" together with "if there are clouds, then
the sun will be blocked" imply "if it is raining, then the sun will be blocked."

[TOP]

6. Disjunctive Syllogism

The rule here is that if you have

(X v Y)

and you have

~X
then you can conclude

Y

Here's why:

We know first of all that "X or Y is true." We also know that X is false. If X or Y is true,
and X is false, then Y has no choice but to be true. So we can conclude that Y is true.

Example: "My dog is fat or my cat is fat" together with "my dog is thin" imply "my cat is
fat."

[TOP]

7. Reductio Ad Absurdum (Proof by Contradiction)

This rule states that if you assume

X
and, from that, you conclude a contradiction, such as

(Y ^ ~Y)

then you can conclude that your assumption was false, and

~X

must be true. You can find a more complete explanation of this at

Proof by Contradiction
http://mathforum.org/library/drmath/view/62852.html

[TOP]

8. Double Negation

This rule simply states that if you have

~~X

then you can interchange that with

X

which should be apparent based on what the ~ means.


[TOP]

9. Switcheroo

(I've heard that this was actually named after a person, but I don't know that for certain.)

This is a shortcut rule which states that if you have

(X v Y)

then you can interchange that with

(~X -> Y)

To understand why, let's think about (~X -> Y). This says that ~X cannot be true at the
same time that Y is false. Or, to put that another way, X cannot be false at the same time
Y is false.

So (~X -> Y) can only be false when X and Y are both false. Similarly, the only way for
(X v Y) to be false is to have X and Y both false. So the two expressions are true unless X
and Y are both false, so they have the same "truth conditions" and are therefore
equivalent (i.e. interchangeable).

Example: "My dog is fat, or my cat is fat" is equivalent to "If my dog is thin, then my cat
is fat."

(This one is hard to wrap your mind around, but think about what must be true/false
about the world in order to make each statement true or false and it should eventually
become clear.)

[TOP]

10. Disjunctive Addition

This is just what we've been calling "v introduction."

[TOP]

11. Simplification

This is just what we've been calling "^ elimination."

[TOP]

12. Rule of Joining

This is just what we've been calling "^ introduction."


The problem with shortcut rules is that they're easy to misuse. In my opinion, the best
way to learn them is to practice with the twelve systematic rules and if you find yourself
doing the same steps over and over, you may have found a shortcut rule.

If there's a rule you don't understand, try to use the twelve systematic rules to figure out
how the rule works. Once you see the steps in deriving the rule and you know why it is a
valid shortcut, you won't have any trouble using it. And remember, if you get stuck and
don't know what to do, you can always fall back on the twelve systematic rules.
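If you like, you can also confirm semantically that the shortcut rules above are valid. This Python sketch (my own, not part of the original) brute-forces their truth tables:

from itertools import product

def implies(p, q):
    return (not p) or q

rows2 = list(product([True, False], repeat=2))
rows3 = list(product([True, False], repeat=3))

checks = {
    "Modus Tollens": all(not x for x, y in rows2 if implies(x, y) and not y),
    "DeMorgan I": all((not (x and y)) == ((not x) or (not y)) for x, y in rows2),
    "DeMorgan II": all((not (x or y)) == ((not x) and (not y)) for x, y in rows2),
    "Hypothetical Syllogism": all(implies(x, z) for x, y, z in rows3 if implies(x, y) and implies(y, z)),
    "Disjunctive Syllogism": all(y for x, y in rows2 if (x or y) and not x),
    "Double Negation": all((not (not x)) == x for x in (True, False)),
    "Switcheroo": all((x or y) == implies(not x, y) for x, y in rows2),
}
for name, valid in checks.items():
    print(name, valid)     # every rule prints True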

Fuzzy or "multi-valued" logic is a variation of traditional logic in
which there are many (sometimes infinitely many) possible truth values
for a statement. True is considered equal to a truth value of 1, false
is a truth value of 0, and the real numbers between 1 and 0 are
intermediate values.

What is Fuzzy Logic?


The easy definition is that fuzzy logic is a kind of logic in which
propositions don't have to be either true or false.

In normal binary logic, the answer to a question like "Is Joe tall?"
would have to be either "yes" or "no" - either 1 or 0. In terms of
attributes, Joe would either have "tallness," or he wouldn't.

This is one of the things that makes binary logic break down so easily
when you try to apply it to the real world, where people are "sort of
tall," food is "mostly cooked," cars are "pretty fast," jewelry is
"very expensive," patients are "barely conscious," and so on.

To paraphrase Einstein, to the extent that binary logic applies to
reality, it is not certain; and to the extent that it is certain, it
doesn't apply to reality.

In fuzzy logic, Joe can have a tallness value of, say, 0.9, which can
combine with values for other attributes to produce "conclusions" that
look more like "The clothes have a dryness value of 0.91" than "The
clothes are dry" or "The clothes are not dry." It is frequently used
to control physical processes - washing or drying clothes, toasting
bread, bringing trains to smooth stops at the right places, keeping
planes on course, and so on. It's also used to support decisions -
whether to buy or sell stock, whether to support or oppose a
particular political position, and so on.

In short, fuzzy logic provides an alternative to Boolean logic that is useful
whenever you want to be able to express attributes in shades of gray,
rather than as black or white.

As far as the more advanced question of connectives in fuzzy logic,
here are what I believe are the generally accepted rules.

You start with the basic connectives in symbolic logic:


^ (and)
v (or)
-> (if, then)
<-> (if and only if)
~ (not)

And extend them.

Let's start with ^ (and). In Boolean logic (A ^ B) is true if and only
if both A is true and B is true. We can say that in a different way:

(A ^ B) has a value of 1 if A has a value of 1 and B has a value of 1;
if either A or B has a value of 0, then the conjunction has a value of 0.

So the way this is traditionally extended to fuzzy logic is to say
that the conjunction (A ^ B) carries the minimum truth value of A or
B.

For example, if A has a truth value of 1 and B has a truth value of
0.8, then the minimum of these two is 0.8 and the conjunction (A ^ B)
will carry the truth value of 0.8.

Next, let's talk about v (or). In Boolean logic (A v B) is true if A
is true or if B is true or if both are true. We can say that in a
different way:

(A v B) has a value of 1 if A has a value of 1 or B has a value of 1;
if both A and B have a value of 0, then the disjunction has a value of 0.

So the way this is traditionally extended to fuzzy logic is to say
that the disjunction (A v B) carries the maximum truth value of A or
B.

For example, if A has a truth value of 0.2 and B has a truth value of
0.6, then the maximum of these is 0.6 and the disjunction (A v B) will
carry the truth value of 0.6.
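In code (my own illustration), the fuzzy versions of ^ and v are simply min and max:

def fuzzy_and(a, b):
    return min(a, b)       # conjunction takes the minimum of the two truth values

def fuzzy_or(a, b):
    return max(a, b)       # disjunction takes the maximum

print(fuzzy_and(1.0, 0.8))   # 0.8, as in the ^ example above
print(fuzzy_or(0.2, 0.6))    # 0.6, as in the v example above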

Next, let's talk about ~ (not). Unlike the other connectives, ~
doesn't join two sentences; it is only applied to a single sentence.

In Boolean logic, ~A carries the opposite truth value of A. So it is
false if A is true and true if A is false.

Or: ~A has a value of 0 if A has a value of 1 and a value of 1 if A
has a value of 0.

The way this is extended to fuzzy logic is to say that ~A has a truth
value equal to 1 minus the truth value of A.

So, for example, if A has a value of 0.3, then 1 - 0.3 = 0.7, so ~A
has a value of 0.7.

Next, let's talk about -> (if, then). In Boolean logic, (A -> B) is
ONLY false if A is true AND B is false; in all other cases it is true.
Or, (A -> B) carries a value of 1 UNLESS A has a value of 1 and B has
a value of 0. So, if A has a value of 0, then (A -> B) will definitely
have a value of 1; or if B has a value of 1 then (A -> B) will
definitely have a value of 1.

So (A -> B) is at least as true as the opposite of A, and it is also
at least as true as B. So in Boolean logic, (A -> B) has a truth value
equal to the maximum of ~A and B.

Fuzzy logic uses this definition. The truth value of (A -> B) is equal
to the maximum of the truth value of ~A and the truth value of B.

For example, if B has a truth value of 0.5 and A has a truth value of
0.4, then ~A has a value of 0.6. The maximum of B (0.5) and ~A (0.6)
is 0.6 so (A -> B) will have a value of 0.6.
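Continuing the same illustration in code, ~ and -> become:

def fuzzy_not(a):
    return 1 - a                      # negation is 1 minus the truth value

def fuzzy_implies(a, b):
    return max(fuzzy_not(a), b)       # (A -> B) takes the maximum of ~A and B

print(round(fuzzy_not(0.3), 2))            # 0.7
print(round(fuzzy_implies(0.4, 0.5), 2))   # 0.6, as in the example above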

The last connective, <-> (if and only if), is the hardest to extend to
fuzzy logic.

In Boolean logic, (A <-> B) is true if both A and B have the same
truth value, and false otherwise. Or you can say:

(A <-> B) has a value of 1 if A and B both have a value of 1 or if A and B
both have a value of 0, and it has a value of 0 otherwise (i.e. if A has a
value of 0 and B has a value of 1, or vice versa).

My first guess for how to apply this to fuzzy logic was that it is
simply an equals relation. So if A and B have equal values, then
(A <-> B) will have a value of 1; otherwise it will have a value of 0.
This is okay as a first guess, but the problem is that now (A <-> B)
can only have 1 or 0 as a value, which is not really very fuzzy at
all.

Let's go back to Boolean logic for a minute and think a bit more about
(A <-> B). This is read "A if and only if B." It means that if we know
A is true, then we can conclude that B is true AND if we know B is
true, then we can conclude that A is true. In other words, (A <-> B)
is equivalent to the conjunction of (A -> B) and (B -> A) or to put it
more formally:

(A <-> B) = ((A -> B) ^ (B -> A))

But we've already figured out how to do fuzzy logic on these
connectives. So let's just apply that to our <-> connective.

Specifically:

(A <-> B) has a value equal to the minimum (conjunction) of:


(A -> B)
and
(B -> A)

So first you figure out the value of (A -> B) (which is the maximum
of ~A and B), then you figure out the value of (B -> A) (which is the
maximum of ~B and A), and then you take the minimum of those.

That's pretty complicated, so before we do an example calculation with
fuzzy logic, let's make sure it works with two-valued (Boolean) logic.

Let's say that A and B both have truth values of 0. What does
(A <-> B) have in this case?

First, we take the maximum of ~A and B. ~A will have a value of 1 and
B will have a value of 0. So the maximum is 1.

Second, we take the maximum of ~B and A. ~B will have a value of 1 and
A will have a value of 0. So the maximum is 1.

Finally, we take the minimum of the first two steps above. Step one
gave us 1 and step two gave us 1, so the minimum is 1.

What if A has a truth value of 1 and B has a truth value of 0?

First, we take the maximum of ~A and B. ~A will have a value of 0 and
B will have a value of 0. So the maximum is 0.

Second, we take the maximum of ~B and A. ~B will have a value of 1 and
A will have a value of 1. So the maximum is 1.

Finally, we take the minimum of the first two steps above. Step one
gave us 0 and step two gave us 1, so the minimum is 0.

For a fuzzy logic example, let's say that A has a value of 0.7 and B
has a value of 0.5. What is the value of (A <-> B)?

First, we take the maximum of ~A and B. ~A will have a value of 0.3
and B will have a value of 0.5. So the maximum is 0.5.

Second, we take the maximum of ~B and A. ~B will have a value of 0.5
and A will have a value of 0.7. So the maximum is 0.7.

Finally, we take the minimum of the first two steps above. Step one
gave us 0.5 and step two gave us 0.7, so the minimum is 0.5. So in
this example (A <-> B) has a truth value of 0.5.
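Putting the pieces together in code (my own illustration), the fuzzy <-> reproduces the numbers worked out above:

def fuzzy_not(a):
    return 1 - a

def fuzzy_implies(a, b):
    return max(fuzzy_not(a), b)

def fuzzy_iff(a, b):
    return min(fuzzy_implies(a, b), fuzzy_implies(b, a))   # ((A -> B) ^ (B -> A))

print(fuzzy_iff(0.0, 0.0))   # 1.0  (the first Boolean check)
print(fuzzy_iff(1.0, 0.0))   # 0.0  (the second Boolean check)
print(fuzzy_iff(0.7, 0.5))   # 0.5  (the fuzzy example)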
