Is it...
• A type of applied computer science? (But what about theoretical and applied AI?)
• A branch of cognitive science?
• A science at all?
What is intelligence?
• What distinguishes human behavior from the behavior of everything else?
• Whatever behaviors we don't understand?
• Whatever "works", that is, results in survival (or flourishing) in complex
environments
Possible goals
• To understand intelligence independent of any particular "implementation"
• To model human or animal behavior
• To model human thought
• To implement rational thought
• To implement rational action
Introduction to representation
Why representation?
• Tasks: going from inputs (stimuli) to outputs (responses)
• The most primitive solution: a lookup table which specifies an output for every
input
• The problem with lookup tables:
o There may be too many inputs to store (the world is continuous, after all).
o There is a need for the system to be able to respond appropriately to novel
inputs.
• The alternative solution: a function from inputs to outputs
o AI is about these functions: what they might look like for tasks requiring
"intelligence".
o The functions may be very complex, requiring one or more
transformations of the input on the way to the output: internal
representations.
• Deduction, inference: given some facts, infer one or more other facts that follow
from them.
• Induction: given some examples, create a general rule or category that covers
them all.
Given multiple instances of scenes, create the rule relating ON and UNDER.
Given multiple instances of blocks, create the category BLOCK and use it to
categorize new instances.
Predicate calculus
What should a good representational format do for us?
• It should be correct. It should represent what we think it represents, permitting the
inferences that we would like the system to make and failing to make inferences
that we would not like (together with an inference mechanism).
• It should be expressive; it should allow us to distinguish all situations that we
need to distinguish (it should be unambiguous). It should allow us to point to
entities that need to be pointed to.
• It should treat situations which we believe are similar as similar.
• It should be flexible, allowing us to represent the same situation in different ways,
reflecting different construals.
Miscellaneous considerations
• Connectives: conjunction (and), disjunction (or), implication (if), equivalence
(equiv), negation (not): truth of each defined in terms of the truth of its
operand(s)
• Existential quantification
(equiv (if p q)
(if (not q) (not p)))
(equiv (if p q)
(or (not p) q))
• infer takes a database and returns a list of sentences that can be inferred from the
database
• true? takes a database and a sentence and tells whether the sentence is true, given
the database. dont-know is a possibility.
When a moving ball strikes a stationary ball, the moving ball is deflected in a direction
which is roughly opposite to its original direction, and the originally stationary ball
starts moving roughly in the direction of the originally moving ball. Ball 3 struck Ball 2,
which was stationary. Ball 3 was moving south when this happened.
(Ignore details like velocity, the effect of the mass of the balls, and the angle at which the
moving ball strikes the stationary ball (unless you want to be really ambitious :-). This is
very naive physics.)
Search 1
Problem solving as search
• A solution as a state in the space of possible solutions
• Problem solving as search through this state space
Formalizing search
• Problem state: a particular configuration of the objects that are relevant for a
problem
Figuring out how to represent the problem states is no simple matter.
• State space: the space of all possible problem states
Normally only a partial state space is considered.
• Initial state: where the search starts
• Goal states: where the search should end
There may be any number of these, and we may be interested in finding only one
or all of them.
• State space search: search through the space of problem states for a goal state
• Search trees: nodes are states (or paths), links are one-step transitions from one
state to a successor state
• Expanding a node (extending a state/path)
• Testing for goal states (nodes)
• The queue of untried states (paths); adding new states (paths) to the queue (stack)
• Branching factor of the search: average number of successor states each state has
• Depth of the search: how far down the tree is extended during the search
1. A predicate goal?, which takes a state and returns #t if the state is a goal state
2. A procedure expand, which takes a state and returns all of the successor states to
the state
• Form a one-element queue consisting of a zero-length path that contains only the
root node.
• Until the first path in the queue terminates at a goal node (satisfies goal?) or the
queue is empty,
o Remove the first path from the queue; create new paths by expanding the
first path to all the neighbors of the terminal node.
o Reject all new paths with loops.
o ...
o Add the new paths, if any, to the queue.
o ...
• If the goal node is found, announce success and return the path to the goal state
found; otherwise, announce failure.
To implement the different types of search, we can add a merge-queue argument to our search
procedure. There is a different merge-queue for each of the ways we will add new paths to
the queue.
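The search loop described earlier can be sketched in Python (the notes use Scheme; this translation, and passing `goal`, `expand`, and `merge_queue` in as procedures, is one way to mirror the text):

```python
def search(start, goal, expand, merge_queue):
    """State-space search over paths (a path is a list of states).
    merge_queue determines the type of search by deciding how new
    paths are combined with the queue."""
    queue = [[start]]
    while queue:
        path = queue.pop(0)              # remove the first path
        state = path[-1]                 # the path's terminal node
        if goal(state):
            return path                  # success: path to a goal state
        # Expand the terminal node; reject new paths with loops.
        new_paths = [path + [s] for s in expand(state) if s not in path]
        queue = merge_queue(queue, new_paths)
    return None                          # failure: queue exhausted

def breadth_first_merge(queue, new_paths):
    return queue + new_paths             # add new paths to the back

def depth_first_merge(queue, new_paths):
    return new_paths + queue             # add new paths to the front
```

With `expand` returning a state's neighbors in a graph, `breadth_first_merge` finds a shortest path first; swapping in `depth_first_merge` changes the search type without touching the loop.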
Tower of Hanoi
• To do best-first search using your basic search procedure, you only need to define
a new merge-queue procedure. best-first-merge takes the queue, the new
paths, and the problem-specific estimate procedure, adds the new paths to the
queue, and then sorts the new queue by the values returned by estimate when
applied to the first state on each path. You can use the Scheme procedure sort to
do this. sort takes a comparison predicate and a list and sorts the list using the
predicate to compare items.
• Figure out how you will represent states for your particular problem. Here's one
way for Tower of Hanoi.
o For the Tower of Hanoi Puzzle, states are lists consisting of a sublist for
the disks on each peg. Numbers represent the diameters of the disks, and
they are arranged from top to bottom. Thus this is the initial state for the 3-
disk puzzle:
((1 2 3) () ())
• Write the goal?, expand, estimate, and print-state procedures for your
particular problem.
o For the Tower of Hanoi Puzzle, these are one possibility.
• Call the search procedure on the problem-specific states and procedures.
o Best-first search on the Tower of Hanoi Puzzle: a trace indicating states
which are searched and the queue at each point in the search
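The problem-specific procedures can be sketched in Python, using nested tuples for the state representation above (`((1, 2, 3), (), ())` for the 3-disk initial state). The `estimate` heuristic shown here, the number of disks not yet on the third peg, is one simple possibility, not the only choice:

```python
def goal(state):
    """Goal: all disks have been moved to the third peg."""
    return state[0] == () and state[1] == ()

def expand(state):
    """All legal successor states: move the top disk of some peg to a
    peg whose top disk (if any) is larger."""
    successors = []
    for i, src in enumerate(state):
        if not src:
            continue
        disk = src[0]                        # top disk on peg i
        for j, dst in enumerate(state):
            if i != j and (not dst or disk < dst[0]):
                new = list(state)
                new[i] = src[1:]             # remove disk from peg i
                new[j] = (disk,) + dst       # place it on top of peg j
                successors.append(tuple(new))
    return successors

def estimate(state):
    """A simple illustrative heuristic: disks not yet on the third peg."""
    return len(state[0]) + len(state[1])
```

A `print_state` procedure would just format the state tuple for tracing; best-first search would apply `estimate` to the first state on each path when sorting the queue.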
Heuristic search: using estimated distance remaining
• Another procedure specific to the problem: estimate, which takes a state and
estimates the distance from it to a goal state
• Hill climbing
Sort the new paths by the estimated distance left to the goal (using estimate) and
add them to the front of the queue.
o Parameter-oriented hill climbing: each problem state is a setting for a set
of parameters
o Problems for hill-climbing: foothills (local maxima), plateaus, ridges
o Nondeterministic search as a way of escaping from local maxima
o Gradient ascent
For a parameter x and "goodness" g which is a smooth function of x, the
change in x should be proportional to the speed with which g changes as a
function of x, that is, ∂g/∂x.
• Best-first search
After adding new paths to the queue, sort all the paths in the queue by the
estimated distance left to the goal (using estimate).
Unlike hill climbing, can jump around in the search space.
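The gradient ascent rule above, a change in x proportional to ∂g/∂x, can be sketched numerically (the learning rate and the finite-difference slope estimate are implementation choices, not part of the rule itself):

```python
def gradient_ascent(g, x, rate=0.1, steps=100, dx=1e-6):
    """Repeatedly move x in the direction in which g increases, by an
    amount proportional to the slope dg/dx, estimated numerically here
    with a central difference."""
    for _ in range(steps):
        slope = (g(x + dx) - g(x - dx)) / (2 * dx)
        x += rate * slope
    return x

# Example: g(x) = -(x - 2)^2 is smooth with its maximum at x = 2.
```

Like hill climbing generally, this only finds the top of whatever hill it starts on: a local maximum.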
Genetic search
• Parameter adjustment, function optimization problems
o Calculus methods
o Blind search and random methods
o Parallel search
• Evolutionary computation
o What's needed for evolutionary computation (abstract or real) to work
"Creatures" which
1. Give birth to other creatures, passing on their traits to them
2. Die
A way for traits to be passed on: an inherited genotype, in
interaction with the environment, results in a phenotype
A way of evaluating the creatures' traits: some aid in survival or
reproduction, others don't ("survival of the fittest")
A way of generating new traits: mutation
o How it works
Each creature is born with some combination of traits; it may not
be possible to simply figure out what combination works best for
the environment of the creatures.
Creatures live their lives. Some survive long enough to reproduce
and have offspring.
If a particular trait or combination of traits helps creatures
reproduce or live longer, creatures with that trait (those traits) will
tend to have more offspring.
Creatures pass on (at least some of) their traits to their offspring. In
sexual reproduction, they pass on a combination of the parents'
traits.
The percentage of creatures with the good traits should increase on
each generation.
There is a small probability that a new creature will end up with
some random traits which it did not inherit from its parent(s). In
this way new traits or combinations of traits can be tried out in the
world.
o Genetic algorithms
What makes them special
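A minimal genetic algorithm in this spirit, applied to the illustrative "one-max" problem (maximize the number of 1 bits in a genotype); fitness-proportional selection, single-point crossover, and bit-flip mutation are the ingredients described above, and all the parameter values are arbitrary choices:

```python
import random

def genetic_search(fitness, length=20, pop_size=30,
                   generations=60, mutation_rate=0.02, seed=0):
    """Minimal genetic algorithm sketch: fitness-proportional selection
    ("survival of the fittest"), single-point crossover, and random
    bit-flip mutation over a population of bit-string genotypes."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Fitter creatures are more likely to be chosen as parents.
        weights = [fitness(c) + 1e-9 for c in pop]
        parents = rng.choices(pop, weights=weights, k=pop_size)
        next_pop = []
        for i in range(0, pop_size, 2):
            a, b = parents[i], parents[i + 1]
            point = rng.randrange(1, length)          # crossover point
            for child in (a[:point] + b[point:], b[:point] + a[point:]):
                # Mutation: occasionally flip a bit, introducing traits
                # not inherited from either parent.
                child = [bit ^ 1 if rng.random() < mutation_rate else bit
                         for bit in child]
                next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

# One-max: fitness is simply the number of 1 bits in the genotype.
best = genetic_search(sum)
```

The fitness function is the only problem-specific part; everything else is the generic selection-crossover-mutation cycle.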
• Assertions
Each a predicate calculus sentence with no variables and no connectives other
than not.
• Goal
Either a sentence with no variables and no connectives, in which case the goal is
to prove that it is true (inferrable from the facts and the rules), or an existentially
quantified sentence whose variables are to be assigned values (if possible) given
the facts and the rules.
• Rules
Each a universally quantified implication with a conjunction of sentences as
antecedent and a single sentence as consequent.
An example
• Assertions
o A ball is on a block
(on ball1 block1)
o A pyramid is above the ball.
(above pyramid1 ball1)
• Goals
o Is the pyramid above the block?
(above pyramid1 block1)
o What's above the block?
(above ?x block1)
• Rules
(In Homework 2 you will be doing a restricted version of forward chaining. Because each
rule has only one conjunct in its antecedent, you can iterate through the assertions rather
than the rules to attempt to find new assertions (extend the current state).)
Forward chaining
Attempt to match the antecedents of rules with the assertions, adding new assertions
based on the consequents if this is possible, until an assertion matching the goal is added.
• The antecedent of the first rule matches (on ball1 block1), so we can assert
(above ball1 block1).
• The antecedent of the third rule matches (above ball1 block1) and (above
pyramid1 ball1), so we can assert (above pyramid1 block1). This matches
the goal.
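A Python sketch of forward chaining with a simple pattern matcher. The two rules are reconstructed from the trace above (the notes do not list the rules explicitly, so their exact form is an assumption): `(on ?x ?y)` implies `(above ?x ?y)`, and `(above ?x ?y)` together with `(above ?z ?x)` implies `(above ?z ?y)`.

```python
def match(pattern, fact, bindings):
    """Match a pattern such as ('above', '?x', 'block1') against a
    fact, extending the variable bindings; None means failure."""
    if len(pattern) != len(fact):
        return None
    b = dict(bindings)
    for p, f in zip(pattern, fact):
        if p.startswith('?'):                  # a variable
            if p in b and b[p] != f:
                return None
            b[p] = f
        elif p != f:                           # constants must be equal
            return None
    return b

def match_all(antecedents, facts, bindings):
    """Yield every binding set that matches all the antecedents."""
    if not antecedents:
        yield bindings
        return
    for fact in facts:
        b = match(antecedents[0], fact, bindings)
        if b is not None:
            yield from match_all(antecedents[1:], facts, b)

def forward_chain(assertions, rules):
    """Add new assertions from rule consequents until nothing new
    can be inferred."""
    facts = set(assertions)
    while True:
        new = set()
        for antecedents, consequent in rules:
            for b in match_all(antecedents, facts, {}):
                new.add(tuple(b.get(p, p) for p in consequent))
        if new <= facts:
            return facts
        facts |= new

# Rules as (antecedents, consequent); variables start with '?'.
RULES = [
    ([('on', '?x', '?y')], ('above', '?x', '?y')),
    ([('above', '?x', '?y'), ('above', '?z', '?x')],
     ('above', '?z', '?y')),
]
```

On the example's assertions, `forward_chain` derives (above ball1 block1) and then, by the second rule, (above pyramid1 block1), matching the trace above.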
Backward chaining
Attempt to match the consequents of rules with a goal, replacing the goal with new goals
based on the antecedent of the rule if this is possible, until all of these goals match
assertions.
• The consequent of the second rule matches the goal, so we can replace the goal
with (above ?x block1)
• The consequent of the first rule matches the goal, so we can replace the goal with
(on ?x block1). This goal matches the assertion (on ball1 block1) with the
variable binding ?x = ball1
More on representation
Frames
Much of human knowledge seems to be organized in chunks representing types of events:
frames or schemas.
Objects
Consider what we know about CABBAGE: what it looks like, how it tastes, how it's
prepared, how nutritious it is, how much it costs, what other vegetables and plants it's
related to. Within the CABBAGE frame, there is knowledge about what CABBAGE has.
Since CABBAGE is (probably) a basic-level category, there is a lot of knowledge in its
frame.
But some knowledge about CABBAGE is shared with other vegetables, and some
knowledge about vegetables is shared with other food items. Also there are subtypes of
CABBAGE such as RED-CABBAGE. Knowledge seems to be organized in an
inheritance (is-a) hierarchy, one sort of ontology. Categories also have default
properties that can be overridden by subcategories or instances.
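Inheritance with defaults can be sketched as a dictionary of frames and a lookup that climbs is-a links; the frames here are a cut-down, hypothetical version of the CABBAGE hierarchy shown below, and nearer values override inherited defaults:

```python
FRAMES = {
    'PHYS-OBJ':       {'is-a': []},
    'FOOD-ITEM':      {'is-a': ['PHYS-OBJ'], 'edibility': 'YES'},
    'VEGETABLE':      {'is-a': ['FOOD-ITEM'], 'fat-content': 'LOW'},
    'LEAF-VEGETABLE': {'is-a': ['VEGETABLE'], 'color': 'GREEN'},
    'CABBAGE':        {'is-a': ['LEAF-VEGETABLE'], 'shape': 'SPHERICAL'},
    'RED-CABBAGE':    {'is-a': ['CABBAGE'], 'color': 'PURPLE'},
}

def get_slot(frame, slot):
    """Look up a slot value, climbing is-a links until a value is
    found; a value on a nearer frame overrides (is a more specific
    default than) values on its ancestors."""
    value = FRAMES[frame].get(slot)
    if value is not None:
        return value
    for parent in FRAMES[frame]['is-a']:     # handles multiple parents
        value = get_slot(parent, slot)
        if value is not None:
            return value
    return None
```

So CABBAGE inherits its GREEN color from LEAF-VEGETABLE, while RED-CABBAGE overrides it with PURPLE but still inherits edibility from FOOD-ITEM.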
Events
Though we don't have an English verb for it, there is a more abstract event category that
includes GIVE, STEAL, TAKE, and RECEIVE. And there are subtypes of GIVE such as
DONATE and LEND.
Using frames
Frame representation
(PHYS-OBJ
(is-a THING)
(color)
(weight)
(shape)
(edibility)
...)
(FOOD-ITEM
(is-a PHYS-OBJ)
(nutritional-value)
(fat-content)
(starch-content)
(vitamin-content)
(source)
(taste)
(preparation
(processing)
(cooking)
(accompanying-ingredients)
(serving))
(availability-form)
(edibility YES)
(english-lex
(neutral "FOOD")
(informal "GRUB"))
(japanese-lex "TABEMONO")
...)
(VEGETABLE
(is-a FOOD-ITEM)
(plant-part)
(plant-type)
(source PLANT)
(nutritional-value HIGH)
(fat-content LOW)
(vitamin-content HIGH)
(english-lex
(neutral "VEGETABLE")
(informal "VEGGIE"))
...)
(LEAF-VEGETABLE
(is-a VEGETABLE)
(plant-part LEAF)
(color GREEN)
(taste BITTER)
...)
(COLE-VEGETABLE
(is-a VEGETABLE)
...)
(CABBAGE
(is-a LEAF-VEGETABLE)
(is-a COLE-VEGETABLE)
(plant-type CABBAGE-PLANT)
(taste CABBAGE-TASTE)
(availability-form VEGETABLE-HEAD)
(shape SPHERICAL)
...)
(RED-CABBAGE
(is-a CABBAGE)
(color PURPLE)
(english-lex "RED CABBAGE")
...)
(RED-CABBAGE23
(is-a RED-CABBAGE)
(preparation
(accompanying-ingredients
{MAYONNAISE, MUSTARD, TARRAGON})
(cooking NIL)))
(ABSTRACT-TRANSFER
(source ?s)
(destination ?d)
(object ?o)
(precondition1
(is-a CONTROL)
(controller ?s)
(object ?o))
(effect1
(is-a CONTROL)
(controller ?d)
(object ?o)))
(GIVE
(is-a ABSTRACT-TRANSFER)
(source ?s)
(destination ?d)
(object ?o)
(effect1 ?e)
(precondition2
(is-a WANT)
(wanter ?s)
(wanted ?e)))
(GIVE23
(is-a GIVE)
(source AL)
(destination SUE)
(object
(is-a DVD)))
A more elaborate style of frame representation from the CHEF program (Hammond,
1989)
Semantic networks
Knowledge in the form of a graph. A single node corresponds to a frame symbol. Two
"styles":
Individuals, sets
Insects have six legs. Each leg is jointed. Ladybugs are insects. Sam is a ladybug. Sam's
right rear leg is broken.
Events
An example of a type 1 formalism
Inheritance in a semantic network
A fixed network of nodes (units) that can be active or not (or active to different degrees).
The unlabeled, weighted, modifiable links (connections) between the units represent the
tendency for pairs of units to be co-activated. Any representation is a pattern of
activation across the network.
Instantiation, inheritance
Instantiation does not involve the creation of any new hardware (because the size of the
network is fixed) but rather the activation of certain units and the strengthening or
weakening of some weights. There is no distinction in the network between individuals
and types.
Inheritance is in a sense automatic. When we activate a type, for example, on the basis of
an input word such as cabbage, the pattern represents features of the instance as well as
the type.
Mary wanted that camera really bad, so she went and bought a gun.
Phil went into a restaurant and sat down at a table. A waiter came over after a few
minutes, but he said he wasn't ready to order.
• Will the knowledge be in a usable form if it's not tied to perception and action?
• Is it possible to build in commonsense knowledge, or will it have to be learned?
This function dominates thinking about learning but may be less important
than the other two for most animals.
Induction
(supervised or reinforcement)
Feedforward networks
• Appropriate for problems with no interacting bottom-up and top-down effects;
pattern association
• Usually trained with error-driven learning, a form of supervised learning in
which the change in weights depends on the error, ultimately the difference
between the target and the actual output for each output unit
• Networks with no hidden units, for example, perceptrons
• Networks with hidden layers, usually trained with backpropagation
Perceptrons
Pattern association problems
• Given a representation of one kind of entity (the input), generate a representation
of another (the output).
• Both representations often take the form of patterns, that is, vectors of numbers.
In this case, the inputs and outputs are represented in a distributed way.
• Training (supervised): expose the learner to a number of input patterns and their
correct associated output patterns.
• Generalization: given an unfamiliar input pattern, respond with an appropriate
output pattern.
• Examples
o Perceptual input, category output (pattern classification)
o State (perceptual) input, action Q-value output
o Word input, meaning output
o Meaning input, word output
o English input, Spanish output
• Implementation in feedforward neural networks
y = δ(∑j=1..N wj xj + b)
δ(x) = 1 if x > 0, and 0 otherwise
Δw = ηx^p (when the target t^p is 1 but the output y^p is 0)
Δw = -ηx^p (when the target is 0 but the output is 1)
o The three cases (including no change when the output is correct) can all be
expressed with this general rule:
Δw = ηx^p(t^p - y^p)
o If the points in the category are broken up by points not in the category,
there is no way for the network to solve the problem.
• Two inputs
o The two weights and the bias define a line
w1x1 + w2x2 + b = 0
with slope -w1/w2 and x2-intercept -b/w2. Points on one side of the line turn
on the output unit; points on the other side turn it off. Because the
perceptron defines an inequality (two possibilities for each line), three
values are required rather than the two needed to define the line itself.
o The line defined by the weights and bias divides the input space into two
regions. A perceptron in 2-space can only learn to separate two sets of
points that are on either side of a line.
• In general, a perceptron can only solve a pattern classification problem if the sets
of patterns are linearly separable; that is, if in N-space there is a hyperplane of
N-1 dimensions which separates the two sets of points. N+1 values (N weights and
the trainable bias) are needed to specify the desired behavior because, in addition
to the hyperplane, we need to say on which side of the hyperplane points are in
the category.
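The perceptron rule can be sketched as follows; AND is linearly separable and is learned, while exclusive OR is not, so no weight setting can ever classify all four XOR patterns correctly (the learning rate and epoch count are illustrative choices):

```python
def train_perceptron(samples, eta=0.25, epochs=25):
    """Train a threshold unit with the perceptron rule
    delta-w = eta * x * (t - y); the bias is updated like a weight
    on a constant input of 1."""
    n = len(samples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, t in samples:
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            for i in range(n):
                w[i] += eta * x[i] * (t - y)
            b += eta * (t - y)
    return w, b

def output(w, b, x):
    """Threshold output unit: 1 if the weighted sum exceeds 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
```

On AND the weights settle on a separating line; on XOR the rule keeps cycling, since no line exists.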
• Problems which perceptrons can't solve
o Examples of non-linearly separable sets of patterns
Exclusive OR: (0, 0), (1, 1); (1, 0), (0, 1)
Connectivity
o Solving the problems with additional input dimensions, for example, for
exclusive OR an input unit that codes for whether the other two inputs are
the same
o Solving the problems with hidden units
Networks with hidden units with linear activation functions are
equivalent to networks without hidden units
Hidden units must have non-linear activation functions, for
example, a simple threshold function (like perceptron output units)
or sigmoidal or gaussian functions.
One possibility: provide "enough" hidden units connected by
random, hard-wired weights to the input units.
Another possibility: train the input-to-hidden weights. But how?
g(h) = 1 / (1 + e^(-h))
• A formal way to derive the delta rule for the more general case
o Gradient descent learning: learning by moving in the direction which
looks locally to be the best
o For supervised neural network learning, the best direction to move in
"weight space": for each weight, how the global error changes with respect
to that weight
o A global error function: for each pattern, the sum of the errors over all of
the output units:
E = ∑j ½ (tj - xj)²
o To descend the error gradient, each weight wji changes in proportion to
-∂E/∂wji. Using the chain rule, we can decompose this derivative into
factors that are easier to calculate:
∂E/∂xj = -(tj - xj)
∂xj/∂hj = g′(hj)
∂hj/∂wji = ∂(∑k xk wjk)/∂wji = xi
Backpropagation
• Problems that are not linearly separable can't be solved by a perceptron (or a
network learning with the delta rule).
• How hidden units can solve non-linearly separable problems.
Δwji = ηδjxi
o For output units: δj = g′(hj)(tj - xj)
o For hidden units (k indexes the units in the next highest layer):
δj = g′(hj) ∑k wkj δk
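A backpropagation sketch for exclusive OR with two sigmoid hidden units. The initial weights are arbitrary small values chosen to break symmetry; from an unlucky start backpropagation can settle in a local minimum, so the sketch only claims to reduce the error, not to solve XOR on every run:

```python
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def train_xor(eta=0.5, epochs=8000):
    """Backpropagation: 2 inputs -> 2 sigmoid hidden units -> 1
    sigmoid output, trained online on exclusive OR."""
    w_h = [[0.3, -0.4], [-0.2, 0.25]]   # input-to-hidden weights
    b_h = [0.1, -0.1]                   # hidden biases
    w_o = [0.2, -0.3]                   # hidden-to-output weights
    b_o = 0.05

    def forward(x):
        h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
             for ws, b in zip(w_h, b_h)]
        y = sigmoid(sum(w * hj for w, hj in zip(w_o, h)) + b_o)
        return h, y

    for _ in range(epochs):
        for x, t in DATA:
            h, y = forward(x)
            # Output-unit delta: g'(net)(t - y), with g' = y(1 - y).
            d_o = y * (1 - y) * (t - y)
            # Hidden-unit deltas: backpropagate d_o through w_o.
            d_h = [h[j] * (1 - h[j]) * w_o[j] * d_o for j in range(2)]
            for j in range(2):
                w_o[j] += eta * d_o * h[j]
                b_h[j] += eta * d_h[j]
                for i in range(2):
                    w_h[j][i] += eta * d_h[j] * x[i]
            b_o += eta * d_o
    return lambda x: forward(x)[1]
```

Calling `train_xor(epochs=0)` returns the untrained network, which makes it easy to compare the error before and after training.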
Unsupervised learning
Auto-association and content-addressable memories
• Auto-association: one form of unsupervised learning
o Patterns are associated with themselves
o Purposes: dimensionality reduction, data compression, pattern completion
o Implementation: Hopfield nets (pattern completion only), other constraint
satisfaction (settling) networks with hidden layers, feedforward nets (can
be trained with backpropagation)
o Content-addressable memories
Desired behavior:
When part of a familiar pattern enters the memory system, the
system fills in the missing parts (recall).
When a familiar pattern enters the memory system, the response is
a stronger version of the input (recognition).
When an unfamiliar pattern enters the memory system, it is
dampened (unfamiliarity).
When a pattern similar to a stored pattern enters the memory
system, the response is a version of the input distorted toward the
stored pattern (assimilation).
When a number of similar patterns have been stored, the system
will respond to the central tendency of the stored patterns, even if
the central tendency itself never appeared (prototype effects).
o Discrete Hopfield networks
Basic properties
CAM
Potentially completely recurrent
Symmetric weights
Activation rule (θ a threshold, sgn(): 1 if its argument is
positive, -1 otherwise):
xi = sgn(∑j wij xj - θi)
Energy function, minimized as the network settles:
E = -½ ∑i ∑j wij xi xj
or equivalently, up to a constant C, summing over distinct pairs (ij):
E = C - ∑(ij) wij xi xj
Learning
Storing Q memories in a Hopfield net with the Hebb rule (q indexes the
patterns): wij = (1/N) ∑q xi^q xj^q
Capacity of a network
Crosstalk between patterns
The number of random patterns storable is proportional to N (roughly
0.14N) if a small percentage of errors is tolerated, but this capacity
is quite small.
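A minimal discrete Hopfield sketch, with Hebbian storage and the sgn activation rule (threshold 0 here; the 1/n scaling and the fixed number of settling sweeps are common conventions, not the only choices):

```python
def store(patterns):
    """Hebbian storage: w[i][j] is proportional to the sum over the
    stored patterns of x_i * x_j, with no self-connections; the
    resulting weight matrix is symmetric."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w

def recall(w, x, steps=5):
    """Settle toward a stored pattern by repeatedly applying the sgn
    activation rule to each unit: pattern completion (CAM)."""
    x = list(x)
    for _ in range(steps):
        for i in range(len(x)):
            net = sum(w[i][j] * x[j] for j in range(len(x)))
            x[i] = 1 if net > 0 else -1
    return x
```

Presenting a pattern with one unit corrupted, the network fills in the missing part, which is exactly the content-addressable-memory behavior described above.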
Competitive learning
• What it is
o Winner-take-all output units compete to classify input patterns, only one
(roughly) coming on at a time
o Clustering unlabelled data
o Categorization, vector quantization
• Simple, single-layer competitive learning
o Binary output units fully connected to (usually binary) input units by non-
negative weights
o Only one output unit fires at a time, the one whose input weight vector is
closest to the input vector (i* is the winning unit):
|wi* - x| ≤ |wi - x|
(for all i)
For normalized weights, the winner is always the one with highest input
(dot product of input pattern and weight vector):
wi* ⋅ x ≥ wi ⋅ x
(for all i)
Because losers are not activated, the rule is equivalent to (yi is the
activation of the ith category unit): Δwi = ηyi(x - wi)
Geometric analogy
o Problem of "dead units", units which start out far away from input patterns
and never win
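Simple competitive learning can be sketched as follows: only the winner, the unit whose weight vector is closest to the input, moves toward the input by Δw = η(x - w). The data and the initial prototypes are illustrative; the prototypes are placed near the data to avoid dead units:

```python
import math

def closest(weights, x):
    """Index of the weight vector closest (Euclidean) to x: the winner."""
    dists = [math.dist(w, x) for w in weights]
    return dists.index(min(dists))

def competitive_learn(data, weights, eta=0.3, epochs=20):
    """Move only the winning unit's weight vector toward each input:
    delta-w = eta * (x - w). The units end up clustering the
    unlabelled data."""
    weights = [list(w) for w in weights]
    for _ in range(epochs):
        for x in data:
            i = closest(weights, x)
            for d in range(len(x)):
                weights[i][d] += eta * (x[d] - weights[i][d])
    return weights
```

After training, each weight vector sits near the center of one cluster, so `closest` classifies new inputs by cluster.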
• Feature maps
o Networks in which location of output unit conveys information
o Output units have fixed positions in one-, two-, or three-dimensional grids
o Topology preserving map from the space of possible inputs to the line,
plane, or cube of the output units
A mapping that preserves neighborhood relations.
As two input patterns get closer in input space, the winning output
units get closer in output space.
o Self-organizing feature maps (Kohonen nets), one type of feature map
architecture
The neighborhood relations in the output array are built into the
learning rule. Weights into many (sometimes all) units are changed
on each update (there may be no dead units).
Winning output unit i*:
|wi* - x| ≤ |wi - x|
(for all i)
Learning rule: Δwi = ηΛ(i, i*)(x - wi), where the neighborhood
function Λ falls off with the distance |ri - ri*| in the output
array (the r vectors are the coordinates of the units in the
output space).
Reinforcement learning
Markov decision processes
• The agent and the environment (the world)
• Discrete time
• States
At each time step, the agent's sensory/perceptual system returns a state, xt, a
representation of its current situation in the environment, which may be errorful
and normally misses many aspects of the "real" situation.
• Actions
At each time step, the agent has the option of executing one of a finite set of
possible actions, ut, each of which potentially puts it in a new state.
• Reinforcements: rewards and punishments
In response to the agent's action in a particular state, the world provides a
reinforcement.
• The reinforcement function in the world: r(x,u)
• The next-state function in the world: s(x,u)
• An example
• But this bases too much on a single instance of reinforcement. We need to learn in
smaller steps.
• But so far the algorithm can only learn in response to immediate reinforcement.
What about delayed reinforcement?
Q learning
• The real value of an action in a state (optimal Q) depends not only on immediate
reinforcement but also on reinforcements that can be received later as a result of
the next state the agent gets to.
• The value (estimate Q) that an agent stores for each state-action pair should
reflect how much reinforcement it will receive immediately and in the future if it
takes that action in that state.
• Policy: a way of using the stored Q values to select actions.
• More precisely, an optimal Q value for a given state and action is the (discounted)
sum of all reinforcements received if that action is taken in the state and the agent
then follows the optimal policy specified by the other Q values. A first definition:
Q(x,u) = r(x,u) + γ maxu′ Q(s(x,u), u′)
• To approach optimal Q values, the learner starts with 0 or random values for each
state-action pair, then updates the values gradually using the reinforcement
received and what it thinks is the best Q value for the next state.
An example
γ = 0.8, η = 0.5 and all Q values initialized at 0. In the chart, "new" means the
reinforcement received plus the discounted maximum value of the next state. The "new"
value is combined with the "old" using the learning rate to give the updated Q value
appearing in the next line of the chart. (Note: in this example, in order to illustrate how
the agent can learn to "look ahead", it is effectively picked up after it reaches the goal
state and dropped back in state 1. There is no "natural" way of reaching state 1 from state
4.)
Q values (one column per state-action pair)

 x | (1,r) (2,r) (2,l) (3,r) (3,l) (4,l) | new | u
 1 |   0     0     0     0     0     0   |  0  | r
 2 |   0     0     0     0     0     0   |  0  | r
 3 |   0     0     0     0     0     0   |  1  | r
 4 |   0     0     0    .5     0     0   |  0  | l
 1 |   0     0     0    .5     0     0   |  0  | r
 2 |   0     0     0    .5     0     0   | .4  | r
 3 |   0    .2     0    .5     0     0   |  1  | r
 4 |   0    .2     0    .75    0     0   |  0  | l
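The chart can be reproduced with a short Q-learning sketch. The corridor's transitions, including the artificial reset from state 4 back to state 1 noted above, are hard-coded from the example:

```python
GAMMA, ETA = 0.8, 0.5

def q_update(q, x, u, r, x_next):
    """One Q-learning step: 'new' = r + gamma * max Q(next state),
    mixed with the old value using the learning rate."""
    new = r + GAMMA * max(q[x_next].values())
    q[x][u] = (1 - ETA) * q[x][u] + ETA * new
    return new

# States 1..4 in a corridor; action r moves right, l moves left.
# Reaching state 4 (the goal) from state 3 yields reinforcement 1;
# from state 4 the agent is artificially dropped back in state 1.
q = {1: {'r': 0.0}, 2: {'r': 0.0, 'l': 0.0},
     3: {'r': 0.0, 'l': 0.0}, 4: {'l': 0.0}}
episode = [(1, 'r', 0, 2), (2, 'r', 0, 3), (3, 'r', 1, 4), (4, 'l', 0, 1)]
for _ in range(2):                      # two passes, as in the chart
    for x, u, r, x_next in episode:
        q_update(q, x, u, r, x_next)
```

After the two passes, Q(3,r) has reached .75 and Q(2,r) has reached .2, matching the last row of the chart: the value of the goal-reaching action has begun to propagate backward.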
Making decisions
o How is the agent to pick an action? One possibility is exploitation, to pick
the action that has the highest Q-value for the current state.
o But the agent can only learn about the value of actions that it tries. Thus it
should try a variety of actions. This may involve ignoring what it thinks
is best some of the time: exploration.
o Exploration makes more sense early on in learning when the agent doesn't
know much.
o One possibility for selecting an action: pick the "best" action with
probability P = 1 - e^(-Ea), where a is the number of training samples (the
"age" of the agent). With E = 0.1, for example, the probability of selecting
the "best" action rises toward 1 as the agent ages.
Implementing Q learning
• A lookup table: a Q value for each state-action pair
• But in the real world the number of states may be very large, even infinite.
Distributed representations of states permit
o More efficient coding
o Generalization to novel states
• A neural network
o Inputs are distributed representations of states.
o Outputs are Q values for each action (represented locally).
o Weights represent associations of state features with actions.
o Error-driven learning: for the selected action, the target is the "new" Q
value from the Q learning rule.
Concept learning
Concepts (categories)
• Features
o Sufficient features and decision trees
o Necessary features
o Typical features
o Prohibited features
• Which features are relevant for concepts (of particular types)?
• Where does the set of features come from?
• One-shot learning
• Quine's problem
o A negative example
Object 1 is to the left of object 2.
Object 1 is a rectangle.
Object 2 must not be yellow.
o In our example the information content of a tree that classifies the data is
o For the "scales" branch below this decision point, the three remaining
attributes classify the four examples as follows
o habitat: land(R: 1), water(F: 3)
o flies?: yes(F: 1), no(F: 2, R: 1)
o breathes air?: yes(F: 1, R: 1), no(F: 2)
o For each of these, the information content remaining after the decision is:
o habitat: (0.1)(0) + (0.3)(0) = 0
o flies?: (0.1)(0) + (0.3)(0.918) = 0.275
o breathes air?: (0.2)(1) + (0.2)(0) = 0.2
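These numbers can be checked with a short entropy computation. As the weights above imply, the full dataset is assumed to have 10 examples, of which the four on this branch are a part; the classes are F and R:

```python
import math

def entropy(counts):
    """Information content -sum p log2 p of a class distribution."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def remaining_info(branches, total):
    """Weighted information left after a decision: each branch is a
    list of per-class counts, weighted by its share of all `total`
    examples in the dataset."""
    return sum(sum(b) / total * entropy(b) for b in branches)

# The "scales" branch: 4 of the 10 examples; counts are (F, R).
habitat  = [[0, 1], [3, 0]]        # land(R: 1), water(F: 3)
flies    = [[1, 0], [2, 1]]        # yes(F: 1), no(F: 2, R: 1)
breathes = [[1, 1], [2, 0]]        # yes(F: 1, R: 1), no(F: 2)
```

Habitat leaves no remaining information (both branches are pure), so it is the best attribute to test at this point.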
Some implications of AI
Philosophy and cognitive science
• Is intelligence something that can be defined and studied abstractly, independently
from the details for the intelligent agent? Does intelligence require a body?
• Are there different intelligences (logics?), each with its own advantages for a
given environment, that could collaborate in solving problems?
• Where does the mind stop and the "outside world" begin?
• Are there fundamental differences between human and "animal" intelligence, or
are all of the differences just a matter of degree?
• How much of human intelligence is cultural, as opposed to genetic?
Society
• How can AI give certain groups of people (more) power over other groups of
people?
• Who benefits from applied AI (medicine, law, design, education, information
retrieval, natural language processing, stock market, marketing, transportation,
entertainment, agriculture, military)?
• Who funds AI research?
• How might AI be used to further
o free, independent media; equal access to information; a better informed
public?
o a public capable of making rational economic and political decisions?
o international (intercultural) understanding (tolerance)?
o equal access to (or equitable distribution of) the world's resources?
o protection of the environment?
In more practical terms neural networks are non-linear statistical data modeling tools.
They can be used to model complex relationships between inputs and outputs or to find
patterns in data.
More complex neural networks are often used in Parallel Distributed Processing.
Background
There is no precise agreed definition among researchers as to what a neural network is,
but most would agree that it involves a network of simple processing elements (neurons)
which can exhibit complex global behavior, determined by the connections between the
processing elements and element parameters. The original inspiration for the technique
was from examination of the central nervous system and the neurons (and their axons,
dendrites and synapses) which constitute one of its most significant information
processing elements (see Neuroscience). In a neural network model, simple nodes (called
variously "neurons", "neurodes", "PEs" ("processing elements") or "units") are connected
together to form a network of nodes — hence the term "neural network." While a neural
network does not have to be adaptive per se, its practical use comes with algorithms
designed to alter the strength (weights) of the connections in the network to produce a
desired signal flow.
These networks are also similar to the biological neural networks in the sense that
functions are performed collectively and in parallel by the units, rather than there being a
clear delineation of subtasks to which various units are assigned (see also
connectionism). Currently, the term Artificial Neural Network (ANN) tends to refer
mostly to neural network models employed in statistics, cognitive psychology and
artificial intelligence. Neural network models designed with emulation of the central
nervous system (CNS) in mind are a subject of theoretical neuroscience.
Models
Neural network models in artificial intelligence are usually referred to as artificial neural
networks (ANNs); these are essentially simple mathematical models defining a function
f : X → Y. Each type of ANN model corresponds to a class of such functions.
The word network in the term 'artificial neural network' arises because the function f(x) is
defined as a composition of other functions gi(x), which can further be defined as a
composition of other functions. This can be conveniently represented as a network
structure, with arrows depicting the dependencies between variables. A widely used type
of composition is the nonlinear weighted sum, f(x) = K(∑i wi gi(x)), where K (commonly
referred to as the activation function) is some predefined nonlinear function, such as the
hyperbolic tangent.
The first view is the functional view: the input x is transformed into a 3-dimensional
vector h, which is then transformed into a 2-dimensional vector g, which is finally
transformed into f. This view is most commonly encountered in the context of
optimization.
The second view is the probabilistic view: the random variable F = f(G) depends upon the
random variable G = g(H), which depends upon H = h(X), which depends upon the
random variable X. This view is most commonly encountered in the context of graphical
models.
The two views are largely equivalent. In either case, for this particular network
architecture, the components of individual layers are independent of each other (e.g., the
components of g are independent of each other given their input h). This naturally enables
a degree of parallelism in the implementation.
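The layered composition just described can be sketched directly in code. Everything concrete here (a 2-dimensional input, tanh as the activation, random weights, zero biases) is an illustrative assumption rather than anything fixed by the text:

```python
import math, random

random.seed(0)

def layer(weights, biases, x):
    # One fully connected layer: each output component is tanh(w . x + b).
    return [math.tanh(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, biases)]

# x (2-d) -> h (3-d) -> g (2-d) -> f (1-d), mirroring the functional view above.
W_h = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
W_g = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
W_f = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(1)]
b_h, b_g, b_f = [0.0] * 3, [0.0] * 2, [0.0] * 1

def f(x):
    h = layer(W_h, b_h, x)       # 3-dimensional vector h
    g = layer(W_g, b_g, h)       # 2-dimensional vector g
    return layer(W_f, b_f, g)    # final output f

print(f([0.5, -0.3]))
```

Each call to `layer` computes one collection of functions gi as weighted sums passed through the activation, so f is literally a composition of such functions.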
Networks such as the previous one are commonly called feedforward, because their graph
is a directed acyclic graph. Networks with cycles are commonly called recurrent. Such
networks are commonly depicted in the manner shown at the top of the figure, where f is
shown as being dependent upon itself. However, there is an implied temporal dependence
which is not shown. What this actually means in practice is that the value of f at some
point in time t depends upon the values of f at zero or at one or more other points in time.
The graphical model at the bottom of the figure illustrates the case: the value of f at time t
only depends upon its last value. Models such as these, which have no dependencies in
the future, are called causal models.
See also: graphical models
Learning
However interesting such functions may be in themselves, what has attracted the most
interest in neural networks is the possibility of learning, which in practice means the
following:
Given a specific task to solve, and a class of functions F, learning means using a set of
observations in order to find f* ∈ F which solves the task in an optimal sense.
This entails defining a cost function C : F → R such that, for the optimal solution f*,
C(f*) ≤ C(f) for all f ∈ F (no solution has a cost less than the cost of the optimal
solution).
For applications where the solution is dependent on some data, the cost must necessarily
be a function of the observations, otherwise we would not be modelling anything related
to the data. It is frequently defined as a statistic to which only approximations can be
made. As a simple example, consider the problem of finding the model f which minimizes
C = E[(f(x) − y)²], for data pairs (x, y) drawn from some distribution D. In practical
situations we would only have N samples from D and thus would only minimize the
empirical cost (1/N) Σ (f(xi) − yi)²; the cost is minimized over a sample of the data
rather than the entire distribution. When N → ∞, some form of online learning must be
used, where the cost is partially minimized as each new example is seen. While online
learning is often used when D is fixed, it is most useful in the case where the
distribution changes slowly over time. In neural network methods, some form of online
learning is frequently also used for finite datasets.
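A minimal sketch of online learning as described above: the squared-error cost is partially minimised on each example as it arrives. The linear model, learning rate, and the assumed data stream y = 2x + 1 are illustrative choices, not anything fixed by the text:

```python
import random

random.seed(1)

# Online learning sketch: one stochastic gradient step per incoming example,
# partially minimising the cost E[(f(x) - y)^2].
w, b = 0.0, 0.0          # model f(x) = w*x + b
rate = 0.05

for _ in range(2000):
    x = random.uniform(-1, 1)        # examples arrive one at a time
    y = 2.0 * x + 1.0                # assumed data-generating process
    err = (w * x + b) - y
    w -= rate * 2 * err * x          # d/dw of the squared error
    b -= rate * 2 * err              # d/db of the squared error

print(round(w, 2), round(b, 2))
```

After enough examples the parameters settle near the generating values, without the cost ever being computed over the whole distribution at once.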
While it is possible to arbitrarily define some ad hoc cost function, frequently a particular
cost will be used either because it has desirable properties (such as convexity) or because
it arises naturally from a particular formulation of the problem (e.g., in a probabilistic
formulation the posterior probability of the model can be used as an inverse cost).
Ultimately, the cost function will depend on the task we wish to perform. The three main
categories of learning tasks are overviewed below.
There are three major learning paradigms, each corresponding to a particular abstract
learning task. These are supervised learning, unsupervised learning and reinforcement
learning. Usually any given type of network architecture can be employed in any of those
tasks.
In supervised learning, we are given a set of example pairs (x, y) and the aim is to find
a function f that matches the examples. A commonly used cost is the mean-squared error,
which tries to minimise the average squared error between the network's output, f(x), and
the target value y over all the example pairs.
When one tries to minimise this cost using gradient descent for the class of neural
networks called Multi-Layer Perceptrons, one obtains the well-known backpropagation
algorithm for training neural networks.
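The combination just described (mean-squared error plus gradient descent on a multi-layer perceptron) can be sketched as a toy 2-2-1 network trained by backpropagation on XOR. The architecture, learning rate, and epoch count are illustrative assumptions, and such training can occasionally stall in a local minimum, so the sketch only checks that the error goes down:

```python
import math, random

random.seed(0)

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny multi-layer perceptron: 2 inputs, 2 hidden units, 1 output.
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
b1 = [random.uniform(-1, 1) for _ in range(2)]
W2 = [random.uniform(-1, 1) for _ in range(2)]
b2 = random.uniform(-1, 1)
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

def forward(x):
    h = [sig(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(2)]
    return h, sig(W2[0] * h[0] + W2[1] * h[1] + b2)

def total_error():
    return sum((forward(x)[1] - t) ** 2 for x, t in data)

before = total_error()
for _ in range(5000):
    for x, t in data:
        h, y = forward(x)
        dy = (y - t) * y * (1 - y)                 # error signal at the output
        dh = [dy * W2[j] * h[j] * (1 - h[j]) for j in range(2)]
        for j in range(2):                         # gradient-descent updates
            W2[j] -= 0.5 * dy * h[j]
            W1[j][0] -= 0.5 * dh[j] * x[0]
            W1[j][1] -= 0.5 * dh[j] * x[1]
            b1[j] -= 0.5 * dh[j]
        b2 -= 0.5 * dy

print(round(before, 3), round(total_error(), 3))
```

The backward pass is exactly the chain rule: the derivative of the cost is propagated from the output through the hidden layer, which is why differentiable activations are required.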
Tasks that fall within the paradigm of supervised learning are pattern recognition (also
known as classification) and regression (also known as function approximation). The
supervised learning paradigm is also applicable to sequential data (e.g., for speech and
gesture recognition). This can be thought of as learning with a "teacher," in the form of a
function that provides continuous feedback on the quality of solutions obtained thus far.
In unsupervised learning we are given some data x, and the cost function to be minimised
can be any function of the data x and the network's output, f.
The cost function is dependent on the task (what we are trying to model) and our a priori
assumptions (the implicit properties of our model, its parameters and the observed
variables).
As a trivial example, consider the model f(x) = a, where a is a constant, and the cost
C = (E[x] − f(x))². Minimising this cost will give us a value of a that is equal to the mean of
the data. The cost function can be much more complicated. Its form depends on the
application: For example in compression it could be related to the mutual information
between x and y. In statistical modelling, it could be related to the posterior probability of
the model given the data. (Note that in both of those examples those quantities would be
maximised rather than minimised)
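The trivial example above can be checked numerically: gradient descent on C = (E[x] − a)² drives a to the mean of the data. The dataset and step size are arbitrary choices:

```python
data = [2.0, 4.0, 6.0, 8.0]
mean = sum(data) / len(data)

# Gradient descent on C(a) = (E[x] - a)^2, using dC/da = -2 (E[x] - a).
a = 0.0
for _ in range(100):
    a += 0.1 * 2 * (mean - a)

print(a)  # approaches the mean of the data, 5.0
```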
Tasks that fall within the paradigm of unsupervised learning are in general estimation
problems; the applications include clustering, the estimation of statistical distributions,
compression and filtering.
In reinforcement learning, data are usually not given but generated by an agent's
interactions with the environment: at each point in time the agent performs an action and
the environment generates an observation and an instantaneous cost, according to some
(usually unknown) dynamics. More formally, the environment is modeled as a Markov
decision process (MDP) with
states s ∈ S and actions a ∈ A with the following probability
distributions: the instantaneous cost distribution P(ct | st), the observation distribution
P(xt | st) and the transition P(st+1 | st, at), while a policy is defined as the conditional
distribution over actions given the observations. Taken together, the two define a Markov
chain (MC).
The aim is to discover the policy that minimises the cost, i.e. the MC for which the cost is
minimal.
ANNs are frequently used in reinforcement learning as part of the overall algorithm.
Tasks that fall within the paradigm of reinforcement learning are control problems, games
and other sequential decision making tasks.
Training a neural network model essentially means selecting one model from the set of
allowed models (or, in a Bayesian framework, determining a distribution over the set of
allowed models) that minimises the cost criterion. There are numerous algorithms
available for training neural network models; most of them can be viewed as a
straightforward application of optimization theory and statistical estimation.
Most of the algorithms used in training artificial neural networks employ some
form of gradient descent. This is done by simply taking the derivative of the cost function
with respect to the network parameters and then changing those parameters in a gradient-
related direction.
• Choice of model: This will depend on the data representation and the application.
Overly complex models tend to lead to problems with learning.
• Learning algorithm: There are numerous tradeoffs between learning algorithms.
Almost any algorithm will work well with the correct hyperparameters for
training on a particular fixed dataset. However selecting and tuning an algorithm
for training on unseen data requires a significant amount of experimentation.
• Robustness: If the model, cost function and learning algorithm are selected
appropriately the resulting ANN can be extremely robust.
With the correct implementation ANNs can be used naturally in online learning and large
dataset applications. Their simple implementation and the existence of mostly local
dependencies exhibited in the structure allows for fast, parallel implementations in
hardware.
Applications
The utility of artificial neural network models lies in the fact that they can be used to
infer a function from observations. This is particularly useful in applications where the
complexity of the data or task makes the design of such a function by hand impractical.
The tasks to which artificial neural networks are applied tend to fall within the following
broad categories:
Application areas include system identification and control (vehicle control, process
control), game-playing and decision making (backgammon, chess, racing), pattern
recognition (radar systems, face identification, object recognition and more), sequence
recognition (gesture, speech, handwritten text recognition), medical diagnosis, financial
applications, data mining (or knowledge discovery in databases, "KDD"), visualization
and e-mail spam filtering.
Neural network software is used to simulate, research, develop and apply artificial
neural networks, biological neural networks and in some cases a wider array of adaptive
systems.
Feedforward neural networks are the first and arguably simplest type of artificial
neural networks devised. In this network, the information moves in only one direction,
forward, from the input nodes, through the hidden nodes (if any) and to the output nodes.
There are no cycles or loops in the network.
The earliest kind of neural network is a single-layer perceptron network, which consists
of a single layer of output nodes; the inputs are fed directly to the outputs via a series of
weights. In this way it can be considered the simplest kind of feed-forward network. The
sum of the products of the weights and the inputs is calculated in each node, and if the
value is above some threshold (typically 0) the neuron fires and takes the activated value
(typically 1); otherwise it takes the deactivated value (typically -1). Neurons with this
kind of activation function are also called McCulloch-Pitts neurons or threshold neurons.
In the literature the term perceptron often refers to networks consisting of just one of
these units. They were described by Warren McCulloch and Walter Pitts in the 1940s.
A perceptron can be created using any values for the activated and deactivated states as
long as the threshold value lies between the two. Most perceptrons have outputs of 1 or -1
with a threshold of 0 and there is some evidence that such networks can be trained more
quickly than networks created from nodes with different activation and deactivation
values.
Perceptrons can be trained by a simple learning algorithm that is usually called the delta
rule. It calculates the errors between calculated output and sample output data, and uses
this to create an adjustment to the weights, thus implementing a form of gradient descent.
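The delta rule just described can be sketched for a single threshold unit on the linearly separable AND pattern, using the 1/−1 outputs and zero threshold mentioned earlier. The data encoding and learning rate are illustrative choices:

```python
# Perceptron learning (the delta rule described above) on the linearly
# separable AND pattern, with outputs 1/-1 and threshold 0.
def out(w, b, x):
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s > 0 else -1

data = [([-1, -1], -1), ([-1, 1], -1), ([1, -1], -1), ([1, 1], 1)]
w, b, rate = [0.0, 0.0], 0.0, 0.1

for _ in range(20):
    for x, target in data:
        err = target - out(w, b, x)          # 0 when correct, +/-2 when wrong
        w = [wi + rate * err * xi for wi, xi in zip(w, x)]
        b += rate * err                      # bias adjusted the same way

print([out(w, b, x) for x, _ in data])  # [-1, -1, -1, 1]
```

Because AND is linearly separable, the perceptron convergence theorem guarantees this loop settles on correct weights.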
Single-unit perceptrons are only capable of learning linearly separable patterns; in 1969
in a famous monograph entitled Perceptrons Marvin Minsky and Seymour Papert showed
that it was impossible for a single-layer perceptron network to learn an XOR function.
They conjectured (incorrectly) that a similar result would hold for a multi-layer
perceptron network. Although a single threshold unit is quite limited in its computational
power, it has been shown that networks of parallel threshold units can approximate any
continuous function from a compact interval of the real numbers into the interval [-1,1].
This result can be found in [Auer, Burgsteiner, Maass: The p-delta learning
rule for parallel perceptrons, 2001].
A single-layer neural network can compute a continuous output instead of a step function.
A common choice is the so-called logistic function:
y = 1 / (1 + e^(−x))
With this choice, the single-layer network is identical to the logistic regression model,
widely used in statistical modelling. The logistic function is also known as the sigmoid
function. It has a continuous derivative, which allows it to be used in backpropagation.
This function is also preferred because its derivative is easily calculated:
y' = y(1 − y)
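The convenient derivative y' = y(1 − y) can be checked against a numerical difference quotient; the sample points are arbitrary:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Compare the closed-form derivative y(1 - y) with a central difference.
for x in (-2.0, 0.0, 1.5):
    y = logistic(x)
    analytic = y * (1 - y)
    numeric = (logistic(x + 1e-6) - logistic(x - 1e-6)) / 2e-6
    print(round(analytic, 6), round(numeric, 6))
```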
Figure: a two-layer neural network capable of calculating XOR. The numbers within the
neurons represent each neuron's explicit threshold (which can be factored out so that all
neurons have the same threshold, usually 1). The numbers annotating the arrows represent
the weights of the inputs. This net assumes that if the threshold is not reached, zero
(not −1) is output. Note that the bottom layer of inputs is not always considered a real
neural network layer.
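The figure itself is not reproduced here, but one common assignment of weights and thresholds consistent with the caption (integer thresholds, output 0 when the threshold is not reached) can be sketched as:

```python
def step(s, threshold):
    # Threshold unit as in the caption: outputs 1 when the weighted sum
    # reaches the threshold, otherwise 0 (not -1).
    return 1 if s >= threshold else 0

def xor_net(x1, x2):
    h_or  = step(x1 + x2, 1)       # hidden unit: fires if at least one input is 1
    h_and = step(x1 + x2, 2)       # hidden unit: fires only if both inputs are 1
    return step(h_or - h_and, 1)   # output: OR but not AND, i.e. XOR

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```

The hidden layer computes OR and AND of the inputs, and the output unit subtracts the AND from the OR, which is exactly XOR.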
The universal approximation theorem for neural networks states that every continuous
function that maps intervals of real numbers to some output interval of real numbers can
be approximated arbitrarily closely by a multi-layer perceptron with just one hidden
layer. This result holds only for restricted classes of activation functions, e.g. for the
sigmoidal functions.
Multi-layer networks use a variety of learning techniques, the most popular being back-
propagation. Here the output values are compared with the correct answer to compute the
value of some predefined error-function. By various techniques the error is then fed back
through the network. Using this information, the algorithm adjusts the weights of each
connection in order to reduce the value of the error function by some small amount. After
repeating this process for a sufficiently large number of training cycles the network will
usually converge to some state where the error of the calculations is small. In this case
one says that the network has learned a certain target function. To adjust weights properly
one applies a general method for non-linear optimization that is called gradient descent.
For this, the derivative of the error function with respect to the network weights is
calculated and the weights are then changed such that the error decreases (thus going
downhill on the surface of the error function). For this reason back-propagation can only
be applied on networks with differentiable activation functions.
In general the problem of teaching a network to perform well, even on samples that were
not used as training samples, is a quite subtle issue that requires additional techniques.
This is especially important for cases where only very limited numbers of training
samples are available. The danger is that the network overfits the training data and fails to
capture the true statistical process generating the data. Computational learning theory is
concerned with training classifiers on a limited amount of data. In the context of neural
networks a simple heuristic, called early stopping, often ensures that the network will
generalize well to examples not in the training set.
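The early-stopping heuristic can be sketched as a training loop that watches a held-out validation error. The linear model and noisy data here are illustrative stand-ins for a real network and dataset, and the patience value is an assumed hyperparameter:

```python
import random

random.seed(2)

# Early stopping: train while validation error improves; stop after it has
# failed to improve `patience` epochs in a row, keeping the best model seen.
def target(x):
    return 3.0 * x + random.gauss(0, 0.1)   # assumed noisy data source

train = [(i / 10, target(i / 10)) for i in range(10)]
valid = [(i / 10 + 0.05, target(i / 10 + 0.05)) for i in range(10)]

def mse(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

w, best_w, best_err, bad, patience = 0.0, 0.0, float("inf"), 0, 3
for epoch in range(1000):
    for x, y in train:                       # one epoch of gradient descent
        w -= 0.1 * 2 * (w * x - y) * x
    err = mse(w, valid)                      # monitor generalisation
    if err < best_err:
        best_w, best_err, bad = w, err, 0
    else:
        bad += 1
        if bad >= patience:
            break                            # early stop

print(round(best_w, 2))
```

Returning the parameters with the lowest validation error, rather than the last ones, is the part that guards against overfitting.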
Other typical problems of the back-propagation algorithm are the speed of convergence
and the possibility of ending up in a local minimum of the error function. Today there are
practical solutions that make back-propagation in multi-layer perceptrons the solution of
choice for many machine learning tasks.
ADALINE
ADALINE (Adaptive Linear Neuron, later Adaptive Linear Element) was developed by
Professor Bernard Widrow and his graduate student Ted Hoff at Stanford University in
1960. It is based on the McCulloch-Pitts model and consists of a weight, a bias and a
summation function.
Operation: yi = wxi + b
Its adaptation is defined through a cost function (error metric) of the residual
e = di − (b + w xi), where di is the desired output. With the MSE error metric, the
optimal weight and bias are the ordinary least-squares estimates
w = Σ(xi − x̄)(di − d̄) / Σ(xi − x̄)² and b = d̄ − w x̄.
While the Adaline is thereby capable of simple linear regression, it has limited
practical use.
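With the MSE metric, the Adaline's optimal weight and bias reduce to the ordinary least-squares estimates; a quick check on an assumed toy dataset following d = 2x + 1:

```python
# ADALINE adaptation sketch: least-squares weight and bias for y = w*x + b.
xs = [0.0, 1.0, 2.0, 3.0]
ds = [1.0, 3.0, 5.0, 7.0]              # desired outputs: d = 2x + 1

mx = sum(xs) / len(xs)                 # mean of the inputs
md = sum(ds) / len(ds)                 # mean of the desired outputs
w = (sum((x - mx) * (d - md) for x, d in zip(xs, ds))
     / sum((x - mx) ** 2 for x in xs))
b = md - w * mx

print(w, b)  # recovers w = 2.0, b = 1.0
```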
There is an extension of the Adaline, called the Multiple Adaline (MADALINE) that
consists of two or more adalines serially connected.
Radial basis function (RBF) networks have the advantage of not suffering from local minima in the same way as
multi-layer perceptrons. This is because the only parameters that are adjusted in the
learning process are the linear mapping from hidden layer to output layer. Linearity
ensures that the error surface is quadratic and therefore has a single easily found
minimum. In regression problems this can be found in one matrix operation. In
classification problems the fixed non-linearity introduced by the sigmoid output function
is most efficiently dealt with using iteratively reweighted least squares.
RBF networks have the disadvantage of requiring good coverage of the input space by
radial basis functions. RBF centres are determined with reference to the distribution of
the input data, but without reference to the prediction task. As a result, representational
resources may be wasted on areas of the input space that are irrelevant to the learning
task. A common solution is to associate each data point with its own centre, although this
can make the linear system to be solved in the final layer rather large, requiring
shrinkage techniques to avoid overfitting.
Associating each input datum with an RBF leads naturally to kernel methods such as
Support Vector Machines and Gaussian Processes (the RBF is the kernel function). All
three approaches use a non-linear kernel function to project the input data into a space
where the learning problem can be solved using a linear model. Like Gaussian Processes,
and unlike SVMs, RBF networks are typically trained in a Maximum Likelihood
framework by maximizing the probability (minimizing the error) of the data under the
model. SVMs take a different approach to avoiding overfitting by maximizing instead a
margin. RBF networks are outperformed in most classification applications by SVMs. In
regression applications they can be competitive when the dimensionality of the input
space is relatively small.
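The one-matrix-operation regression fit described above can be sketched as follows: one Gaussian basis function per data point, with the output weights obtained by solving the resulting square linear system. The data, basis width, and the tiny Gaussian-elimination solver are illustrative choices:

```python
import math

# RBF network sketch: Gaussian bases centred on the data, linear output layer.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]            # assumed targets: y = x^2
width = 1.0

def phi(x, c):
    return math.exp(-((x - c) ** 2) / (2 * width ** 2))

# Design matrix: G[i][j] = response of basis j (centred on xs[j]) to xs[i].
G = [[phi(x, c) for c in xs] for x in xs]

def solve(A, b):
    # Plain Gaussian elimination with partial pivoting.
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

w = solve(G, ys)                     # output weights in one linear solve
fit = [sum(wj * phi(x, c) for wj, c in zip(w, xs)) for x in xs]
print([round(v, 3) for v in fit])    # interpolates the training targets
```

Because the hidden layer is fixed and the output layer is linear, the error surface in the weights is quadratic, which is exactly why a single linear solve suffices.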
The self-organizing map (SOM) invented by Teuvo Kohonen uses a form of unsupervised
learning. A set of artificial neurons learn to map points in an input space to coordinates in
an output space. The input space can have different dimensions and topology from the
output space, and the SOM will attempt to preserve these.
Contrary to feedforward networks, recurrent neural networks (RNNs) are models with bi-
directional data flow. While a feedforward network propagates data linearly from input to
output, RNNs also propagate data from later processing stages to earlier stages.
The Hopfield network is a recurrent neural network in which all connections are
symmetric. Invented by John Hopfield in 1982, this network guarantees that its dynamics
will converge. If the connections are trained using Hebbian learning then the Hopfield
network can perform as robust content-addressable memory, resistant to connection
alteration.
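A Hopfield network with Hebbian weights can be sketched in a few lines; the stored patterns and the corrupted probe are illustrative choices:

```python
# Hopfield sketch: symmetric Hebbian weights, repeated updates until the
# state settles on a stored pattern (content-addressable memory).
patterns = [[1, -1, 1, -1, 1, -1], [1, 1, 1, -1, -1, -1]]
n = len(patterns[0])

# Hebbian learning: w_ij = sum over patterns of p_i * p_j, no self-weights.
W = [[0 if i == j else sum(p[i] * p[j] for p in patterns)
      for j in range(n)] for i in range(n)]

def recall(state, sweeps=10):
    state = state[:]
    for _ in range(sweeps):
        for i in range(n):                       # asynchronous unit updates
            s = sum(W[i][j] * state[j] for j in range(n))
            state[i] = 1 if s >= 0 else -1
    return state

corrupted = [1, -1, 1, -1, 1, 1]     # first pattern with the last bit flipped
print(recall(corrupted))
```

The symmetric weights are what guarantee convergence: each update can only lower the network's energy, so the corrupted probe falls into the nearest stored attractor.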
The Echo State Network (ESN) is a recurrent neural network with a sparsely connected
random hidden layer. The weights of the output neurons are the only part of the network
that can change and be learned. ESNs are good at (re)producing temporal patterns.
A stochastic neural network differs from a regular neural network in the fact that it
introduces random variations into the network. In a probabilistic view of neural networks,
such random variations can be viewed as a form of statistical sampling, such as Monte
Carlo sampling.
Biological studies showed that the human brain functions not as a single massive
network, but as a collection of small networks. This realisation gave birth to the concept
of modular neural networks, in which several small networks cooperate or compete to
solve problems.
Committee of machines
A committee of machines (CoM) is a collection of neural networks that together vote on a
given example. The CoM is similar to the general machine learning bagging method, except that the
necessary variety of machines in the committee is obtained by training from different
random starting weights rather than training on different randomly selected subsets of the
training data.
Spiking neural networks (SNNs) are models which explicitly take into account the timing
of inputs. The network input and output are usually represented as series of spikes (delta
function or more complex shapes). SNNs have the advantage of being able to
continuously process information. They are often implemented as recurrent networks.
Networks of spiking neurons -- and the temporal correlations of neural assemblies in such
networks -- have been used to model figure/ground separation and region linking in the
visual system (see e.g. Reitboeck et al. in Haken and Stadler: Synergetics of the Brain.
Berlin, 1989).
Gerstner and Kistler have a freely-available online textbook on Spiking Neuron Models.
Spiking neural networks with axonal conduction delays exhibit polychronisation, and
hence could have a potentially unlimited memory capacity.
Dynamic neural networks not only deal with nonlinear multivariate behaviour, but also
include (learning of) time-dependent behaviour such as various transient phenomena and
delay effects. Meijer has a Ph.D. thesis online where regular feedforward perceptron
networks are generalized with differential equations, using variable time step algorithms
for learning in the time domain and including algorithms for learning in the frequency
domain (in that case linearized around a set of static bias points).
Artificial neural network models have a property called 'capacity', which roughly
corresponds to their ability to model any given function. It is related to the amount of
information that can be stored in the network and to the notion of complexity.
Generalisation and statistics
In applications where the goal is to create a system that generalises well to unseen
examples, the problem of overtraining has emerged. This arises in overcomplex or
overspecified systems when the capacity of the network significantly exceeds the needed
free parameters. There are two schools of thought for avoiding this problem: The first is
to use cross-validation and similar techniques to check for the presence of overtraining
and optimally select hyperparameters such as to minimise the generalisation error. The
second is to use some form of regularisation. This is a concept that emerges naturally in a
probabilistic (Bayesian) framework, where the regularisation can be performed by putting
a larger prior probability over simpler models; but also in statistical learning theory,
where the goal is to minimise over two quantities: the 'empirical risk' and the 'structural
risk', which roughly correspond to the error over the training set and the predicted error in
unseen data due to overfitting.
Confidence analysis of a neural network
Supervised neural networks that use an MSE cost function can use formal statistical
methods to determine the confidence of the trained model. The MSE on a validation set
can be used as an estimate for variance. This value can then be used to calculate the
confidence interval of the output of the network, assuming a normal distribution. A
confidence analysis made this way is statistically valid as long as the output probability
distribution stays the same and the network is not modified.
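The confidence calculation described above can be sketched as follows; the validation residuals, the network output, and the 95% level (z = 1.96) are illustrative assumptions:

```python
import math

# Validation MSE as a variance estimate, then a normal-approximation
# 95% confidence interval around one network output.
validation_errors = [0.1, -0.2, 0.15, 0.05, -0.1]   # assumed residuals
mse = sum(e * e for e in validation_errors) / len(validation_errors)
sigma = math.sqrt(mse)

output = 3.2                                        # some network prediction
lo, hi = output - 1.96 * sigma, output + 1.96 * sigma
print(round(lo, 3), round(hi, 3))
```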
By assigning a softmax activation function on the output layer of the neural network (or a
softmax component in a component-based neural network) for categorical target
variables, the outputs can be interpreted as posterior probabilities. This is very useful in
classification as it gives a certainty measure on classifications.
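A softmax output stage can be sketched as follows; the raw scores are arbitrary:

```python
import math

def softmax(scores):
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the resulting probabilities.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
```

The outputs are positive and sum to one, which is what lets them be read as posterior probabilities over the categories.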
First, I'd like to do a bit of philosophizing as a way to lead into the logic. Philosophers
and logicians have a lot of overlap in what they do. Many logicians are also philosophers,
and all philosophers are logicians to some extent (some much more so than others).
Given that there is such a connection between philosophers and logicians, I find it
striking just how radically different the fields are. Philosophers are interested in finding
deep truths about the world, be they epistemological, metaphysical, ethical, etc. Logicians
(qua logicians) are only interested in using a set of rules to manipulate arbitrary symbols
that have no relevance to the world.
The (sometimes difficult) marriage between philosophy and logic comes from the fact
that everyone in the world (except, I would argue, people who are commonly called
"crazy") accepts the truths proven by logic to be universally true and unquestionable.
Philosophy needs logic because in order to establish that a philosophical doctrine is true,
one needs to show that the doctrine is universally and unquestionably true. One needs to,
in other words, make a demonstration that everyone would accept as proof that the
proposition is true. To do that, the philosopher needs logic.
Logic accomplishes this magical universal acceptance because it makes only little tiny
steps. It does silly little things like:
ASSUMING: The dog is brown.
AND ASSUMING: The dog weighs 15 lbs.
I CONCLUDE: The dog is brown and the dog weighs 15 lbs.
Shorthand Sentences
Logicians, though, are very lazy people. They don't like to write long derivations in
English because English sentences can be fairly long. So what they do instead is a kind of
shorthand. If you give a logician a sentence like
The dog is brown.
he will pick a letter and assign it to that sentence. He now knows that the letter is just
shorthand for the sentence. The way I learned logic, capital letters are used for sentences
(with the exception of U, V, W, X, Y, and Z; I'll get to those later).
So let's just start at the beginning of the alphabet and use the letter "A" to represent the
sentence "The dog is brown." While we're at it, let's use the letter "B" to represent the
sentence "The dog weighs 15 lbs."
In addition to saving time and ink, this practice of using capital letters to represent whole
sentences has a couple of other advantages. The first is that to a logician, not every word
is as interesting as every other. Logicians are extremely interested in the following list of
words:
and
or
if...then
if and only if
not
They call these words "connectives." This is because you can use them to connect
sentences that you already have together to make new sentences.
When you write in English, those words don't stand out; they just get lost in the middle of
sentences. Logicians want to make sure the words look special, so they take the whole
rest of the sentence (the part they don't care about) and use a single letter to represent
that. Then their favorite words stand out. Let's rewrite our earlier example about the dog
using our logician's shorthand:
ASSUMING: A
AND ASSUMING: B
I CONCLUDE: A and B
The other advantage of using capital letters to represent sentences is that you ignore all
the information that isn't relevant to what you're trying to do. For the derivation I did
above, it didn't matter that the sentences were both about some dog. It didn't matter that
they were about weight or color. They could have just as easily been sentences about how
tall the dog is or about a cat or a person or a war or whatever. And if we can do the
derivation for A and B, then we can do the same exact derivation for C and D or E and N
or any other sentences we like.
Connectives
Now, as I said before, logicians are lazy. They really don't want to have anything to do
with English. So instead of using the English words:
and
or
if...then
if and only if
not
they use the following symbols:
^
v
->
<->
~
(Sometimes they also use a triple equals sign for '<->', but I can't type that.)
1. The symbols: ^, v, ->, and <-> are called "two-place connectives." This is because
they connect two sentences together into a more complicated sentence.
2. The symbol: ~ is called a "one-place connective" because you only add it to one
sentence. (You cannot join multiple sentences together with it.) To negate a
sentence, all you have to do is stick a ~ on the front.
3. When you join two sentences with a two-place connective, you ALWAYS put
parentheses around it. So it is NOT appropriate to write this:
A ^ B
Instead, you must write this:
(A ^ B)
Parentheses
I know that a lot of books and instructors claim that it is okay to drop the outermost
parentheses in a sentence. I've done it myself many times. And 95% of the time it won't
cause you trouble if you're careful. But let's say we started with this 'sentence'
A ^ B
and decided to negate it. Well, the way to negate a sentence is to stick a ~ on the front, so
let's do that:
~A ^ B
But wait! What we did there was just negate the A. We wanted to negate the whole
sentence. If we were really sharp, then we might notice that somebody had given us an
illegitimate sentence that was missing parentheses, and so we would add the parentheses
before adding the ~, to get:
~(A ^ B)
which is what we wanted.
It seems silly to make such a big deal about parentheses when we're dealing with simple
sentences, but when you're doing a 30-line derivation and you're tired, it's easy to make a
mistake just like that on line 17 and get yourself into real trouble. It's better to just
remember the simple rule and always add parentheses when you have a two-place
connective.
. . .
Let's take a deep breath and then go quickly over what we have so far.
Using Connectives
We have four two-place connectives:
^ (and)
v (or)
-> (if...then)
<-> (if and only if)
which you can use to join two sentences together, and one one-place connective:
~ (not)
which you can add to a sentence.
A simple sentence is one that has no connectives. For example: A (the dog is brown).
A complex sentence is a sentence which is made up of one or more simple sentences and
one or more connectives. Some examples are:
(A ^ B)
(A v B)
(A -> B)
(A <-> B)
~A
You can use connectives on complex sentences just as you can on simple sentences. Let's
introduce a new simple sentence "it is raining," and let's call our new sentence C. We now
have a lot more sentences that we can make. (Keep in mind, we have no idea yet which of
these sentences are true or false; we also don't yet know how these sentences relate.) For
example:
(C ^ B)
~C
(B v C)
(C v B)
(B -> C)
(A -> C)
(C -> A)
(B <-> C)
(~B <-> C)
~(B <-> C)
~~C
((A ^ B) v C)
These can get a little complicated. That last sentence is especially scary looking; we'll
come back to it in a little while. For now, here is a quick run-down of how to use
connectives to make complex sentences from simple ones.
Subsentences
Some of the sentences on our list have more than one connective, for example:
~(B <-> C)
~~C
((A ^ B) v C)
Consider first the sentence
(~B <-> C)
It was made by taking the two smaller sentences
~B
C
and connecting them with
<->
The complex sentence
~B
is called a "subsentence" of the larger sentence
(~B <-> C)
because it is a smaller sentence inside the large one.
The sentence (~B <-> C) contains two connectives,
<->
and
~
but they are not equally important. In this case the
<->
is much more important than the
~
Remember how the sentence was made by taking the two smaller sentences
~B
C
and connecting them with a
<->
The <-> is therefore called the "main connective" of the sentence. Main connectives are,
without a doubt, absolutely the most important idea in logic. The hardest skill to learn in
logic is to identify the main connective of a sentence. Make sure you understand what
main connectives are.
Now compare that sentence,
(~B <-> C)
with one that looks similar,
~(B <-> C)
This new sentence is very different. It was made by negating
(B <-> C)
The main connective of
~(B <-> C)
is therefore
~
and
(B <-> C)
is just a subsentence of
~(B <-> C)
Complicated sentences
Now let's take a closer look at the most complicated sentence on our list and see if we can
make it more manageable. The way to analyze a complicated sentence is to start at the
outside and work your way in.
Our most complicated sentence was
(((A ^ ~B) v ~C) -> (~(A v B) <-> C))
Its main connective is the
->
So the way to build our ugly sentence is to start with these two less ugly sentences:
((A ^ ~B) v ~C)
(~(A v B) <-> C)
and connect them with the main connective
->
We can then analyze each subsentence if we like.
I told you before that simple sentences are represented by the capital letters A through T,
and that U, W, X, Y, and Z are saved for something else. (I rarely use V because it looks
too much like the symbol for 'or'.) U, W, X, Y, and Z are used as shorthand for other
sentences in logic (some books use italic letters and others use Greek letters, but since I
only have plain text to work with, I use the end of the alphabet). I call these "sentence
variables."
[Note: it is also legal to use capital variables to stand for simple sentences. So you can
take the simple sentence
B
and use the letter Z to stand for it.]
This can be useful in analyzing complicated sentences. For example, if we have the scary
looking sentence
((((A ^ ~B) v (B <-> C)) -> (~(C v D) ^ ~(~A -> ~~D))) v A)
we can start using sentence variables to stand for subsentences. So if U stands for
(A ^ ~B)
Then we have
(((U v (B <-> C)) -> (~(C v D) ^ ~(~A -> ~~D))) v A)
and if V stands for
(U v (B <-> C))
then we have
((V -> (~(C v D) ^ ~(~A -> ~~D))) v A)
If W stands for
~(C v D)
we have
((V -> (W ^ ~(~A -> ~~D))) v A)
and if X stands for
~(~A -> ~~D)
we have
((V -> (W ^ X)) v A)
and if Y stands for
(V -> (W ^ X))
we have
(Y v A)
So we know where our main connective is. And by substituting back in for the sentence
variables, we can recreate our sentence in manageable chunks.
It is very important to keep track of what sentence variables stand for when you're doing
this kind of substitution. This can be a major source of error if you're not keeping close
track of what every letter stands for.
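This outside-in analysis can even be automated. Here is a small Python sketch (my own
illustration, not part of the original lesson) that finds the main connective of a fully
parenthesized sentence by scanning for a connective sitting outside all inner parentheses:

```python
def main_connective(sentence):
    """Return (connective, left, right) for a fully parenthesized sentence,
    or None if the sentence is simple (a capital letter) or a negation."""
    s = sentence.strip()
    if not s.startswith("("):
        return None
    inner = s[1:-1]                   # strip the outermost parentheses
    depth = 0
    for i, c in enumerate(inner):
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
        elif depth == 0:
            # longest connectives first, so '<->' is not mistaken for '->'
            for conn in ("<->", "->", "^", "v"):
                if inner.startswith(conn, i):
                    return (conn, inner[:i].strip(),
                            inner[i + len(conn):].strip())
    return None

print(main_connective("(((A ^ ~B) v ~C) -> (~(A v B) <-> C))"))
# prints ('->', '((A ^ ~B) v ~C)', '(~(A v B) <-> C)')
```

Because every non-simple, non-negated sentence has exactly one pair of outer
parentheses, the first connective found at the top nesting level is the main connective.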
. . .
Now that we know all the details of the language of symbolic logic, it's time to actually
do symbolic logic.
The first step with every sentence is to identify the main connective. The reason is
simple:
In symbolic logic, the main connective of a sentence is the only thing that you can
work with.
Let's look at our complicated sentence from earlier
(((A ^ ~B) v ~C) -> (~(A v B) <-> C))
There is no
->
in either subsentence, but the sentence as a whole is still first and foremost an implication
because of what its main connective is. So when you're trying to figure out how in the
heck you can work with this ugly sentence
(((A ^ ~B) v ~C) -> (~(A v B) <-> C))
you need to remember that it is an implication and treat it just as one.
Here's a quick course on what the connectives mean. (I assume you have some familiarity
with them.)
The sentence | is TRUE whenever                        | is FALSE whenever
A            | A is true                               | A is false
~A           | A is false                              | A is true
(A ^ B)      | A is true and B is true                 | A is false; or B is false; or both
(A v B)      | A is true; or B is true; or both        | A is false and B is false
(A <-> B)    | A and B are both true; or both false    | A is false and B is true; or A is true and B is false
(A -> B)     | A is false; or B is true; or both       | A is true and B is false
This last one is a little weird, so let's think about it. If we translate it back into English,
we get
[If the dog is brown, then the dog weighs 15 lbs.]
Let's say that the dog is brown and the dog weighs 15 lbs. Does that disprove the if...then
statement? Certainly not!
What if the dog is brown but the dog weighs 25 lbs.? That does disprove the statement.
What if the dog turns out to be white? Then we cannot disprove the inference because it
only makes a prediction about a brown dog. If the dog isn't brown, then we can't test the
prediction.
So the only way to make
(A -> B)
false is to make A true and B false at the same time. Given any other values of A and B,
the sentence comes out true.
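The table can also be checked mechanically. Below is a minimal Python encoding of the
five connectives (the function names are my own); the implication row is the only one
that tends to surprise people:

```python
def NOT(a):    return not a
def AND(a, b): return a and b
def OR(a, b):  return a or b
def IFF(a, b): return a == b          # (A <-> B): same truth value
def IMPLIES(a, b):                    # (A -> B): only false when A true, B false
    return (not a) or b

# Print the whole table for ->; only the row a=True, b=False comes out False.
for a in (True, False):
    for b in (True, False):
        print(a, b, IMPLIES(a, b))
```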
Now we're finally ready to learn the rules of logic. There are exactly 12 - no more, no
less. Each connective has two rules associated with it, and there are two special rules.
1. Assumptions
The first special rule is the rule of assumptions. It is deceptively easy. The rule is:
You may assume any sentence you like, at any time.
Well that makes sense. Let's say you and I are detectives trying to solve a mystery. I could
say something like "let's assume for the time being that the dog is brown." Once I said
that, we could discuss what that would mean. Anything we conclude from that
assumption is perfectly okay, as long as we remember that it was under the assumption
that the dog is brown. In other words, whatever we do prove under the assumption that
the dog is brown must be followed by a disclaimer "assuming that the dog is brown."
Eventually, we would want to prove something about the case that doesn't depend on the
dog being brown. Logicians call this "discharging" the assumption. Fortunately, some of
our other rules tell us how to discharge assumptions.
1. When I do derivations, I number each new line.
2. I start a new assumption with an opening curly bracket { and indent every line
that depends on that assumption.
3. When I discharge the assumption, I close the curly bracket }.
4. And then I stop indenting.
Once a pair of curly brackets is closed, the lines inside it are gone for good; lines
outside the curly brackets, however, are still legal to use.
This can get complicated if you have assumptions inside of assumptions.
2. -> Introduction
If you assume
X
and then you derive
Y
then you are entitled to discharge the assumption and write
(X -> Y)
That makes sense. Let's just say that we assumed
A
3. ^ Elimination
Let's learn one more rule for now. This one is called "^ elimination."
If you have
(X ^ Y)
then you are entitled to
X
and you are also entitled to
Y
That makes sense too. Let's say we knew for a fact that
(A ^ B)
A Derivation
Now let's take an example of a derivation. Suppose I wanted to prove that this is a logical
truth
((A ^ B) -> A)
I would start by identifying the main connective, which is a ->. I know how to introduce a
new ->, I assume the left and then derive the right. Let's try it:
{
1) (A ^ B) [assumption]
2) A [^elim on 1]
}
3) ((A ^ B) -> A) [->intro on 1-2]
[If (the dog is brown and the dog weighs 15 lbs) then the dog is brown]
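The derivation shows that ((A ^ B) -> A) is provable. Independently, we can confirm it
is a logical truth by brute force over all truth assignments - a Python sketch (the
lambda encoding of the sentence is mine):

```python
from itertools import product

def is_tautology(f, n):
    """True if f comes out True under every assignment to its n variables."""
    return all(f(*vals) for vals in product((True, False), repeat=n))

# ((A ^ B) -> A), with -> encoded as (not X) or Y:
assert is_tautology(lambda a, b: (not (a and b)) or a, 2)
# The converse, (A -> (A ^ B)), is NOT a logical truth:
assert not is_tautology(lambda a, b: (not a) or (a and b), 2)
```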
4. Repetition
There's one other special rule. It's called "repetition." The rule simply says that if you
have
X
Then you are entitled to write
X
Provided that it was not inside a closed curly bracket.
5. ^ Introduction
The other rule with ^ is called "^ introduction." It says, if you have
X
and you also have
Y
then you are entitled to
(X ^ Y)
That makes sense too. Let's say that I have already proven
E
1) (A ^ B) [assumption]
2) A [^elim on 1]
4) C [assumption]
6. -> Elimination
The other rule for -> is called "-> elimination." It says that if you have
X
and you have
(X -> Y)
then you are entitled to
Y
That makes sense too. If I know
A
1) (A ^ B) [assumption]
2) A [^elim on 1]
4) C [assumption]
8) (A ^ B) [assumption]
7. <-> Introduction
The next two rules have to do with <->. The first is called "<-> introduction." It states
that if you have
(X -> Y)
and you have
(Y -> X)
then you are entitled to
(X <-> Y)
This one is a little tricky to explain, and the best way (I'm sorry to say) is truth tables. So
you should try all the possible combinations for X and Y and convince yourself that if
(X -> Y)
and
(Y -> X)
are both true, then
(X <-> Y)
must be true too.
8. <-> Elimination
The next rule is called "<-> elimination." This one says that if you have
(X <-> Y)
And you have
X
Then you are entitled to
Y
OR
If you have
(X <-> Y)
And you have
Y
Then you are entitled to
X
This makes sense because if you know
(X <-> Y)
then you know that X and Y have the same truth value. So if you know one of them is
true, then the other must also be true.
A New Derivation
Let's start a new derivation.
{
1) (A ^ B) [assumption]
2) A [^elim on 1]
3) B [^elim on 1]
4) (B ^ A) [^intro on 2 and 3]
}
5) ((A ^ B) -> (B ^ A)) [->intro on 1-4]
{
6) (B ^ A) [assumption]
7) B [^elim on 6]
8) A [^elim on 6]
9) (A ^ B) [^intro on 7 and 8]
}
10) ((B ^ A) -> (A ^ B)) [->intro on 6-9]
11) ((A ^ B) <-> (B ^ A)) [<->intro on 5 and 10]
9. ~ Introduction
If you assume
X
and then, on that assumption, you derive both
Y
and
~Y
then you are entitled to discharge the assumption and write
~X
10. ~ Elimination
If you assume
~X
and then, on that assumption, you derive both
Y
and
~Y
then you are entitled to discharge the assumption and write
X
A Quick Derivation
{
1) (A ^ ~A) [assumption]
2) A [^elim on 1]
3) ~A [^elim on 1]
}
4) ~(A ^ ~A) [~intro on 1-3]
11. v Introduction
The next rule is "v introduction." It says that if you have
X
then you are entitled to
(X v Y)
and also to
(Y v X)
for any sentence Y you like.
That seems a little strange. Normally you wouldn't think you can just go throwing any old
sentence into a derivation. But remember
(X v Y)
is true as long as X is true OR Y is true OR both are true. So if you already know that X
is true, then the disjunction of X and anything else will be true.
A Short Derivation
1) A [assumption]
2) A [repetition of 1]
12. v Elimination
The last rule is a little tricky. It's "v elimination." It says if you have
(X v Y)
and you have
(X -> Z)
and you have
(Y -> Z)
Then you are entitled to
Z
[Most of the time this means that when you have a disjunction that you don't know what
to do with, you have to derive an implication for each side of the disjunction before you
can go on.]
The rule is hard to do with derivations, but it is actually not too hard to understand if you
take an example.
1) A [assumption]
2) A [repetition of 1]
5) (A -> A) [assumption]
6) C [assumption]
7) C [repetition of 6]
11) B [assumption]
12) C [assumption]
}
14) (C -> C) [->intro on 12-13]
[Not the most efficient way to prove (C -> C), but it is valid.]
There are a lot of other rules people try to tell you, but anything you can do with those,
you can do with these 12 rules.
The reason I like these rules is that with these rules you can do any derivation using the
same five steps:
Step 1: Find the main connective of the sentence you are trying to derive.
Step 2: Introduce that main connective using its introduction rule.
Step 3: When you're in the middle of a derivation and you don't know what to do,
find the main connective of the sentence you have and eliminate it.
Step 4: Along the way you may have to derive subsentences using steps 1 through
3.
Step 5: If all else fails, you may have to do a "~ elimination" [I'll explain this step
a little later].
If you use those five steps, you should always know which rule to use. The reason is that
there are ONLY four things you are ever allowed to do in a derivation:
2. Use the sentence you are on to eliminate the main connective of another sentence
(AS LONG AS THAT OTHER SENTENCE ISN'T CLOSED OFF IN CURLY
BRACKETS).
Now that we have the steps for doing derivations, let me try to explain that confusing
business about discharging assumptions. I'm going to approach this from a slightly
different angle this time.
Of the 12 rules I gave you, 8 are pretty straightforward. They are what I would call the
"Mundane Rules." The way Mundane Rules work is: they say "if you have X and Y and
Z, then you are entitled to U."
The tricky thing with Mundane Rules is knowing what you "have."
You "have" any sentence that is written down on a line of the derivation except those
which are closed off in curly brackets (which are gone forever once the brackets close).
Being "entitled" to something just means that you can legally write it down as the next
line of the derivation.
If you have
X
then you are entitled to
X
If you have
X
and you have
Y
then you are entitled to
(X ^ Y)
Simple enough. (I went into more detail on _why_ this is a sound rule in the last e-mail.)
If you have
(X ^ Y)
then you are entitled to
X
Or, if you prefer, you are also entitled to
Y
So far so good.
If you have
X
and you have
(X -> Y)
then you are entitled to
Y
This is actually the same thing as Modus Ponens, so you can call it that if you prefer.
Since I don't speak Latin, I prefer calling it "-> elimination" because that is more
descriptive of what the rule is doing.
If you have
(X -> Y)
and you have
(Y -> X)
then you are entitled to
(X <-> Y)
This one is a little tricky to explain. Let's assume somehow we have
(X -> Y)
Under what conditions could that be true? There are 3 possibilities:
X is true and Y is true
X is false and Y is true
X is false and Y is false
If (Y -> X) is also true, the middle possibility (X false, Y true) is ruled out, leaving
only the cases where X and Y have the same truth value - which is exactly when
(X <-> Y) is true.
If we have
(X <-> Y)
and we have
X
then we are entitled to
Y
OR:
If we have
(X <-> Y)
and we have
Y
then we are entitled to
X
If you have
X
then you are entitled to
(X v Y)
and you are also entitled to
(Y v X)
This is a little tricky too. We "have" X, which means X must be true. Now we can just,
out of the blue, pick any sentence we like and put it into a disjunction with X. Why can
we do that? Well, let's say we pick a FALSE sentence. Is that still okay?
Yes, it is! Even if Y is false, the disjunction with X is still true, so we haven't written a
false sentence, and we are still okay.
If you have
(X v Y)
and you have
(X -> Z)
and you have
(Y -> Z)
then you are entitled to
Z
So much for the Mundane Rules. Mundane Rules are useful in derivations because they
let you move from one step to the next. They tell you what you can do with the sentences
you have. They also can give you a hint as to what you need to do next. For example, if
you have
(X v Y)
and you want to eliminate the v, but you don't have
(X -> Z)
(Y -> Z)
yet, then you'd better go get those two sentences.
The problem with the Mundane Rules is that they only let you play around with sentences
you already HAVE. You can't get anything NEW out of them.
So far we've gone over the 8 Mundane Rules. There are 12 rules in total. Of the 4
remaining, one is a 'Special Rule' and three are 'Fun Rules'.
Special Rule
You are free to assume anything you like at any time as long as you do these things:
1. Use curly brackets and indentation to keep track of what you have assumed.
[Discharging an assumption just means you close the curly brackets and stop
indenting. So you can forget about the assumption.]
The three Fun Rules all share the same shape:
If you assume
X
and then, on that assumption, you derive
Y
you can discharge the assumption you made at X and then you are entitled to
Z
where Z depends on which rule you are using. The first Fun Rule is -> introduction:
If you assume
X
And then, on that assumption, you derive
Y
You can discharge the assumption you made at X and then you are entitled to
(X -> Y)
This is, in my opinion, the most important and fundamental rule in logic. It is the
foundation of all logic. [It's also really important for derivations. If you look up at the
Mundane Rules, a lot of them require you to have sentences of the form (X -> Y) to apply
them.]
The last two Fun Rules are closely related. One is ~ introduction:
If you assume
X
And then, on that assumption, you derive
Y
and
~Y
you can discharge the assumption you made at X; then you are entitled to
~X
[You may have been taught this rule as reductio ad absurdum. The idea is that if
assuming X leads you to a contradiction, then there must've been something contradictory
ABOUT X ITSELF. So X must be false. If X is false, then by definition ~X is true: no ifs,
ands, buts, or assumptions about it.]
The last Fun Rule is ~ elimination:
If you assume
~X
and then, on that assumption, you derive
Y
and
~Y
you can discharge the assumption you made at ~X and then you are entitled to:
X
The idea here is basically the same. ~X is contradictory and therefore false, so X is
proven true.
I have another nickname for ~ elimination. It is what I call the "Fallback Rule." With
every other rule, the way it works is by getting what you want by introducing the main
connective or by using what you have by eliminating the main connective. But take a
look back at what I just did with ~ elimination. I just proved X is true. There's no way to
tell by looking at X that you can prove it by eliminating a ~, but you can. [Actually, ANY
sentence you like can be proven with ~ elimination, but it is sometimes hard to do.]
So this is where Step 5 of the derivations comes from. If you are trying to prove some
sentence X, the first thing to try is to try to introduce the main connective of X. But if you
run into a dead end doing that, then assume ~X and try to derive a contradiction.
Those are the rules reviewed and better organized so that they make sense, and the
difficult bit about discharging assumptions is (I hope) a little clearer.
One other thing to watch out for. Some logic problems ask you to prove that a certain
sentence is a logical truth. On those problems, you have to discharge all your assumptions
and prove that the sentence is true with no assumptions (that is, write it without indenting
and outside of all the curly brackets). I'll do an example of a derivation like that in a
minute.
Deriving a Conclusion
Other logic problems give you a list of "givens" or "hypotheses" and ask you to derive a
conclusion from them. In those problems, what they are saying is that you need to assume
the hypotheses, but not discharge those assumptions. Let me give you an example:
Given:
A
(A -> B)
(~B v C)
Prove:
C
Here we go:
1) A [assumption, given]
2) (A -> B) [assumption, given]
3) (~B v C) [assumption, given]
4) A [repetition of 1]
5) (A -> B) [repetition of 2]
6) B [->elim on 4&5]
Now what do we do? We have a disjunction on 3 that we don't know what to do with, so
we need to eliminate it. But in order to eliminate it, we need to get (~B -> something) and
(C -> something).
{
7) ~B [assumption]
Let's see if we can derive C. If we derive (~B -> C) then we'll be most of the way to
finishing the problem. How can we derive C? Well, we should try to introduce the main
connective. But wait! There is no main connective. C is just a simple sentence. So what
can we do? I guess we have to do step 5, try ~elimination.
{
8) ~C [assumption]
9) B [repetition of 6]
10) ~B [repetition of 7]
}
11) C [~elim on 8-10]
This may seem like a bit of sleight of hand, like I'm trying to pull the wool over your
eyes. How can I use the assumption I made at line 7 as part of the contradiction? I just did
a ~elimination to prove C, but there was nothing contradictory about ~C itself; the
contradiction was that I had B and then I assumed ~B. This is the familiar refrain
"anything can be proven from a contradiction." Once I assumed ~B, I could've proven
(~B -> anything-I-want). I chose to prove (~B -> C) because I eventually want to get C.
}
12) (~B -> C) [->intro on 7-11]
Now we have made some progress on eliminating the disjunction we had on line 3: (~B v
C). We have (~B -> C), now we need (C -> C), so let's go get it.
{
13) C [assumption]
14) C [repetition of 13]
}
15) (C -> C) [->intro on 13-14]
16) C [velim on 3, 12, and 15]
Notice that I can repeat 13 because I have not yet discharged that assumption. However, I
cannot repeat the C on line 11 because I closed off line 11 already, so it is gone forever.
Not all of those repetitions were necessary, since we "had" those lines already (they
hadn't been closed off), but I added them for clarity.
You should go back and double-check the derivation to make sure that I never broke the
rules by using a line that was closed off and that I didn't break any other rules. Also, make
sure that I discharged all the assumptions except the three I was given at the start.
Deriving a Sentence
As I said before, the other type of problem is where you are handed a sentence and told to
derive it. In this problem, you can make any assumptions you need, but you have to
discharge all of them and end up with the sentence you're looking for at the end. This
usually involves finding the main connective of the sentence you're supposed to prove
and then introducing it (sometimes you have to find other sentences too: for example if
the main connective is ^, you need to prove each half of the sentence and then do an
^intro).
Sometimes trying to do this will get you to a dead end, and then you may try to assume
the negation of the sentence you're trying to get and see if you can find a contradiction.
Let's derive the sentence
(((~A ^ ~C) v (~C <-> B)) -> (B -> ~C))
The main connective is a ->, so let's introduce it. To do that we need to assume the left
and derive the right.
{
1) ((~A ^ ~C) v (~C <-> B)) [assumption]
Here we have a disjunction, so we need to eliminate it. That means we need to find two
entailments. It would be great if we had ((~A ^ ~C) -> (B -> ~C)) and ((~C <-> B) -> (B
-> ~C)), so let's try to get those. First, let's work on ((~A ^ ~C) -> (B -> ~C)).
{
2) (~A ^ ~C) [assumption]
3) ~A [^elim on 2]
4) ~C [^elim on 2]
We actually don't need line 3, but it's good practice to get both sides of an ^ while you
can just in case you might need them later. Now we have ~C, but what we want is (B ->
~C), so let's work toward getting that.
{
5) B [assumption]
6) ~C [repetition of 4]
} [closes 5-6]
7) (B -> ~C) [->intro on 5-6]
} [closes 2-7]
8) ((~A ^ ~C) -> (B -> ~C)) [->intro on 2-7]
So now what we need to finish the velimination on line 1 is to derive ((~C <-> B) -> (B
-> ~C)).
{
9) (~C <-> B) [assumption]
{
10) B [assumption]
11) ~C [<->elim on 9 and 10]
If you wanted to, you could make this a little more clear by repeating (~C <-> B) and
then doing the <->elim, but it's not necessary: since we have both (~C <-> B) and B, we
are entitled to ~C.
} [closes 10-11]
12) (B -> ~C) [->intro on 10-11]
} [closes 9-12]
13) ((~C <-> B) -> (B -> ~C)) [->intro on 9-12]
Now we can do our velimination on line 1 because we have ((~C <-> B) -> (B -> ~C))
and ((~A ^ ~C) -> (B -> ~C)). If you want, for clarity, you can repeat line 1 and line 8,
but it's not necessary.
14) (B -> ~C) [velim on 1, 8, and 13]
} [closes 1-14]
15) (((~A ^ ~C) v (~C <-> B)) -> (B -> ~C)) [->intro on 1-14]
As always, you should go back and double-check the derivation once you're done.
The hardest thing about doing derivations is figuring out what to do next. When you have
a lot of random rules with Latin names to choose from, it's difficult. This set of rules
helps you to know what to do by either introducing what you're trying to get or
eliminating what you have.
My advice to you is to try to do some of the derivations in your book or that you had for
your class using these rules. Any derivation is possible with them. It takes a lot of time to
learn logic and have it sink in, but if you take it slowly enough and practice, it will
become easier. Get comfortable with these 12 basic rules and the 5-step method for doing
derivations.
In many (probably most) places, logic is not taught with these 12 rules. Even though I
think these rules make the most sense and allow a straightforward approach to solving
any problem in symbolic logic, more advanced students may want to study shortcut rules
such as Modus Tollens and DeMorgan's Law.
The twelve rules I've presented here are systematic and straightforward, and all of them
move by baby steps. The way I think of these rules (in some cases this is not historically
accurate) is that logicians noticed that when doing derivations, they often repeated the
same steps over and over. Eventually, someone decided that rather than doing these same
five or ten steps, you can take shortcuts. For example, suppose I have:
(A -> B)
And I have:
~B
And I am trying to get:
~A
Here's what I'd have to do. Since I'm trying to find ~A, I'll do a ~introduction on A:
1) (A -> B) [assumption, given]
2) ~B [assumption, given]
{
3) A [new assumption]
4) B [->elim on 1 and 3]
5) ~B [repetition of 2]
}
6) ~A [~intro on 3-5]
We have to do this so often that we just call these five steps "Modus Tollens." Modus
Tollens is a shortcut rule. There are several others too, some more involved.
The following is a list of the major rules, together with a justification of why each of
them is valid and a short example of how you might use some of the more challenging
ones.
1. Modus Ponens
This is one of the most straightforward laws in logic. It states that if you have
(X -> Y)
and you have
X
then you can conclude
Y
The reason it works is that we are given (X -> Y). Which means that X cannot be true at
the same time Y is false. So if X is true (which is the other given), then Y must be true as
well, so we are free to conclude Y is true.
Example: "If it is raining, then there are clouds" and "it is raining" together imply "there
are clouds."
2. Modus Tollens
This law is just the flip side of modus ponens. It states that if you have
(X -> Y)
and you have
~Y
then you can conclude
~X
The reason this works is that we are again given (X -> Y). This means that X cannot be
true at the same time Y is false. So if Y is false (which is the other given), then X must be
false as well. So we are free to conclude X is false (or ~X is true).
Example: "If it is raining, then there are clouds" and "there are no clouds" together imply
"it is not raining."
3. DeMorgan's First Law
DeMorgan came up with a couple of sets of equivalencies. The first is that
~(X ^ Y)
is interchangeable with
(~X v ~Y)
The reason this works is that our starting point is ~(X ^ Y), which is the negation of (X ^
Y). Now, (X ^ Y) can only be true if X is true and Y is also true. So (X ^ Y) will be false
if X is false or if Y is false. That is, (X ^ Y) will be false if (~X v ~Y) is true. So ~(X ^ Y)
is equivalent to (~X v ~Y).
Example: "My dog is fat, or my cat is fat" is equivalent to "It is not true that both my dog
and cat are thin."
4. DeMorgan's Second Law
The second is that
~(X v Y)
is interchangeable with
(~X ^ ~Y)
The only way in which ~(X v Y) can be true is if X and Y are both false. So the two
expressions can be interchanged just like in the first law.
Example: "My dog is fat and my cat is fat" is equivalent to "It is not true that my dog or
cat is thin."
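"Interchangeable" means the two sentences carry the same truth value under every
assignment. Both DeMorgan laws can be verified exhaustively with a short Python sketch
(the encoding is mine):

```python
from itertools import product

def equivalent(f, g, n=2):
    """True if f and g carry the same truth value under every assignment."""
    return all(f(*v) == g(*v) for v in product((True, False), repeat=n))

# First law: ~(X ^ Y) is interchangeable with (~X v ~Y)
assert equivalent(lambda x, y: not (x and y),
                  lambda x, y: (not x) or (not y))
# Second law: ~(X v Y) is interchangeable with (~X ^ ~Y)
assert equivalent(lambda x, y: not (x or y),
                  lambda x, y: (not x) and (not y))
```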
5. Hypothetical Syllogism
This states that if you have
(X -> Y)
and you have
(Y -> Z)
then you can conclude
(X -> Z)
Here's why:
We know that "if X is true, then Y is true." And we know that "if Y is true, then Z is true."
But we don't know anything about whether any of the letters are actually true or not.
Let's assume (or hypothesize) for a second that X is true. Then, by modus ponens, Y is
true. And then by modus ponens again, Z is true. So: If we assume X is true, then we
conclude Z is true. Since we didn't know X was true, we cannot take Z home with us, but
we can say that "If X was true, then Z would be true." This is equivalent to saying "If X,
then Z" or (X -> Z).
Example: "If it is raining, then there are clouds" together with "if there are clouds, then
the sun will be blocked" imply "if it is raining, then the sun will be blocked."
6. Disjunctive Syllogism
This states that if you have
(X v Y)
and you have
~X
then you can conclude
Y
Here's why:
We know first of all that "X or Y is true." We also know that X is false. If X or Y is true,
and X is false, then Y has no choice but to be true. So we can conclude that Y is true.
Example: "My dog is fat or my cat is fat" together with "my dog is thin" imply "my cat is
fat."
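Both syllogisms can be checked the same way as the other shortcut rules: no assignment
may make every premise true and the conclusion false. A Python sketch (the encoding is
mine):

```python
from itertools import product

def valid(premises, conclusion, n):
    """No assignment may make every premise true and the conclusion false."""
    return all(conclusion(*v)
               for v in product((True, False), repeat=n)
               if all(p(*v) for p in premises))

imp = lambda a, b: (not a) or b       # encodes (X -> Y)

# Hypothetical Syllogism: (X -> Y), (Y -> Z), therefore (X -> Z)
assert valid([lambda x, y, z: imp(x, y), lambda x, y, z: imp(y, z)],
             lambda x, y, z: imp(x, z), 3)
# Disjunctive Syllogism: (X v Y), ~X, therefore Y
assert valid([lambda x, y: x or y, lambda x, y: not x],
             lambda x, y: y, 2)
```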
7. Proof by Contradiction
This states that if you assume
X
and from that assumption you derive
(Y ^ ~Y)
then you can conclude that your assumption was false, and
~X
[This is the same idea as ~ introduction. See also Proof by Contradiction:
http://mathforum.org/library/drmath/view/62852.html ]
8. Double Negation
This states that
~~X
is interchangeable with
X
9. Switcheroo
(I've heard that this was actually named after a person, but I don't know that for
certain.) This states that
(X v Y)
is interchangeable with
(~X -> Y)
To understand why, let's think about (~X -> Y). This says that ~X cannot be true at the
same time that Y is false. Or, to put that another way, X cannot be false at the same time
Y is false.
So (~X -> Y) can only be false when X and Y are both false. Similarly, the only way for
(X v Y) to be false is to have X and Y both false. So the two expressions are true unless X
and Y are both false, so they have the same "truth conditions" and are therefore
equivalent (i.e. interchangeable).
Example: "My dog is fat, or my cat is fat" is equivalent to "If my dog is thin, then my cat
is fat."
(This one is hard to wrap your mind around, but think about what must be true/false
about the world in order to make each statement true or false and it should eventually
become clear.)
11. Simplification
This states that if you have
(X ^ Y)
then you can conclude
X
(and likewise Y). It is just the ^ elimination rule under another name.
If there's a rule you don't understand, try to use the twelve systematic rules to figure out
how the rule works. Once you see the steps in deriving the rule and you know why it is a
valid shortcut, you won't have any trouble using it. And remember, if you get stuck and
don't know what to do, you can always fall back on the twelve systematic rules.
In normal binary logic, the answer to a question like "Is Joe tall?"
would have to be either "yes" or "no" - either 1 or 0. In terms of
attributes, Joe would either have "tallness," or he wouldn't.
This is one of the things that makes binary logic break down so easily
when you try to apply it to the real world, where people are "sort of
tall," food is "mostly cooked," cars are "pretty fast," jewelry is
"very expensive," patients are "barely conscious," and so on.
In fuzzy logic, Joe can have a tallness value of, say, 0.9, which can
combine with values for other attributes to produce "conclusions" that
look more like "The clothes have a dryness value of 0.91" than "The
clothes are dry" or "The clothes are not dry." It is frequently used
to control physical processes - washing or drying clothes, toasting
bread, bringing trains to smooth stops at the right places, keeping
planes on course, and so on. It's also used to support decisions -
whether to buy or sell stock, whether to support or oppose a
particular political position, and so on.
In fuzzy logic, the conjunction (A ^ B) carries the minimum of the truth
values of A and B, and the disjunction (A v B) carries the maximum. For
example, if A has a truth value of 0.2 and B has a truth value of 0.6,
then the maximum of these is 0.6 and the disjunction (A v B) will carry
the truth value of 0.6.
Negation is extended to fuzzy logic by saying that ~A has a truth
value equal to 1 minus the truth value of A.
Next, let's talk about -> (if, then). In Boolean logic, (A -> B) is
ONLY false if A is true AND B is false; in all other cases it is true.
Or, (A -> B) carries a value of 1 UNLESS A has a value of 1 and B has
a value of 0. So, if A has a value of 0, then (A -> B) will definitely
have a value of 1; or if B has a value of 1 then (A -> B) will
definitely have a value of 1.
Fuzzy logic uses this definition. The truth value of (A -> B) is equal
to the maximum of the truth value of ~A and the truth value of B.
For example, if B has a truth value of 0.5 and A has a truth value of
0.4, then ~A has a value of 0.6. The maximum of B (0.5) and ~A (0.6)
is 0.6 so (A -> B) will have a value of 0.6.
The last connective, <-> (if and only if), is the hardest to extend to
fuzzy logic.
My first guess for how to apply this to fuzzy logic was that it is
simply an equals relation. So if A and B have equal values, then
(A <-> B) will have a value of 1; otherwise it will have a value of 0.
This is okay as a first guess, but the problem is that now (A <-> B)
can only have 1 or 0 as a value, which is not really very fuzzy at
all.
Let's go back to Boolean logic for a minute and think a bit more about
(A <-> B). This is read "A if and only if B." It means that if we know
A is true, then we can conclude that B is true AND if we know B is
true, then we can conclude that A is true. In other words, (A <-> B)
is equivalent to the conjunction of (A -> B) and (B -> A) or to put it
more formally:
((A -> B) ^ (B -> A))
Specifically:
So first you figure out the value of (A -> B) (which is the maximum
of ~A and B), then you figure out the value of (B -> A) (which is the
maximum of ~B and A), and then you take the minimum of those.
Let's say that A and B both have truth values of 0. What does
(A <-> B) have in this case? Step one: (A -> B) is the maximum of ~A
(which is 1) and B (which is 0), so it is 1. Step two: (B -> A) is the
maximum of ~B (1) and A (0), so it is also 1.
Finally, we take the minimum of the first two steps above. Step one
gave us 1 and step two gave us 1, so the minimum is 1.
Now let's say that A has a value of 1 and B has a value of 0. Step
one: (A -> B) is the maximum of ~A (0) and B (0), which is 0. Step
two: (B -> A) is the maximum of ~B (1) and A (1), which is 1.
Finally, we take the minimum of the first two steps above. Step one
gave us 0 and step two gave us 1, so the minimum is 0.
For a fuzzy logic example, let's say that A has a value of 0.7 and B
has a value of 0.5. What is the value of (A <-> B)? Step one: (A -> B)
is the maximum of ~A (0.3) and B (0.5), which is 0.5. Step two:
(B -> A) is the maximum of ~B (0.5) and A (0.7), which is 0.7.
Finally, we take the minimum of the first two steps above. Step one
gave us 0.5 and step two gave us 0.7, so the minimum is 0.5. So in
this example (A <-> B) has a truth value of 0.5.
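The min/max arithmetic above is easy to mechanize. Here is a Python sketch of the fuzzy
connectives as described in the text (the function names are mine), reproducing the
final example:

```python
def f_not(a):         return 1 - a
def f_and(a, b):      return min(a, b)
def f_or(a, b):       return max(a, b)
def f_implies(a, b):  return max(f_not(a), b)   # max of ~A and B
def f_iff(a, b):      # min of (A -> B) and (B -> A)
    return min(f_implies(a, b), f_implies(b, a))

# A = 0.7, B = 0.5:
#   (A -> B) = max(0.3, 0.5) = 0.5
#   (B -> A) = max(0.5, 0.7) = 0.7
#   (A <-> B) = min(0.5, 0.7) = 0.5
print(f_iff(0.7, 0.5))   # prints 0.5
```

Note that when A and B are restricted to the values 0 and 1, these functions agree
exactly with the Boolean connectives, so fuzzy logic genuinely extends binary logic.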