
AI versus Statistics: Some common topics.

Karina Gibert(1) Jorge Rodas(2) Javier Gramajo(2)


karina@eio.upc.es jr@lsi.upc.es jgramajo@lsi.upc.es

(1) Technical University of Catalonia.


Statistics and Operations Research Department.
Pau Gargallo 5.
08028 Barcelona, Spain.

(2) Technical University of Catalonia.


Informatic Systems and Languages Department.
Jordi Girona Salgado 1-3, M. C6, D. 201.
08034 Barcelona, Spain.
January 9, 2008

Abstract
This document addresses some topics traditionally studied by both Statistics and Artificial Intelligence, and tries to establish the resemblances and differences between the techniques proposed in one field and the other. Finally, the possibilities of those techniques in the modern information society are studied.
Keywords: Clustering, Classification, Artificial Intelligence, Statistics, Data Mining, Knowledge Discovery.

1 Introduction
This document focuses on some formal problems which have been studied both from a statistical point of view and from an Artificial Intelligence point of view.
Different techniques were proposed by the two sciences to solve the same situations, and a discussion of those solutions is presented here.
The authors do not pretend to be exhaustive, but rather to suggest a reflection on some application aspects of those techniques.
The structure of the document is the following. First of all, a very brief history of both Statistics and Artificial Intelligence is introduced in sections §2 and §3; this allows the reader to understand the following sections. An introduction to Machine Learning is given in section §4, and an overview of Inductive Learning and Conceptual Clustering in section §5. Some solutions proposed by Statistics and by Artificial Intelligence are discussed, and their differences and resemblances analyzed, in sections §6 and §7. Finally, a brief summary of the history of Knowledge Discovery in Databases and Data Mining, new trends related to the information society and new technologies, and some expectations about its future are presented in section §8.

2 Statistics Essentials
Statistics is a very ancient science and covers a very broad field, from pure probability theory to the modern theories of exact statistics. In this section we present only a few historical references, mainly those which are relevant to this document. For a detailed history of Statistics, see Droesbeke [15] or Everitt [18].
The term Statistics derives from the Latin word status, which refers to the political and social situation, to the State. The origin of Statistics can be set in the first counting-oriented works of the ancient empires: China, Egypt, Greece, ... soon experienced the need to count inhabitants, houses, families, the quantities of fruit picked in a certain year, etc.
So, Statistics began as the science of collecting the economic and demographic data which were relevant to the State. It evolved over the years, and today it can be defined as the science concerned with the collection and analysis of data, in order to extract the information they contain and to present it in an understandable and synthetic way.
At the beginning, only a descriptive purpose was held. Later, sampling techniques were used, and knowledge about the whole population was inferred from the analysis of a sample. Sampling theory and mathematical Statistics are concerned with those goals.
The end of the 18th century produced a fertile scientific corpus. It was the time of Darwin; in that period Galton presented the first works on regression analysis (1877), and Pearson presented, among other works, a preliminary version of principal component analysis in 1901. His main disciple was Fisher (1890-1962), whose works can be considered the establishment of modern statistics.
Around 1930, psychometry set out new problems which triggered the development of multivariate data analysis. The measurement of latent factors which could not be directly observed, like intelligence, memory, ..., motivated the factor analysis of Spearman (1904). Hotelling (1933) generalized these ideas, and Pearson's, into the multivariate technique of principal component analysis. Those techniques are mainly descriptive and try to show a perspective of the relationships among the variables taken altogether.
In this line, Fisher and also Mahalanobis presented, in 1936, the first works on discriminant analysis, in which there exists a response variable telling the class of every object; the best linear combination of all the variables for distinguishing the classes is found.
On the other hand, the formation of, and distinction between, different classes of objects (clustering) has been practiced for a very long time. Statistics faced this problem, and in 1757 Adanson introduced the principle of using distances between objects to group the most similar ones into classes and iterate the process on the classes to obtain a hierarchy. Clustering techniques require lots of calculations, so they regained currency when computers became powerful: in 1963, Sokal and Sneath presented numerical taxonomy, which can be considered the first modern formulation of clustering. Modifications of these basic algorithms have been presented up to today, but the basic ideas still remain.

3 About Artificial Intelligence


Artificial Intelligence (AI) is also a broad field, but much younger than Statistics. AI as a formal discipline has been around for a little over thirty years. We can consider that the establishment of AI as a discipline occurred at the famous Dartmouth Conference of 1956, where McCarthy coined the term.
Artificial Intelligence is concerned with getting computers to do tasks that require human intelligence. A reasonable characterization of the general field is that it is intended to make computers do things that, when done by people, are described as having indicated intelligence [5].
We believe that AI techniques aim to build software and/or hardware that does what humans and animals do. Basically, AI intends machines to be able to learn in the way that humans or animals learn, to communicate, to reason, to classify, to make decisions, etc.
However, there are many tasks which we might reasonably think require intelligence, such as complex arithmetic, which computers can do very easily. Conversely, there are many tasks that people do without even thinking, such as recognizing a face, which are extremely complex to automate. AI is concerned with these difficult tasks, which seem to require complex and sophisticated reasoning processes and knowledge.
The history of Artificial Intelligence can be graphically represented by a nice metaphor: a simple plant, such as the nopal [58]. The nopal (Opuntia spp.), also known as the prickly pear cactus, is a member of the cactus family (see fig. 1).

AI, like any plant, has:

• roots: parallelism and Von Neumann,

• stems: micro-distributed AI, evolutionary AI, etc.,

• fruits: natural language processing techniques, neural networks, etc.,

• nutrients: neurophysiology, linguistics, etc., and

• supports: robot building and simulation, computational techniques and cognitive science.

The beginnings of AI were disorganized, but today it is a systematized discipline. AI first began under the Von Neumann paradigm and sequential computational techniques; it was naturally supported by the mathematics and logic developed at that moment. Then AI was based on the symbol idea, introduced by Newell and Simon in 1956. In their work on the Logic Theorist [59], a program that proved logic theorems by searching a tree of subgoals was presented. The program made extensive use of heuristics to prune its search in a symbolic space of solutions. With this success, the idea of heuristic search soon became dominant within the still tiny AI community. In 1961, Minsky broke AI into five key topics: search, pattern recognition, learning, planning and induction [54]. According to this breakdown, most of the serious work in AI was concerned with search. Eventually, after much experimentation, search methods became well understood, formalized and analyzed, and became celebrated as the primary method of Artificial Intelligence.
At the end of the era of establishment, in 1963, Minsky produced an exhaustive annotated bibliography [55] of the literature directly concerned with the construction of artificial problem-solving systems. There are two main points of interest here: an introduction to artificial problem-solving systems, and an AI bibliography. This bibliography includes many items on cybernetics, neuroscience, bionics, information and communication theory, and first-generation connectionism.
Then advances in AI were explosive, especially when the first successes of its application to diagnosis-oriented problem solving (e.g. MYCIN, 1976, which diagnosed infections) and other techniques such as expert systems, knowledge representation, machine learning (see §4), reasoning, natural language processing, etc., were publicized. However, symbolic representations showed serious limitations when facing complex, real, big problems, mainly because almost all AI problems are NP-complete. As a consequence, towards the late 70's, the production of these techniques began to decrease, and other paradigms were explored to find new solutions to the typical AI problems. Actually, what "new" techniques do is change the tools used to solve the problems, but the objectives remain the same.
Considering the parallel paradigm (the computer's parallel architecture), something called micro-distributed AI appeared around the 70's, and some authors called it, by its implicit metaphor, artificial neural networks (ANN). ANN are architectures of networks of simple operators that imitate the structure of natural neural networks (NNN). ANN were born in 1943, when McCulloch and Pitts described them for the first time. These architectures were introduced into the AI context and were used to solve the typical AI problems. ANN are based on a non-representational paradigm in which the concept of programming is substituted by the concept of training. At the beginning, these techniques yielded some successes (e.g. in pattern recognition problems). Then Minsky and Papert published Perceptrons (1969), where they showed that ANN had an extremely limited representation ability. That is why many people deserted ANN, and only a few researchers continued despite Minsky and Papert's critique, most notably Teuvo Kohonen, Stephen Grossberg, James Anderson, Kunihiko Fukushima, Rumelhart, and McClelland. Interest in ANN re-emerged only after some important theoretical results were attained in the early eighties (1982).
More recent approaches were inspired by a learning algorithm known as back propagation. Back propagation has some problems: it is generally slow to learn, and there is a learning rate which needs to be tuned by hand in most cases. These problems combine to make back propagation, which is the cornerstone of modern neural network research, inconvenient for use in embodied or situated systems.
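
For reference, the role of this hand-tuned learning rate can be seen in the gradient descent update at the heart of back propagation (standard textbook form, not taken from this paper):

w(t+1) = w(t) - η ∇E(w(t))

where E is the network error on the training set and η is the learning rate: too small an η makes learning slow, too large makes it unstable, which is why it usually has to be tuned by hand.
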
In between the parallel paradigm and the symbol paradigm, evolutionary AI and macro-distributed AI appeared. The first is characterized by genetic algorithms, and the second by multiagent systems and other techniques. Evolutionary AI performs a parallel heuristic search inside a big solution space, looking for optimal solutions based on a reward-and-punishment tactic. Macro-distributed AI shares with micro-distributed AI the idea of a network of operators, but now it is not a simple network, since it is composed of cooperative computational agents of great complexity themselves. Multiagent systems are at the basis of reactive robots, which is the hard area of AI.

4 Machine Learning.
Machine Learning is a set of techniques that involves searching a very large space of possible hypotheses to determine the one that best fits the observed data and any prior knowledge held by the learner. The goal of Machine Learning is to create systems that improve their own task performance through the experience acquired from data.
Machine Learning draws on ideas from a diverse set of disciplines: artificial intelligence, probability, statistics, computational complexity, information theory, and philosophy [56].
Data analysis techniques traditionally used for such tasks include regression analysis, cluster analysis, numerical taxonomy, multidimensional analysis, other multivariate statistical methods, stochastic models, time series analysis, nonlinear estimation techniques, and others (e.g., Daniel and Wood, 1980; Tukey, 1986; Morganthaler and Tukey, 1989; Diday, 1989; Sharma, 1996). These techniques are used to solve a lot of practical problems. However, they are oriented towards the extraction of quantitative and statistical characteristics of the data.
For example, a statistical analysis can determine the covariances and correlations between the variables in the data. However, it cannot characterize those dependencies at an abstract, conceptual level. To do this task, a data analysis system has to be equipped with a substantial amount of background knowledge, and be able to perform symbolic reasoning tasks involving that knowledge and the data.
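
As a concrete illustration of this limitation, the following minimal sketch (our illustration on invented toy data, not from the original text) computes exactly the kind of quantitative characterization discussed above; the resulting numbers say how strongly the variables co-vary, but carry no conceptual reading of why:

import numpy as np

# Invented toy data: rows are objects, columns are two numerical variables.
age = np.array([23, 35, 47, 52, 61], dtype=float)
blood_pressure = np.array([118, 121, 133, 140, 152], dtype=float)

# A statistical analysis yields covariance and correlation...
cov = np.cov(age, blood_pressure)[0, 1]
corr = np.corrcoef(age, blood_pressure)[0, 1]
print(f"covariance = {cov:.2f}, correlation = {corr:.3f}")
# ...but the conceptual reading ("pressure rises with age") requires
# background knowledge that the numbers alone do not carry.

Interpreting the correlation conceptually is precisely the step that requires background knowledge external to the data.
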
In order to overcome some of these limitations, researchers looked for ideas and methods developed in the Machine Learning field, which is a natural source of ideas for this purpose. The essence of Machine Learning is to develop computational models for acquiring knowledge from facts and background knowledge. These efforts have been gathered in a new research area, known as Data Mining or Knowledge Discovery in Databases (e.g. Michalski, Baskin and Spackman, 1982; Zhuravlev and Gurevitch, 1989; Michalski et al., 1992; Van Mechelen et al., 1993; Fayyad et al., 1996; Evangelos and Han, 1996).

5 Inductive Learning and Conceptual Clustering Overview.

We can consider two main learning tasks in Machine Learning:

• Inductive Learning: human beings create patterns in an attempt to understand their environment; this process is called inductive learning [34]. In the learning process, humans and animals (cognitive systems) observe their environment and recognize similarities between the objects and events in it. They group similar objects into classes and make rules that predict the behavior of not-yet-classified items of those classes.

• Conceptual Clustering: a Machine Learning task defined by Michalski in 1980. A conceptual clustering system accepts a set of object descriptions (events, observations, facts) and produces a classification scheme over the observations. These systems do not require a "teacher" to preclassify objects, but use an evaluation function to discover classes with "good" conceptual descriptions.

The automation of the Inductive Learning and Conceptual Clustering processes has been extensively researched in Machine Learning (see [56]), an Artificial Intelligence research area, as we know.
Having now defined Inductive Learning and Conceptual Clustering, we can define two learning techniques, supervised learning and unsupervised learning, which are related to Inductive Learning and Conceptual Clustering respectively.

Supervised Learning. Techniques in supervised learning learn from the definitions of classes made by a teacher; the result is a summary of the training set, either as a set of descriptions of the teacher's classes or, in unsupervised learning, of newly discovered classes with their descriptions. There are several forms of representing the patterns that can be discovered by machine learning, and each one has techniques that can be used to infer the corresponding output structure from the data. These structures can take the form of decision trees and classification rules, which are the basic knowledge representation styles that many machine learning methods use. Other representation examples are complex varieties of rules, special forms of trees, and instance-based representations. Finally, some learning schemes generate clusters of instances.

Unsupervised Learning. In unsupervised learning, or learning from observation and discovery, the system has to find its own classes in a set of states, without any help from a teacher. In practice, the system has to find some clustering of the set of states S. The data mining system is supplied with objects, as in supervised learning, but now no classes are defined. The system has to observe the examples and recognize patterns (e.g. class descriptions) by itself. Hence, this form of learning is also called learning by observation and discovery (see [49]). The result of an unsupervised learning process is a set of class descriptions, one for each discovered class, that together cover all the objects in the environment. These descriptions form a high-level summary of those objects.
Holsheimer and Siebes believe that unsupervised learning is actually not different from supervised learning when only positive examples of a single class are provided. Thus, we search for a description that describes all objects in the environment. If different classes exist in the environment, this description will be composed of the descriptions of these newly discovered classes.

6 Clustering
In apprehending the world, men constantly employ three methods of organization, which pervade all of their thinking:
• the differentiation of experience into particular objects and their attributes;
• the distinction between whole objects and their parts; and
• the formation of, and distinction between, different classes of objects.
Classically, the third task is defined as a clustering problem, i.e. the problem of identifying the natural distinguishable groups of similar objects in a set.
As said before, humans have been doing this almost from the beginning. And at the beginning, finding the clusters of a group of objects was almost an art, with an important dose of common sense from the author himself. In the 18th century, clustering problems became important in the context of Biology and Zoology, in order to organize the species of living beings. The most famous work along this line was the clustering of living beings made by Linnaeus, still valid today.
In 1757, for the first time, Adanson established the principles of an objective and systematic way of making clusters from a set of objects. The method was based on measuring distances between objects; the most similar objects were clustered together, and the process was iterated on the classes. Finally, a hierarchy among the classes was produced. The most common representation of such a clustering process is the hierarchical tree, also called a dendrogram in the statistical context. However, using these techniques was tedious, because of the high number of calculations required.
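
A minimal sketch of this agglomerative principle (our illustration on invented 2D points, using single-link merging as one common choice of between-class distance) could be:

import numpy as np

def single_link_clustering(points):
    """Adanson-style agglomeration: repeatedly merge the two closest
    classes; the merge history is exactly a dendrogram."""
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        # Distance between classes = smallest pairwise distance of members.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return history

pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
for left, right, dist in single_link_clustering(pts):
    print(f"merge {left} + {right} at distance {dist:.2f}")

The nested loops also make plain where the high number of calculations comes from: the naive algorithm re-examines many pairwise distances at every merge.
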
When computers became powerful, clustering techniques received a major boost. In 1963, Sokal and Sneath presented the first modern formulation of numerical taxonomy. The basic algorithm was that of Adanson, but many ways of measuring the distance between objects (Euclidean, χ2, inertias, ...) and of creating the classes were studied. Other families of clustering methods appeared:

• partition methods, in which no dendrogram is formed (dynamic clouds, Diday [12]; k-means, MacQueen [48]; agglomerative, Volle [70]; ...), of which a minimal k-means sketch is given below;

• pyramidal clustering, in which the tree is not binary anymore (Diday and Brito [13]);

• additive trees, based on graphs (Roux [66]);

but all of them shared some common characteristics: the use of distances between individuals, with the metric structure this implies, and the use of numerical variables to describe the objects.
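
As announced in the list above, here is a minimal k-means sketch (our illustration on invented data; this is the common batch formulation, whereas MacQueen's original version updates the centers incrementally):

import numpy as np

def k_means(points, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each object to its nearest center: a partition, no dendrogram.
        labels = np.argmin(
            np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2),
            axis=1)
        # Move each center to the mean of its class
        # (empty classes are not handled in this sketch).
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

pts = np.array([[0.0, 0.1], [0.2, 0.0], [4.9, 5.0], [5.1, 5.2]])
labels, centers = k_means(pts, k=2)
print(labels, centers)
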
However, it is clear that in certain applications some non-numerical variables are really relevant. The introduction of categorical variables into clustering required the use of special distances (χ2, Benzécri [3], ...) and aroused discussion on the interpretation of the results.
On the other hand, from the AI point of view, research in learning also included the design of algorithms for identifying the clusters of a given set of objects, as one of the intelligent human tasks to be performed by machines. In that context, statistical techniques were considered poor: on the one hand, because qualitative variables (categorical ones, in statistical terminology) were mainly used in that field; on the other hand, because the main goal for AI was not only to identify the classes, but to be able to understand the process used for doing so. In that sense, statistical methods, based on mathematical concepts (i.e. metrics), were not understandable enough. Using these arguments, Michalski presented conceptual clustering in 1983, in which logic concepts are associated with classes and generalized (on the basis of logic operations) in order to include the maximum number of observed entities (or objects) and the minimum number of non-observed ones.
Conceptual clustering is a completely new approach to the clustering problem. The basis is not mathematics anymore, but logic. The elements to deal with are not measures on the objects anymore, but qualitative values and logic concepts. The goal is not only the discovery of classes, but being able to understand the clustering process which produced them, and also being able to get conceptual descriptions of the produced classes.
On the basis of this algorithm, some other methods were proposed: COBWEB (Fisher [23], 1987) performs an incremental conceptual clustering based on a utility measure related to classical information measures (Gluck and Corter's category utility).
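
For reference, Gluck and Corter's category utility of a partition {C_1, ..., C_K} of objects described by attributes A_i taking values v_ij is usually written as follows (standard formulation, quoted from the general literature rather than from this paper):

CU = (1/K) Σ_k P(C_k) [ Σ_i Σ_j P(A_i = v_ij | C_k)^2 - Σ_i Σ_j P(A_i = v_ij)^2 ]

that is, the expected increase in the number of attribute values that can be correctly guessed once the class is known, averaged over the K classes.
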
AutoClass (Cheeseman and Stutz, 1995) is a comprehensive Bayesian clustering scheme that uses the finite mixture model, with prior distributions on all parameters.
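
The finite mixture model underlying such schemes assumes that each object is drawn from one of K class distributions (standard form, given here for orientation):

P(x) = Σ_k π_k P(x | θ_k), with Σ_k π_k = 1

where π_k is the proportion of class k and θ_k are its parameters; in AutoClass, prior distributions are placed on all of them.
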
As a matter of fact, two main families of algorithms can be distinguished. The statistical ones were originally designed for numerical variables, and are based on distances between objects and on the properties of metric spaces (or similar ones); the AI ones were originally designed for symbolic management (categorical variables), and are based on logical conceptual descriptions, generalization of concepts, and quality measures related to information gain. Both deal with the same problem; both are used to identify the natural classes of a given group of objects when no prior knowledge is available. But, as can be seen, the nature of these methods is really different.

7 Classifying
Classifying is the problem of having a set of well-known classes and assigning the corresponding class label to a given unclassified object. Obviously, this is different from a clustering problem, since in clustering the structure of the target domain is unknown and must be discovered. In classification, the structure of the target domain is known: the existing classes are known, and the goal is to characterize the classes in order to decide the class of a new object.
Classification typically involves two steps: first the system is trained on a set of data, and then it is used to classify a new set of unclassified cases.

From a statistical point of view, this is solved using discriminant analysis techniques, introduced by Fisher and Mahalanobis in 1936. The input of the algorithm is a set of examples (i.e. objects whose class is already known). The class of each object is considered the response variable. The problem reduces to finding the linear combination of variables which best fits the response variable, that is, to finding the linear discriminant function. Once this function is found, given the values of a new object, identifying its class reduces to using the discriminant function to determine the class number. The discriminant function is found by maximizing the between-class inertia and minimizing the within-class inertia. Of course, the solution is found within the algebraic paradigm, by finding the eigenvectors of certain matrices. The main problem of these techniques arises in domains where a non-linear discriminant function is needed.
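
In the two-class case, this criterion is commonly written as follows (standard textbook form, added here for completeness): the discriminant direction w maximizes

J(w) = (w' S_B w) / (w' S_W w)

where S_B and S_W are the between-class and within-class scatter (inertia) matrices and w' denotes the transpose of w; the maximizers are eigenvectors of S_W^{-1} S_B, which is the algebraic eigenvector problem mentioned above.
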
Neural nets seem to be able to find non-linear discriminant functions, using a big set of objects described exclusively by numerical variables. However, they act as black boxes, and they do not provide clear conceptual interpretations of the classes.
From the AI point of view, classifying problems are included in supervised learning, since there is a training set where the classes are known in advance, and it is used to learn how to label new objects.
Decision trees approach the problem of learning from a set of instances. They were introduced by Quinlan with the ID3 algorithm in 1986. The main idea is to build a decision tree: the nodes in a decision tree involve tests (either comparing a particular attribute with a constant, comparing a pair of attributes, or evaluating some function of several attributes); leaf nodes represent classes which include all the instances that reach the leaf, or represent a probability distribution over all possible classifications. To find the decision tree, ID3 searches from simple to complex hypotheses to test in a node, until one consistent with the data is found; consistency is evaluated by a measure related to entropy. To classify a new instance, it is routed down the tree according to the values tested in successive nodes, and when a leaf is reached the instance is classified into the class represented by that leaf. The great advantage of decision trees is that the meaning of the classes is clear from the sequence of tests evaluated from the root of the tree to the leaf which represents the class. The main problem is that when a large number of variables is considered, the tree becomes too big, and heuristic criteria are needed to prune it and to guarantee enough objects in the leaves for them to represent real classes.
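
The entropy-related measure used by ID3 to evaluate a candidate test is the information gain of an attribute. A minimal sketch (our illustration, on an invented categorical dataset) is:

from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum p*log2(p) over the class proportions in S.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v):
    # how much knowing attribute attr reduces the class entropy.
    n = len(labels)
    gain = entropy(labels)
    for v in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Invented toy data: ID3 would pick the attribute with the highest gain
# as the test of the current node, then recurse on each branch.
rows = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rainy", "windy": "yes"},
        {"outlook": "rainy", "windy": "no"}]
labels = ["play", "play", "stay", "stay"]
for a in ("outlook", "windy"):
    print(a, information_gain(rows, labels, a))
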

The excessive growth of a decision tree is thus an important disadvantage, and there is some research on methods for converting decision trees into other representations (see [62]).
In 1993, C4.5 appeared, a modification of ID3. It starts with large sets of cases belonging to known classes. The cases, described by any mixture of categorical and numeric properties, are scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models, in the form of decision trees or sets of if-then rules, that can be used to classify new cases, with emphasis on making the models understandable as well as accurate. The J48 classifier, from 1999, is a decision tree learner whose performance is typically estimated by ten-fold cross-validation. The ID3 family of algorithms (ID3, C4.5, J48) infers decision trees by growing them from the root downward, greedily selecting the next best attribute for each new decision branch added to the tree.
Bayesian methods provide one of the oldest ways to perform supervised classification. A Bayesian classifier is trained by estimating, from the database, the conditional probability distribution of each attribute given the class label.
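
A minimal sketch of this training step and of the corresponding prediction (our illustration; it uses the naive Bayes variant, which further assumes the attributes independent given the class, on an invented complete database and without smoothing) is:

from collections import Counter, defaultdict

def train(rows, labels):
    # Estimate P(class) and P(attribute = value | class) by counting.
    class_counts = Counter(labels)
    cond = defaultdict(Counter)  # (class, attribute) -> Counter over values
    for row, c in zip(rows, labels):
        for attr, value in row.items():
            cond[(c, attr)][value] += 1
    return class_counts, cond

def predict(row, class_counts, cond):
    # Pick the class maximizing P(c) * prod_a P(row[a] | c).
    n = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n
        for attr, value in row.items():
            score *= cond[(c, attr)][value] / cc  # no smoothing in this sketch
        if score > best_score:
            best, best_score = c, score
    return best

rows = [{"fever": "yes", "cough": "yes"}, {"fever": "yes", "cough": "no"},
        {"fever": "no", "cough": "no"}, {"fever": "no", "cough": "yes"}]
labels = ["flu", "flu", "healthy", "healthy"]
model = train(rows, labels)
print(predict({"fever": "yes", "cough": "yes"}, *model))
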
Unfortunately, the learning efficiency, so precious to this success, is lost when the database is not complete, that is, when it records some entries as unknown. In this case, the exact estimate of each conditional probability distribution required to define the classifier is a mixture of the estimates that can be computed from each completed database, generated by the combination of all possible values of the missing data.
The Robust Bayesian Classifier (RoC) provides several measures; one of them is the coverage measure, which is the proportion of cases that the classifier is able to assign to a class, computed as the ratio between the number of classified cases and the total number of cases in the database.

8 Knowledge Discovery in Databases and Data Mining.
Why did the AI and Machine Learning community prefer the name Knowledge Discovery in Databases over Data Mining? The name data mining, which was already used in the database community, seemed unsexy, and besides, statisticians used data mining as a pejorative term to criticize the activity. Mining is unglamorous, and there is no indication of what we are mining for. Knowledge mining and knowledge extraction did not seem much better, and Database Mining™ was trademarked by HNC for their Database Mining Workstation™. So we have Knowledge Discovery in Databases, which emphasizes the discovery aspect and the focus of discovery on knowledge. The term "Knowledge Discovery in Databases" (KDD for short) became popular in the AI and Machine Learning community. However, the term data mining became much more popular in the business press. As of November 1999, a search on www.altavista.com gives about 100,000 pages for data mining, compared to 18,000 for knowledge discovery. Currently, both terms are used essentially as synonyms, as in the name of the main journal of the field, Data Mining and Knowledge Discovery (Kluwer). Sometimes knowledge discovery process is used to describe the overall process, including all the data preparation and postprocessing, while data mining refers to the step of applying the algorithms to the clean data [19].
Around 1989, topics such as fuzzy rules, learning from relational (structured) data, integrated systems, privacy, etc., attracted the attention of the AI and Machine Learning community. Expert database systems and the discovery of fuzzy rules, for example, which seemed like good ideas at the time, have disappeared from the current research lexicon because they were examples of technology without a clear application. Some important areas turned out to be much harder than the Machine Learning community thought in 1989. Learning from structured data is still very difficult, and the current best methods, from the Inductive Logic Programming community [http://www.cs.bris.ac.uk/~ILPnet2/], are still too slow to be used on large databases. The interestingness of discovered patterns is still a hard problem, and it still requires a significant amount of domain knowledge. CYC [43], which held a lot of promise in 1989, did not produce the expected results. On the other hand, we now have the Internet, which is the largest repository of general knowledge, although still with a very imperfect query system.
However, great progress was achieved in faster hardware and bigger disks, enabling data miners to deal with much larger problems. Of course, the major development in computers over the last 10 years is the revolution brought by the Internet. It has shifted the attention of data miners to problems of e-commerce and Internet personalization. It also brought much more attention to text mining, and resulted in a number of good text mining systems. Another major advance was a holistic understanding of the entire knowledge discovery process [4], which encompasses many steps, from data acquisition, cleaning and preprocessing, to the discovery step, to the post-processing of the results and their integration into operational systems.

8.1 From General Tools to Domain-Specific Solutions.


In 1989 there were only a few data mining tools, produced by researchers to solve a single task, such as the C4.5 decision tree [61], the SNNS neural network, or parallel-coordinate visualization [36]. These tools were difficult to use and required significant data preparation.
The second generation of data mining systems, called suites, was developed by data mining vendors, starting around 1995. These tools were driven by the realization that the knowledge discovery process requires multiple types of data analysis, and that most of the effort is spent in data cleaning and preprocessing. Suites such as SPSS Clementine, SGI MineSet, IBM Intelligent Miner, or SAS Enterprise Miner allowed the user to perform several discovery tasks (usually classification, clustering, and visualization) and also supported data transformation and visualization. An important advance, pioneered by Clementine, was a GUI which allowed users to build their knowledge discovery process visually.
By 1999, there were over 200 tools available for many different tasks (see http://www.kdnuggets.com/software/). However, even the best data mining tools addressed only a part of the overall business problem. Data still had to be extracted from legacy databases, cleaned and preprocessed, and model results had to be delivered to the right channels and, most importantly, integrated with the specific application or business logic. The successful development of such applications in areas like direct marketing, telecom, and fraud detection led to the emergence of data-mining-based vertical solutions. Examples of such systems include HNC Falcon for credit card fraud detection, IBM Advanced Scout for basketball game analysis, and the NASD Regulation Advanced-Detection System [37].

8.2 New Trends and Some Expectations for the Future.


It is clear that, nowadays, new technologies have significantly increased our capacity for producing, collecting and storing data. Enormous quantities of data are available to be analyzed in a short time. This is an important challenge for Statistics, Artificial Intelligence, Information Systems, data visualization, ...
In this document it has been shown that statistics-like techniques are mainly useful for numerical variables and produce results based on algebra, while AI-like techniques are mainly useful for qualitative variables and produce results based on logic.
Describing the structure of, or obtaining knowledge from, big sets of data is known to be a difficult task. The combination of data analysis techniques (clustering among them), inductive learning (knowledge-based systems), database management and multidimensional graphical representation should produce benefits along this line.
Several software packages are starting to provide data mining tools to support these situations (Clementine, Intelligent Miner, ..., some of the most famous nowadays). They mainly present a juxtaposition of existing techniques, allowing the comparison of results and the selection of the best method for each case. An interesting didactic effort is WEKA (Waikato Environment for Knowledge Analysis). This system provides a uniform interface, written in Java, to many different learning algorithms. It was introduced by Witten and Frank in their book Data Mining in 1999 (see [72]).
However, in real applications it is usual to work with very complex domains [29], such as mental disorders, sea sponges [32], thyroid dysfunctions [27], ..., where databases with both qualitative and quantitative variables appear, and where the expert(s) have some prior knowledge (usually partial) of the structure of the domain, which is hardly taken into account by clustering methods and is difficult to include in a complete knowledge base. Facing the automated knowledge discovery of ill-structured domains raises problems from both the machine learning and the clustering points of view. For example, a knowledge base cannot be constructed, since the domains are too complex and high quantities of implicit knowledge are managed. Classical learning algorithms tend to be NP-complete and cannot work with such big data matrices. Statistical clustering algorithms need artificial transformations of the data to manage numerical and categorical variables simultaneously, and this produces problems in the interpretation of the results. In the real world, situations in between supervised and unsupervised learning are found, and no algorithm can deal with them properly.
Real cooperation between techniques is urgently needed to improve the existing methods and to explore new possibilities for facing such difficult analyses. Some work has already been done along this line. As an example, clustering based on rules is a methodology developed in [26] with the aim of finding the structure of ill-structured domains. In our proposal, a cooperative combination of clustering and inductive learning is focused on the problem of finding and interpreting special patterns (or concepts) in large databases, in order to extract useful knowledge for representing real-world domains. It gives better performance than traditional clustering algorithms or the knowledge-based systems approach in analyzing ill-structured domains. See [26] for details.
The main idea is to take advantage of the partial prior knowledge that the expert can make explicit, and to use a statistical clustering algorithm to analyze those parts of the domain for which no prior knowledge is provided. At the end, a unique dendrogram is built with all the objects, and the results tend to guarantee the meaningfulness of the resulting classes in the expert's eyes.
This is only one example, but in our opinion this should be the trend. The juxtaposition of different kinds of methods will not help us to face new situations. Cooperative combinations of those methods are likely to be much more fruitful for building a new generation of real Knowledge Discovery and Data Mining techniques, ready to face the ever bigger and more complex databases produced in the Information Society.
Finally, here are some expectations about the future, provided by the AI and Machine Learning community and by us. Over the next 10 years, we expect:

• continuing progress in faster CPUs and bigger disks, along with ubiquitous and wireless net connectivity;

• standards to appear for different parts of the knowledge discovery process, greatly facilitating industry growth; there are already proposed standards like CRISP for the data mining process, PMML for predictive model exchange, and Microsoft OLE DB;

• significant applications to appear in e-commerce, especially with real-time personalization, together with significant use of intelligent agents;

• great progress in pharmaceuticals and new drugs, enabled by knowledge discovery and bioinformatics;

• tighter integration of knowledge discovery modules with database systems, so that most database systems will include a set of discovery operations;

• and that the data mining industry will overcome the hype stage and merge with the database industry.

References
[1] Adrians P. and Zantinge D.: Data Mining. Addison-Wesley, 1996.

[2] Aha D., Kibler D. and Albert M.: Instance-based learning algorithms. Machine Learning, 6. 1991. 37-66.

[3] Benzécri J.P.: L'analyse des données. Tome 1: La Taxinomie, Tome 2: L'analyse des correspondances. 1a ed., 1973. Dunod, Paris, France. 1980.

[4] Brachman, R. and Anand, T.: The Process of Knowledge Discovery in Databases: A Human-Centered Approach. In "Advances in Knowledge Discovery and Data Mining", ed. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, AAAI/MIT Press. 1996.

[5] Brooks, R.A.: Intelligence without Reason. IJCAI-91. 1991.

[6] Cawsey, A.: Databases and Artificial Intelligence 3. Artificial Intelligence Segment. (1994) http://www.cee.hw.ac.uk/~alison/ai3notes/all.html

[7] Clark P. and Boswell R.: Rule induction with CN2: some recent improvements. In Proc. Fifth European Working Session on Learning, Springer. 1991. 151-163.

[8] Clark P. and Niblett T.: The CN2 induction algorithm. Machine Learning 3(4). 1989. 261-283.

[9] Cover T.M. and Hart P.E.: Nearest Neighbor pattern classification. IEEE Transactions on Information Theory, 13. 1968. 21-27.

[10] Dasarathy B.V.: Nearest Neighbor (NN) Norms: NN pattern classification techniques. IEEE Computer Society Press, Los Alamitos, CA, US. 1990.

[11] De Raedt L. and Dehaspe L.: Clausal Discovery. Machine Learning, 26. 1997. 99-146.

[12] Diday, E.: La méthode des nuées dynamiques. Stat. App. 19, n. 2. 19-34.

[13] Diday, E., Brito, P. and Mfoumou, M.: Modelling probabilistic data by conceptual pyramidal clustering. Proc. of 4th Int'l Work. on AI&Stats. Florida, US. 1993. 213-218.

[14] Dietterich T.G. and Michalski R.S.: A comparative review of selected methods for learning from examples. In Michalski et al. (see [50]). 41-81.

[15] Droesbeke J.J.: Histoire de la Statistique. Université Libre de Bruxelles. Presses Universitaires de France, Paris, France. 1990.

[16] Dudani S.A.: The distance-weighted k-nearest neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics, 6(4). 1975. 325-327.

[17] Dzeroski S. and Flach P.: Network of Excellence in Inductive Logic Programming ILPnet2. Funded by the European Commission under contract INCO 977102. (1997) http://www.cs.bris.ac.uk/~ILPnet2/

[18] Everitt, B.: Cluster analysis. Heinemann Ed. Books Ltd., London. 1981.

[19] Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P.: From Data Mining to Knowledge Discovery in Databases (a survey). AI Magazine, 17(3): Fall 1996. 1996. 37-54.

[20] Fayyad, U.: From Data Mining to Knowledge Discovery: An overview. Advances in KD and DM, Fayyad, AAAI/MIT. 1996. ISBN 0-262-56097-6.

[21] Fisher D.: Knowledge Acquisition Via Incremental Conceptual Clustering. Originally published in Machine Learning 2, Kluwer Academic Publishers, Boston. 1987. 139-172.

[22] Fisher, D. and Pazzani, M.: Concept formation: Knowledge and experience in unsupervised learning. Morgan Kaufmann, San Mateo, CA, US. 1991.

[23] Fisher D.H. and Schlimmer J.C.: Models of Incremental Concept Learning: A coupled research proposal. Computer Science Department, Vanderbilt University. Nashville, TN, USA. (1997). 10. http://cswww.vuse.vanderbilt.edu/~dfisher/

[24] Fix E. and Hodges J.L.: Discriminatory analysis. Nonparametric discrimination: Consistency properties. Technical Report 4, US Air Force School of Aviation Medicine. TX, US. 1957.

[25] French S.: Decision Theory. Ellis Horwood, Chichester. 1986.

[26] Gibert, K.: L'ús de la Informació Simbòlica en l'Automatització del Tractament Estadístic de Dominis poc Estructurats. PhD thesis, Statistics and Operations Research. UPC, Barcelona, Spain. 1994.

[27] Gibert, K. and Sonicki, Z.: Classification Based on Rules and Thyroids Dysfunctions. Applied Stochastic Models in Business and Industry, 15. 1999. 319-324. John Wiley & Sons.

[28] Gibert, K., Aluja, T. and Cortés, C.U.: Knowledge Discovery with Clustering Based on Rules. Interpreting Results. Principles of Data Mining and Knowledge Discovery, J.M. Quafafou Eds. Lecture Notes in Artificial Intelligence, 1510, Springer-Verlag, Berlin. 1998. 83-92.

[29] Gibert, K. and Cortés, C.U.: Clustering based on rules and knowledge discovery in ill-structured domains. Computación y Sistemas, México. 1998.

[30] Gibert, K. and Cortés, C.U.: Weighing quantitative and qualitative variables in clustering methods. Mathware and Soft Computing, Vol. IV, n. 3. 1997. 251-266. Secció de Matemàtiques i Informàtica, Escola Tècnica Superior d'Arquitectura, Universitat Politècnica de Catalunya.

[31] Gibert, K. and Cortés, C.U.: Combining a knowledge-based system and a clustering method for the construction of models in ill-structured domains. In P. Cheeseman and R.W. Oldford (eds), Selecting Models from Data: Artificial Intelligence and Statistics IV, Lect. Notes in Stats. 89. Springer-Verlag, New York, NY, US. 1994. 351-360.

[32] Gibert, K. and Cortés, C.U.: KLASS, una herramienta estadística para la creación de prototipos en dominios poco estructurados. Proc. IBERAMIA-92. Noriega Eds., México. 1992. 483-497.

[33] Hanson S.: Conceptual Clustering and Categorization. Machine Learning (An Artificial Intelligence Approach), Volume III, Morgan Kaufmann, San Mateo. 1990. 235-268.

[34] Holland J.H., Holyoak K.J., Nisbett R.E. and Thagard P.R.: Induction: processes of inference, learning and discovery. Computational models of cognition and perception. MIT Press, Cambridge, MA, US. 1986.

[35] Holsheimer M. and Siebes A.P.J.M.: Data Mining: the search for knowledge in databases. Department of Algorithmics and Architecture, Centrum voor Wiskunde en Informatica, Amsterdam, The Netherlands. Report CS-R9406, ISSN 0169-118X. (1994) http://www.cwi.nl/static/publications/reports/reports.html

[36] Inselberg, A.: The plane with parallel coordinates. The Visual Computer, 1. 1985. 69-91.

[37] Kirkland, J.: The NASD Regulation Advanced-Detection System (ADS). AI Magazine 20(1): Spring 1999. 1999. 55-67.

[38] Kononenko I.: Semi-naive Bayesian classifier. In Kodratoff, Y. (ed.), Proc. European Working Session on Learning 91, Porto, 1991. Springer. 1991. 206-219.

[39] Kononenko I.: Inductive and Bayesian learning in medical diagnosis. Applied Artificial Intelligence, 7. 1993. 317-337.

[40] Krose, B. and Van der Smagt, P.: An Introduction to Neural Networks. Fifth edn, University of Amsterdam. (1993) 13, 57-73. ftp://ftp.fwi.uva.nl//pub/computer-systems/...neuro-intro.ps.gz

[41] Lavrač N. and Džeroski S.: Inductive Logic Programming: Techniques and Applications. Ellis Horwood, Chichester. 1994.

[42] Lebart, L.: Traitement statistique des données. Dunod, Paris.

[43] Lenat, D.B.: Cyc: A Large-Scale Investment in Knowledge Infrastructure. Communications of the ACM 38, no. 11. 1995.

[44] López de Mántaras R.: A distance-based attribute selection measure for decision tree induction. Machine Learning 6. 81-92.

[45] López de Mántaras R.: Trends in Knowledge Engineering. Proc. TECCOMP-91. México. 1991. 20-21.

[46] López de Mántaras R. and Crespo J.J.: El problema de la selección de atributos en el aprendizaje inductivo: nueva propuesta y estudio experimental. Proc. IBERAMIA-90. Limusa-Noriega, México. 1990. 259-271.

[47] Lucas P.J.F.: Logic engineering in medicine. The Knowledge Engineering Review 10(2). Cambridge University Press. 1995. 153-179.

[48] MacQueen J.: Some Methods for Classification and Analysis of Multivariate Observations. Proc. 5th Berkeley Symp. Berkeley, CA, US. 1965. 281-297.

[49] Michalski, R.S.: A theory and methodology of inductive learning. In Michalski et al. (see [50]). 83-134.

[50] Michalski, R.S., Carbonell, J.G. and Mitchell, T.M.: Machine Learning, an Artificial Intelligence approach. v2. Morgan Kaufmann, San Mateo, CA, US. 1986.

[51] Michalski R.S. and Kaufman K.A.: Data Mining and Knowledge Discovery: A Review of Issues and a Multistrategy Approach. Chapter 2 of Machine Learning and Data Mining: Methods and Applications. John Wiley and Sons. (1997).

[52] Michalski R. and Stepp R.: Learning from Observation: Conceptual Clustering. In Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, San Mateo. 1983. 331-363.

[53] Michie D., Spiegelhalter D.J. and Taylor C.C.: Machine learning, neural and statistical classification. Ellis Horwood. 1994.

[54] Minsky, M.: Steps Toward Artificial Intelligence. Proc. IRE 49, Jan. 1961. 8-30.

[55] Minsky, M.: A Selected Descriptor-Indexed Bibliography to the Literature on Artificial Intelligence. In Feigenbaum and Feldman (eds.), Computers and Thought. McGraw-Hill, New York, NY. 1963. 453-523.

[56] Mitchell T.M.: Machine Learning. Chapter 1. McGraw-Hill. 1997.

[57] Muggleton S.: Inverse entailment and Progol. New Generation Computing, special issue on inductive logic programming, 13(3-4). 1995. 245-286.

[58] Negrete J.: El Nopal de la Inteligencia Artificial. Makatziná. Lecturas en IA. Revista aperiódica. Número III. Veracruz, México. 1994. http://www.mia.uv.mx/~jnegrete/pub/NuevosMakatzina/Mk3.html

[59] Newell, A., Shaw, J.C. and Simon, H.: Empirical Explorations with the Logic Theory Machine. Western Joint Computer Conference 15, 1957. 218-329.

[60] Pearl J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, US. 1988.

[61] Quinlan J.R.: Induction of Decision Trees. Machine Learning, Volume 1, 1986. 81-106.

[62] Quinlan, J.R.: Generating production rules from decision trees. In Proceedings of the 10th International Joint Conference on Artificial Intelligence. Milan, Italy. 1987. 304-307.

[63] Quinlan, J.R.: Learning logical definitions from relations. Machine Learning, Volume 5(3). 1990. 239-266.

[64] Quinlan, J.R.: C4.5: programs for machine learning. Morgan Kaufmann, San Mateo, CA, US. 1993.

[65] Richeldi M. and Rossotto M.: Class-driven statistical discretization of continuous attributes. In Lavrač, N. and Wrobel, S. (eds.), Machine Learning: Proc. ECML-95, Springer. 1995. 335-342.

[66] Roux M.: Algorithmes de classification. Masson, Paris, France. 1985.

[67] Rumelhart D.E. and McClelland J.L.: Parallel Distributed Processing, Vol 1: Foundations. MIT Press, Cambridge, MA, US. 1986.

[68] Sokal, R.R. and Sneath, P.H.A.: Principles of numerical taxonomy. W.H. Freeman & Co., San Francisco, CA, US. 1963.

[69] Tirri H., Kontkanen P. and Myllymäki P.: A Bayesian Framework for Case-Based Reasoning. Proceedings of the 3rd European Workshop on Case-Based Reasoning. Switzerland. 1996.

[70] Volle M.: Analyse des données. Economica, Paris, France. 1985.

[71] Weiss S.M. and Kulikowski C.A.: Computer Systems that Learn. Morgan Kaufmann, San Mateo, CA, US. 1991.

[72] Witten I.H. and Frank E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers. (1999) 265.
