
Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations
by Ian H. Witten and Eibe Frank

Morgan Kaufmann Publishers, 2000
416 pages, Paper, $49.95
ISBN 1-55860-552-5

Review by:
James Geller, New Jersey Institute of Technology
CS Department, 323 Dr. King Blvd., Newark, NJ 07102
geller@oak.njit.edu
http://web.njit.edu/~geller/

Story of the book

Witten and Frank's textbook was one of two books that I used for a data mining class in the Fall of 2001. The book covers all major methods of data mining that produce a knowledge representation as output. Knowledge representation is hereby understood as a representation that can be studied, understood, and interpreted by human beings, at least in principle. Thus, neural networks and genetic algorithms are excluded from the topics of this textbook. We need to say "can be understood in principle" because a large decision tree or a large rule set may be as hard to interpret as a neural network.

The book first develops the basic machine learning and data mining methods. These include decision trees, classification and association rules, support vector machines, instance-based learning, Naive Bayes classifiers, clustering, and numeric prediction based on linear regression, regression trees, and model trees. It then goes deeper into evaluation and implementation issues. Next it moves on to deeper coverage of issues such as attribute selection, discretization, data cleansing, and combinations of multiple models (bagging, boosting, and stacking). The final chapter deals with advanced topics such as visual machine learning, text mining, and Web mining.

A walk through the contents

The greatest strength of this Data Mining book lies outside of the book itself. All the algorithms described in this book are implemented and freely available through the WEKA (Waikato Environment for Knowledge Analysis) Web site (www.cs.waikato.ac.nz/ml/weka). Chapter 8 of the book is a tutorial to the implemented algorithms. The integration between the book and the Web site is excellent, and the Web site is alive, thriving and growing. Thus, the number of data mining algorithms available on the Web site goes far beyond what is described in the book. Indeed, even Neural Networks have been added to the Web site since the book was first published. While many books offer an associated Web site by now, the close linkage between book and Web site and the rapid growth of the Web site are highly commendable.

Another pleasant feature of the WEKA implementation is that it is done in Java. This makes it possible to construct systems, based on Java, that capitalize on the other strengths of Java, such as access to relational databases through JDBC and easy access to Web pages from within Java programs.

Target audience

The book is written for academics and practitioners and I believe it can be well understood, even by undergraduate students.

76 SIGMOD Record, Vol. 31, No. 1, March 2002


In fact, it is probably the most accessible survey of data mining in print, without sacrificing too much of precision and rigor. The book is written in a highly redundant style, which I would like to describe as an exercise in iterative deepening. Basic concepts are repeated in several chapters, but covered to a deeper level in the later chapters. This should make it easy for students to keep reading it, without having to refer back to earlier chapters at every step of the way. On the other hand, for a person that is already familiar with the basics of data mining, this makes boring reading at some places. However, I do not recommend a streamlining of the book. Instead, I recommend that readers with some knowledge of the topic may skip paragraphs that sound familiar without any guilty feelings.

Reviewer's appreciation

The book goes to great lengths to avoid "formula shock". Formulas are developed step-by-step and well explained. Only absolutely necessary formulas are included. In many cases, where the derivation of a complex result is irrelevant to the actual data mining issues, the authors defer to statistics textbooks. While I am greatly in favor of both these approaches in writing textbooks, I feel that they have gone too far at a few places. At a number of places, the authors avoid introducing "one more letter" to keep the text readable. However, the price they pay for that is that many of their formulas have no equal signs. Thus, a sentence is terminated with a colon and followed with a formula, which is presumably equal to the quantity described by the sentence. This is done on many pages, e.g., 132-135, 137, 196, 207, 222, etc. Not in my wildest dreams would I have thought that I could ever criticize a book author for having too few formulas and too few variables. But this is exactly what I need to do here. While I do not recommend eliminating the previously mentioned redundancy of description, I do recommend for the next edition (which this book will undoubtedly have) to strengthen the formulas, without necessarily adding new ones.

At a few places, the book could also be improved by adding more explanations to figures. Figure 3.6 is a prime example for this issue. I found myself spending time verifying that instance counts in two subfigures truly add to the same total (of 209). They do. The reader could be spared this effort by a better caption or a better description in the body of the text. Similarly, the Apriori algorithm is introduced in a figure, but only in the "Further Reading" subsection (following much later) is the name of the algorithm mentioned. A better figure caption would help the scholarly advancement of students who might not take the "Further Reading" section that seriously.

In America we say "Actions speak louder than words". Thus, instead of summarizing the book I will describe some actions that I intend to take (or that I am already taking).
(1) I am using WEKA for my research.
(2) If I teach the same course again, I will use Witten and Frank's book again.
(3) If the book appears in a second edition, I will acquire it.
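
The remark about formulas without equal signs can be made concrete with a small illustration. The example below is hypothetical and is not quoted from the book: it uses the standard entropy formula, and the symbol H is a name introduced here only for illustration.

```latex
% Style criticized in the review: the sentence ends with a colon and the
% formula that follows is left unnamed.
% "The entropy of a class distribution with proportions p_1, ..., p_n is:"
\[
  -\sum_{i=1}^{n} p_i \log_2 p_i
\]

% The same statement with "one more letter" (H is an assumed name):
\[
  H = -\sum_{i=1}^{n} p_i \log_2 p_i
\]
```

In the second form the quantity has a name, so later text can refer to H directly instead of repeating the describing sentence.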

