[Figure: scatterplot with y-axis "Log of Brain Weight" (ticks 0 to 3) and
x-axis ticks −1 to 5; the solid point is labeled Kangaroo.]
Figure 1: Scatterplot of the log of brain weight plotted against the log of body
weight for 28 animal species.
2 Scatterplots
As noted, a scatterplot is simply a plot of one sequence of numbers against
another. An example is shown in Fig. 1, which plots the logarithm of brain
weights for 28 animal species against the logarithm of their corresponding body
weights. That is, each point on the plot represents one of the 28 animals, with
the height of the point above the horizontal axis corresponding to the log of the
animal’s brain weight in grams, and the distance of the point from the vertical
axis representing the log of the animal’s body weight in kilograms. As a specific
example, the point represented as a solid circle corresponds to the numbers for
a kangaroo, with a body mass of 35 kilograms (so log10 body mass ≃ 1.544) and
a brain mass of 56 grams (so log10 brain mass ≃ 1.748). This dataset is one of
many distributed with the R statistical software package discussed further at
the end of this article (see Section 8; these numbers were derived from the
Animals data frame in the MASS package, distributed as part of the base R
installation).
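These logarithms are easy to verify. A quick Python check (the article's own
computations use R, discussed in Section 8, but any language will do; the
weights are the ones quoted above):

```python
import math

body_kg = 35.0    # kangaroo body mass from the text, in kilograms
brain_g = 56.0    # kangaroo brain mass from the text, in grams

print(round(math.log10(body_kg), 3))   # 1.544
print(round(math.log10(brain_g), 3))   # 1.748
```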
It is clear from this plot that brain mass tends to increase with increasing
body mass, with most of the points lying roughly along a straight line, except
for the three points lying off to the right. These points represent dinosaurs,
animals noted for their enormous bodies and tiny brains. One of these dinosaurs
is Diplodocus, whose brain was slightly smaller than that of the kangaroo but
whose body was 334 times heavier. According to one account [2, p. 335]:
3 Transformations
The difference between the informative plot in Fig. 1 and the non-informative
plot in Fig. 2 lies in the use of the log transformation applied to both variables.
Since the original data didn’t come to us in this representation, the key practical
question is, “how should we know to apply these transformations?” For the
specific case of log transformations, a common motivation is often a very wide
dynamic range in the observed data values: if all of the numbers are positive
and the largest one is several orders of magnitude larger than the smallest one,
the logarithmic transformation is often useful in giving us a more informative
view of the data. This condition certainly holds for the body-weight/brain-
weight example: the body weights range from 0.023 kilograms for the mouse to
87,000 kilograms for the Brachiosaurus, a difference of 6 orders of magnitude,
and the brain weights range from 0.4 grams (the mouse again) to 5,712 grams
for the African elephant, a difference of 4 orders of magnitude. More generally,
however, there are uncountably many possible transformations we could apply
to one or both variables, and the reality is that none of them may lead us
to an informative scatterplot.

[Figure 2: scatterplot of untransformed brain weight (y-axis "Untransformed
Brain Weight," 0 to 5000 grams) against untransformed body weight.]

Since looking for something that doesn’t
exist can waste a lot of time, what should we do? A simple, practical technique
that is often useful in telling us when such a transformation is or is not likely
to exist is described in Section 5, based on the correlation measures described
in Section 4. First, however, it is important to present a few key ideas about
how transformations work.
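The wide-dynamic-range condition described above is easy to check
programmatically. A small Python sketch (the function name is my own; the
numbers are the extremes quoted earlier in the text):

```python
import math

def orders_of_magnitude(values):
    """Number of decades (powers of 10) spanned by positive values."""
    if min(values) <= 0:
        raise ValueError("log transformations require positive values")
    return math.log10(max(values) / min(values))

body_kg = [0.023, 35.0, 87000.0]   # mouse, kangaroo, Brachiosaurus
brain_g = [0.4, 56.0, 5712.0]      # mouse, kangaroo, African elephant

# Roughly 6.6 and 4.2 decades: both wide enough to suggest a log scale
print(round(orders_of_magnitude(body_kg), 1),
      round(orders_of_magnitude(brain_g), 1))  # 6.6 4.2
```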
First, an important characteristic of the transformations under consideration
here is that they are invertible, meaning that if T is a transformation that takes
us from the variable x to the variable z (mathematically, written as z = T {x}),
there exists a well-defined inverse transformation T −1 that takes us back from
z to x (i.e., x = T −1 {z}). As a practical matter, invertibility means that the
transformation T is information-preserving, since we can always recover our
original data exactly from the transformed data.
The log transformation considered in the brain-weight/body-weight example
is invertible: if z = log10 x, then the inverse transformation is given by x = 10z .
It is important to note, however, that the log transformation is only well-defined
and invertible if x is a positive number.
The second characteristic of the data transformations considered here is that
they are order-preserving, meaning that if two values, say x1 and x2 , satisfy the
condition x1 > x2 in the original data, the transformed values, z1 and z2 ,
satisfy the condition z1 > z2 in the transformed data. In fact, this requirement
is related to invertibility, although I won’t go into the mathematical details
here. Suffice it to say that, mathematically, an order-preserving transformation
is strictly increasing, and any function that is strictly increasing and continuous
(i.e., a function whose graph doesn’t include any breaks or other discontinuous
kinks) is invertible. Again, the log transformation is a good example: the log
function is both strictly increasing and continuous, provided the original data
sequence to which it is applied consists of positive numbers.
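Both properties are easy to demonstrate for the log transformation. A
minimal Python sketch (illustrative only; the data values are the extremes
and the kangaroo values quoted earlier):

```python
import math

# Positive data values only, listed in increasing order
x = [0.023, 0.4, 35.0, 56.0, 87000.0]

# Invertibility: z = log10(x) is undone by x = 10**z,
# up to floating-point rounding
z = [math.log10(v) for v in x]
x_back = [10.0 ** v for v in z]
print(all(math.isclose(a, b) for a, b in zip(x, x_back)))  # True

# Order preservation: the transformed values keep the original ordering
print(z == sorted(z))  # True
```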
Essentially, this number gives a measure of the linear association between the
variables x and y that can assume any value between −1 and +1. The following
three special cases have particular practical importance:
1. if ρP (x, y) assumes its maximum possible value of +1, the variables satisfy
   a linear relationship with a positive slope, i.e., y = ax + b for some
   a > 0;
2. if ρP (x, y) assumes its minimum possible value of −1, the variables satisfy
   a linear relationship with a negative slope, i.e., y = ax + b for some
   a < 0;
2. if ρS (x, y) assumes its minimum possible value of −1, the variables are
   related exactly by an order-reversing (strictly decreasing) transformation,
   y = T {x}.
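Both correlation measures can be computed in a few lines. The helper
functions below are illustrative sketches (in practice one would call a
library routine such as R's cor()); note that ρS is simply ρP applied to the
ranks of the data, which is why any order-preserving transformation leaves
it unchanged:

```python
import math

def pearson(x, y):
    """Product-moment correlation rho_P(x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def ranks(x):
    """Rank of each value, 1 = smallest (assumes no ties)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation rho_S(x, y)."""
    return pearson(ranks(x), ranks(y))

# A strictly increasing but highly nonlinear relationship: y = 10^x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [10.0 ** v for v in x]

print(round(spearman(x, y), 6))  # 1.0: the monotone association is perfect
print(round(pearson(x, y), 2))   # roughly 0.76: rho_P understates it
```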
[Table 1: correlations between brain weight and body weight under each
combination of log transformations. Columns: Brain Weight Transformation,
Body Weight Transformation, Product-Moment Correlation, Spearman Rank
Correlation; the body of the table was not recovered.]
[Figure 3: four scatterplots of brain weight against body weight under the
four combinations of log transformations; the visible panel titles are "No
transformations" and "Log x Only."]
any relationship between these variables, and the product-moment correlation
value (ρP (x, y) = −0.005) is essentially zero. The fact that the Spearman rank
correlation value is substantially larger (ρS (x, y) = 0.719) suggests that it may
be useful to apply a transformation. The upper right plot shows the results
of applying the log transformation to the body weight data, corresponding to
the second row in Table 1. Here, the product-moment correlation value is sub-
stantially larger (ρP (x, y) = 0.405), giving stronger evidence for a relationship
between the variables. The visual evidence in support of this idea is somewhat
stronger than in the original untransformed plot, but not compelling, in part
because of the influence of the points associated with the largest body weights
and small brain weights.
The lower left plot in Fig. 3 shows the results of applying the log transfor-
mation to the brain weight data only, corresponding to the third row in Table 1.
Here, the product-moment correlation value is again quite small, consistent with
the fact that this plot shows very little evidence for a systematic relationship
between these variables. Note, however, that applying the log transformation
to the brain weight data and leaving the body weight data untransformed does
provide a strong suggestion that the animals with the largest body masses may
be anomalous. Finally, the lower right plot shows the same results presented
in Fig. 1, where the log transformation has been applied to both variables and
corresponding to the last row in Table 1. Here, the product-moment correla-
tion coefficient is actually larger than the Spearman rank correlation coefficient,
and this plot reveals the most compelling evidence in support of a relationship
between brain weight and body weight of any of the four.
Specifically, the transformations in this family are the power
transformations T {x} = x^λ for λ ≠ 0, together with T {x} = ln x for λ = 0,
where ln x denotes the natural logarithm (i.e., log to the base e ≃ 2.71828).
If we restrict x to positive values, all of these transformations are invertible—
hence information-preserving—and they are strictly increasing—and thus
order-preserving—if λ ≥ 0. Restricting consideration to this family of
transformations
gives us something more definite to work with than “some transformation,” but
it is still true that any finite value of λ—positive or negative—defines a po-
tentially useful transformation, and that’s too many to evaluate by trial and
error. What is commonly done in practice is to try a few selected special cases
spanning a broadly useful range of λ values, and this is especially appropriate
when looking for transformations to improve the visual appearance of a scat-
terplot. In fact, a slightly simplified strategy would be to apply the following
transformations to x, y, and both variables:
T {x} = ⁴√x, ³√x, √x, log10 x, x (i.e., no transformation), x², x³, and x⁴.
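This trial-and-error strategy is easy to mechanize: apply every pair of
candidate transformations and keep the one whose product-moment correlation
is largest in magnitude. The sketch below illustrates the idea (the function
name and example dataset are my own, not from the article; positive data are
assumed throughout):

```python
import math
from itertools import product

# The ladder of candidate transformations listed above
LADDER = {
    "x^(1/4)": lambda v: v ** 0.25,
    "x^(1/3)": lambda v: v ** (1.0 / 3.0),
    "sqrt(x)": math.sqrt,
    "log10(x)": math.log10,
    "x": lambda v: v,
    "x^2": lambda v: v ** 2,
    "x^3": lambda v: v ** 3,
    "x^4": lambda v: v ** 4,
}

def pearson(x, y):
    """Product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def best_transformation(x, y):
    """Try every pair of ladder transformations on positive data and
    return (x_name, y_name, rho_P) for the pair maximizing |rho_P|."""
    candidates = (
        (nx, ny, pearson([fx(v) for v in x], [fy(v) for v in y]))
        for (nx, fx), (ny, fy) in product(LADDER.items(), repeat=2)
    )
    return max(candidates, key=lambda t: abs(t[2]))

# Hypothetical example: y proportional to x^2, which several ladder pairs
# (e.g., log applied to both, or sqrt applied to y) linearize exactly
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.0 * v ** 2 for v in x]
name_x, name_y, r = best_transformation(x, y)
print(name_x, name_y, round(abs(r), 4))
```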
[Figures: two scatterplots titled "Linear Relationship" and "Outlying x
Value," each with y-axis "Observed y values," followed by a further
scatterplot with y-axis "Observed y value."]
anomalous points and try again, either to interpret the original scatterplot or
to seek a more informative view by applying the correlation trick to the cleaned
dataset.
x and y satisfy the relationship y = x². Note that while this transformation is
monotonic for positive data values, x assumes both positive and negative values
here, making the transformation non-monotonic and therefore non-invertible:
note that both x = 2 and x = −2 imply y = 4, so that given y alone, we have
no way of determining which of these original x values is correct. The product-
moment correlation value for this example is −0.068 and the Spearman rank
correlation coefficient is 0.040, neither one at all suggestive of the fairly strong
relationship that actually exists between the variables in this example.
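This failure is easy to reproduce: for any x sequence symmetric about zero
with y = x², the cross-products (x − x̄)(y − ȳ) cancel in ± pairs, driving
ρP to zero. A minimal sketch (the helper implementation is my own):

```python
import math

def pearson(x, y):
    """Product-moment correlation (minimal sketch)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# x symmetric about zero, y an exact but non-monotonic function of x
x = [v / 2.0 for v in range(-10, 11)]   # -5.0, -4.5, ..., 5.0
y = [v ** 2 for v in x]

# rho_P is essentially zero despite the exact relationship y = x^2
print(abs(pearson(x, y)) < 1e-9)  # True
```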
http://cran.r-project.org/
All of the results and figures for this article were generated using R, and it is
a software package that I recommend highly, but with the following caution: R
is an extremely powerful statistical software package, with a vast and growing
array of add-on packages available for it (over 2,000 at the time of this writ-
ing). For that reason, having access to R is a bit like being given a jet fighter
aircraft, fueled and ready to go, sitting in your driveway: you can do amazing
things with it, but it’s important to learn enough to understand what you are
doing before you sail off into the wild blue yonder. Unlike the fighter aircraft,
when you do something you don’t really understand in R, you won’t be burned
beyond recognition, but the embarrassment factor can be acute. Besides the
free documentation available from the CRAN Website listed above, a number
of very good books are also available [1, 6].
References
[1] Michael J. Crawley, The R Book, Wiley, 2007.
This book is over 900 pages long and begins with succinct but
detailed instructions on how to download the R statistical soft-
ware environment, following that with extensive, example-based
discussions of many of R’s features, including some useful details
on how to make R talk to Microsoft Excel. If you work diligently
through this book, you will know a great deal about both R and
applied statistics when you are done.
[2] Carroll Lane Fenton and Mildred Adams Fenton, The Fossil Book, Double-
day and Company, 1958.
This is a fabulously illustrated book that I have had for years,
and the cited quote gives the best commentary I have ever read
on the small size of the dinosaurs’ brains. It is still available
through Amazon from used book dealers, and an updated pa-
perback edition is available new: The Fossil Book: A Record of
Prehistoric Life, by Patricia Vickers Rich, Thomas Hewitt Rich,
Mildred Adams Fenton, and Carroll Lane Fenton (Dover, 1997).
[3] Richard W. Hamming, Numerical Methods for Scientists and Engineers,
McGraw-Hill, 1962.
The main focus of this book is linear regression, and it gives a
thorough introduction to the subject. I have cited it here because
it gives a useful treatment of the use of Box-Cox transformations
in exploratory data analysis, including the recommendation to
look at a few selected values of the λ parameter, along the lines
discussed in this article.
Both Venables and Ripley have been heavily involved in the de-
velopment of R for a long time, and both are acknowledged, along
with many others in the preface of Crawley’s R Book cited above.
Like R, S-Plus is a statistical software package based on the S lan-
guage developed at AT&T, for which John Chambers was given
the Association for Computing Machinery (ACM) Software Sys-
tem Award in 1998, describing it as a development “which has
forever altered the way people analyze, visualize, and manipulate
data.” Like Crawley’s R Book, this book of Venables and Rip-
ley gives an excellent introduction to R and its use in applied
statistics and data analysis.