[Figure: scatterplot with y-axis "Log of Brain Weight" (ticks 0 to 3) and
x-axis ticks −1 to 5; the solid point is labeled Kangaroo.]
Figure 1: Scatterplot of the log of brain weight plotted against the log of body
weight for 28 animal species.
2 Scatterplots
As noted, a scatterplot is simply a plot of one sequence of numbers against
another. An example is shown in Fig. 1, which plots the logarithm of brain
weights for 28 animal species against the logarithm of their corresponding body
weights. That is, each point on the plot represents one of the 28 animals, with
the height of the point above the horizontal axis corresponding to the log of the
animal’s brain weight in grams, and the distance of the point from the vertical
axis representing the log of the animal’s body weight in kilograms. As a specific
example, the point represented as a solid circle corresponds to the numbers for
a kangaroo, with a body mass of 35 kilograms (so log10 body mass ≃ 1.544) and
a brain mass of 56 grams (so log10 brain mass ≃ 1.748). This dataset is one of
many distributed with the R statistical software package discussed further at
the end of this article (see Section 8; these numbers were derived from the
Animals data frame in the MASS package, distributed as part of the base R
installation).
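These logarithms are easy to verify. A quick Python check (the article's own
computations use R, discussed in Section 8, but any language will do; the
weights are the ones quoted above):

```python
import math

body_kg = 35.0    # kangaroo body mass from the text, in kilograms
brain_g = 56.0    # kangaroo brain mass from the text, in grams

print(round(math.log10(body_kg), 3))   # 1.544
print(round(math.log10(brain_g), 3))   # 1.748
```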
It is clear from this plot that brain mass tends to increase with increasing
body mass, with most of the points lying roughly along a straight line, except
for the three points lying off to the right. These points represent dinosaurs,
animals noted for their enormous bodies and tiny brains. One of these dinosaurs
is Diplodocus, whose brain was slightly smaller than that of the kangaroo but
whose body was 334 times heavier. According to one account [2, p. 335]:
3 Transformations
The difference between the informative plot in Fig. 1 and the non-informative
plot in Fig. 2 lies in the use of the log transformation applied to both variables.
Since the original data didn’t come to us in this representation, the key practical
question is, “how should we know to apply these transformations?” For the
specific case of log transformations, a common motivation is often a very wide
dynamic range in the observed data values: if all of the numbers are positive
and the largest one is several orders of magnitude larger than the smallest one,
the logarithmic transformation is often useful in giving us a more informative
view of the data. This condition certainly holds for the body-weight/brain-
weight example: the body weights range from 0.023 kilograms for the mouse to
87,000 kilograms for the Brachiosaurus, a difference of 6 orders of magnitude,
and the brain weights range from 0.4 grams (the mouse again) to 5,712 grams
for the African elephant, a difference of 4 orders of magnitude. More generally,
however, there are uncountably many possible transformations we could apply
to one or both variables, and the reality is that none of them may lead us
to an informative scatterplot.

[Figure 2: scatterplot of untransformed brain weight (y-axis "Untransformed
Brain Weight," 0 to 5000 grams) against untransformed body weight.]

Since looking for something that doesn’t
exist can waste a lot of time, what should we do? A simple, practical technique
that is often useful in telling us when such a transformation is or is not likely
to exist is described in Section 5, based on the correlation measures described
in Section 4. First, however, it is important to present a few key ideas about
how transformations work.
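The wide-dynamic-range condition described above is easy to check
programmatically. A small Python sketch (the function name is my own; the
numbers are the extremes quoted earlier in the text):

```python
import math

def orders_of_magnitude(values):
    """Number of decades (powers of 10) spanned by positive values."""
    if min(values) <= 0:
        raise ValueError("log transformations require positive values")
    return math.log10(max(values) / min(values))

body_kg = [0.023, 35.0, 87000.0]   # mouse, kangaroo, Brachiosaurus
brain_g = [0.4, 56.0, 5712.0]      # mouse, kangaroo, African elephant

# Roughly 6.6 and 4.2 decades: both wide enough to suggest a log scale
print(round(orders_of_magnitude(body_kg), 1),
      round(orders_of_magnitude(brain_g), 1))  # 6.6 4.2
```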
First, an important characteristic of the transformations under consideration
here is that they are invertible, meaning that if T is a transformation that takes
us from the variable x to the variable z (mathematically, written as z = T {x}),
there exists a well-defined inverse transformation T −1 that takes us back from
z to x (i.e., x = T −1 {z}). As a practical matter, invertibility means that the
transformation T is information-preserving, since we can always recover our
original data exactly from the transformed data.
The log transformation considered in the brain-weight/body-weight example
is invertible: if z = log10 x, then the inverse transformation is given by x = 10z .
It is important to note, however, that the log transformation is only well-defined
and invertible if x is a positive number.
The second characteristic of the data transformations considered here is that
they are order-preserving, meaning that if two values, say x1 and x2 , satisfy the
condition x1 > x2 in the original data, the transformed values, z1 and z2 ,
satisfy the condition z1 > z2 in the transformed data. In fact, this requirement
is related to invertibility, although I won’t go into the mathematical details
here. Suffice it to say that, mathematically, an order-preserving transformation
is strictly increasing, and any function that is strictly increasing and continuous
(i.e., a function whose graph doesn’t include any breaks or other discontinuous
kinks) is invertible. Again, the log transformation is a good example: the log
function is both strictly increasing and continuous, provided the original data
sequence to which it is applied consists of positive numbers.
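Both properties are easy to demonstrate for the log transformation. A
minimal Python sketch (illustrative only; the data values are the extremes
and the kangaroo values quoted earlier):

```python
import math

# Positive data values only, listed in increasing order
x = [0.023, 0.4, 35.0, 56.0, 87000.0]

# Invertibility: z = log10(x) is undone by x = 10**z,
# up to floating-point rounding
z = [math.log10(v) for v in x]
x_back = [10.0 ** v for v in z]
print(all(math.isclose(a, b) for a, b in zip(x, x_back)))  # True

# Order preservation: the transformed values keep the original ordering
print(z == sorted(z))  # True
```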
Essentially, this number gives a measure of the linear association between the
variables x and y that can assume any value between −1 and +1. The following
three special cases have particular practical importance:
1. if ρP (x, y) assumes its maximum possible value of +1, the variables satisfy
   a linear relationship with a positive slope, i.e., y = ax + b for some
   a > 0;
2. if ρP (x, y) assumes its minimum possible value of −1, the variables satisfy
   a linear relationship with a negative slope, i.e., y = ax + b for some
   a < 0;
2. if ρS (x, y) assumes its minimum possible value of −1, the variables are
   related exactly by an order-reversing (strictly decreasing) transformation,
   y = T {x}.
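Both correlation measures can be computed in a few lines. The helper
functions below are illustrative sketches (in practice one would call a
library routine such as R's cor()); note that ρS is simply ρP applied to the
ranks of the data, which is why any order-preserving transformation leaves
it unchanged:

```python
import math

def pearson(x, y):
    """Product-moment correlation rho_P(x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def ranks(x):
    """Rank of each value, 1 = smallest (assumes no ties)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation rho_S(x, y)."""
    return pearson(ranks(x), ranks(y))

# A strictly increasing but highly nonlinear relationship: y = 10^x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [10.0 ** v for v in x]

print(round(spearman(x, y), 6))  # 1.0: the monotone association is perfect
print(round(pearson(x, y), 2))   # roughly 0.76: rho_P understates it
```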
[Table 1: correlations between brain weight and body weight under each
combination of log transformations. Columns: Brain Weight Transformation,
Body Weight Transformation, Product-Moment Correlation, Spearman Rank
Correlation; the body of the table was not recovered.]
[Figure 3: four scatterplots of brain weight against body weight under the
four combinations of log transformations; the visible panel titles are "No
transformations" and "Log x Only."]
any relationship between these variables, and the product-moment correlation
value (ρP (x, y) = −0.005) is essentially zero. The fact that the Spearman rank
correlation value is substantially larger (ρS (x, y) = 0.719) suggests that it may
be useful to apply a transformation. The upper right plot shows the results
of applying the log transformation to the body weight data, corresponding to
the second row in Table 1. Here, the product-moment correlation value is sub-
stantially larger (ρP (x, y) = 0.405), giving stronger evidence for a relationship
between the variables. The visual evidence in support of this idea is somewhat
stronger than in the original untransformed plot, but not compelling, in part
because of the influence of the points associated with the largest body weights
and small brain weights.
The lower left plot in Fig. 3 shows the results of applying the log transfor-
mation to the brain weight data only, corresponding to the third row in Table 1.
Here, the product-moment correlation value is again quite small, consistent with
the fact that this plot shows very little evidence for a systematic relationship
between these variables. Note, however, that applying the log transformation
to the brain weight data and leaving the body weight data untransformed does
provide a strong suggestion that the animals with the largest body masses may
be anomalous. Finally, the lower right plot shows the same results presented
in Fig. 1, where the log transformation has been applied to both variables and
corresponding to the last row in Table 1. Here, the product-moment correla-
tion coefficient is actually larger than the Spearman rank correlation coefficient,
and this plot reveals the most compelling evidence in support of a relationship
between brain weight and body weight of any of the four.
Specifically, the transformations in this family are the power
transformations T {x} = x^λ for λ ≠ 0, together with T {x} = ln x for λ = 0,
where ln x denotes the natural logarithm (i.e., log to the base e ≃ 2.71828).
If we restrict x to positive values, all of these transformations are invertible—
hence information-preserving—and they are strictly increasing—and thus
order-preserving—if λ ≥ 0. Restricting consideration to this family of
transformations
gives us something more definite to work with than “some transformation,” but
it is still true that any finite value of λ—positive or negative—defines a po-
tentially useful transformation, and that’s too many to evaluate by trial and
error. What is commonly done in practice is to try a few selected special cases
spanning a broadly useful range of λ values, and this is especially appropriate
when looking for transformations to improve the visual appearance of a scat-
terplot. In fact, a slightly simplified strategy would be to apply the following
transformations to x, y, and both variables:
T {x} = ⁴√x, ³√x, √x, log10 x, x (i.e., no transformation), x², x³, and x⁴.
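This trial-and-error strategy is easy to mechanize: apply every pair of
candidate transformations and keep the one whose product-moment correlation
is largest in magnitude. The sketch below illustrates the idea (the function
name and example dataset are my own, not from the article; positive data are
assumed throughout):

```python
import math
from itertools import product

# The ladder of candidate transformations listed above
LADDER = {
    "x^(1/4)": lambda v: v ** 0.25,
    "x^(1/3)": lambda v: v ** (1.0 / 3.0),
    "sqrt(x)": math.sqrt,
    "log10(x)": math.log10,
    "x": lambda v: v,
    "x^2": lambda v: v ** 2,
    "x^3": lambda v: v ** 3,
    "x^4": lambda v: v ** 4,
}

def pearson(x, y):
    """Product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def best_transformation(x, y):
    """Try every pair of ladder transformations on positive data and
    return (x_name, y_name, rho_P) for the pair maximizing |rho_P|."""
    candidates = (
        (nx, ny, pearson([fx(v) for v in x], [fy(v) for v in y]))
        for (nx, fx), (ny, fy) in product(LADDER.items(), repeat=2)
    )
    return max(candidates, key=lambda t: abs(t[2]))

# Hypothetical example: y proportional to x^2, which several ladder pairs
# (e.g., log applied to both, or sqrt applied to y) linearize exactly
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.0 * v ** 2 for v in x]
name_x, name_y, r = best_transformation(x, y)
print(name_x, name_y, round(abs(r), 4))
```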
[Figures: two scatterplots titled "Linear Relationship" and "Outlying x
Value," each with y-axis "Observed y values," followed by a further
scatterplot with y-axis "Observed y value."]
anomalous points and try again, either to interpret the original scatterplot or
to seek a more informative view by applying the correlation trick to the cleaned
dataset.
x and y satisfy the relationship y = x². Note that while this transformation is
monotonic for positive data values, x assumes both positive and negative values
here, making the transformation non-monotonic and therefore non-invertible:
note that both x = 2 and x = −2 imply y = 4, so that given y alone, we have
no way of determining which of these original x values is correct. The product-
moment correlation value for this example is −0.068 and the Spearman rank
correlation coefficient is 0.040, neither one at all suggestive of the fairly strong
relationship that actually exists between the variables in this example.
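This failure is easy to reproduce: for any x sequence symmetric about zero
with y = x², the cross-products (x − x̄)(y − ȳ) cancel in ± pairs, driving
ρP to zero. A minimal sketch (the helper implementation is my own):

```python
import math

def pearson(x, y):
    """Product-moment correlation (minimal sketch)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# x symmetric about zero, y an exact but non-monotonic function of x
x = [v / 2.0 for v in range(-10, 11)]   # -5.0, -4.5, ..., 5.0
y = [v ** 2 for v in x]

# rho_P is essentially zero despite the exact relationship y = x^2
print(abs(pearson(x, y)) < 1e-9)  # True
```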
http://cran.r-project.org/
All of the results and figures for this article were generated using R, and it is
a software package that I recommend highly, but with the following caution: R
is an extremely powerful statistical software package, with a vast and growing
array of add-on packages available for it (over 2,000 at the time of this writ-
ing). For that reason, having access to R is a bit like being given a jet fighter
aircraft, fueled and ready to go, sitting in your driveway: you can do amazing
things with it, but it’s important to learn enough to understand what you are
doing before you sail off into the wild blue yonder. Unlike the fighter aircraft,
when you do something you don’t really understand in R, you won’t be burned
beyond recognition, but the embarrassment factor can be acute. Besides the
free documentation available from the CRAN Website listed above, a number
of very good books are also available [1, 6].
References
[1] Michael J. Crawley, The R Book, Wiley, 2007.
This book is over 900 pages long and begins with succinct but
detailed instructions on how to download the R statistical soft-
ware environment, following that with extensive, example-based
discussions of many of R’s features, including some useful details
on how to make R talk to Microsoft Excel. If you work diligently
through this book, you will know a great deal about both R and
applied statistics when you are done.
[2] Carroll Lane Fenton and Mildred Adams Fenton, The Fossil Book, Double-
day and Company, 1958.
This is a fabulously illustrated book that I have had for years,
and the cited quote gives the best commentary I have ever read
on the small size of the dinosaurs’ brains. It is still available
through Amazon from used book dealers, and an updated pa-
perback edition is available new: The Fossil Book: A Record of
Prehistoric Life, by Patricia Vickers Rich, Thomas Hewitt Rich,
Mildred Adams Fenton, and Carroll Lane Fenton (Dover, 1997).
[3] Richard W. Hamming, Numerical Methods for Scientists and Engineers,
McGraw-Hill, 1962.
The main focus of this book is linear regression, and it gives a
thorough introduction to the subject. I have cited it here because
it gives a useful treatment of the use of Box-Cox transformations
in exploratory data analysis, including the recommendation to
look at a few selected values of the λ parameter, along the lines
discussed in this article.
Both Venables and Ripley have been heavily involved in the de-
velopment of R for a long time, and both are acknowledged, along
with many others in the preface of Crawley’s R Book cited above.
Like R, S-Plus is a statistical software package based on the S lan-
guage developed at AT&T, for which John Chambers was given
the Association for Computing Machinery (ACM) Software Sys-
tem Award in 1998, describing it as a development “which has
forever altered the way people analyze, visualize, and manipulate
data.” Like Crawley’s R Book, this book of Venables and Rip-
ley gives an excellent introduction to R and its use in applied
statistics and data analysis.