
Stats 202 / Homework 1 / Supratik Bose / Resubmitted July 5, 2012

I am resubmitting the homework because, after listening to lecture 3, I
realized that I had made mistakes in the relative frequency polygon and
the ECDF plot. I hope this resubmission will be accepted.

1) Read Chapter 1 (all), Chapter 2 (only sections 2.1, 2.2 and 2.3), and Chapter 3 (only
3.1, 3.2, and 3.3).

Answer: Completed

2) Do Chapter 2 textbook problem #2 on page 89.

Classify the following attributes as binary, discrete, or continuous. Also classify them as
qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have
more than one interpretation, so briefly indicate your reasoning if you think there may be
some ambiguity.

Example: Age in years. Answer: Discrete, quantitative, ratio

(a) Time in terms of AM or PM.


Answer: Binary, Qualitative, Ordinal
Reason: Here I have assumed Time = {AM (forenoon), PM (afternoon)} and NOT a
continuous time such as 6 hours 20 min 15 sec 12 milliseconds AM. It is still ordinal
because AM comes before PM within a day.

(b) Brightness as measured by a light meter.


Answer: Continuous, Quantitative, Ratio
Reason: Absolute measurement of illumination in Lux or other unit.

(c) Brightness as measured by people’s judgments.


Answer: Discrete, Qualitative, Ordinal
Reason: The values are categories such as very dark, dark, normal, bright, very bright.
Ordinal since there is a sense of more or less bright.

(d) Angles as measured in degrees between 0◦ and 360◦.


Answer: Continuous, Quantitative, Ratio
Reason: Absolute measurement of angle to arbitrary precision.

(e) Bronze, Silver, and Gold medals as awarded at the Olympics.
Answer: Discrete, Qualitative, Ordinal

(f) Height above sea level.


Answer: Continuous, Quantitative, Ratio
Reason: Even though mean sea level varies (which makes it difficult to define an
“absolute zero”), if we consider height above sea level as a measurement determined by
a calibrated altimeter / barometer, then it can be considered a ratio attribute.

(g) Number of patients in a hospital.


Answer: Discrete, Quantitative, Ratio

(h) ISBN numbers for books. (Look up the format on the Web.)
Answer: Discrete, Qualitative, Nominal
Reason: It cannot be ordinal since ISBN numbers are sold in batches to a publisher, and
when comparing two ISBN numbers from different publishers it may not be possible to say
which book came onto the market first.

(i) Ability to pass light in terms of the following values: opaque, translucent, transparent.
Answer: Discrete, Qualitative, Ordinal

(j) Military rank.


Answer: Discrete, Qualitative, Ordinal

(k) Distance from the center of campus.


Answer: Continuous, Quantitative, Ratio
Reason: Absolute measurement of distance from center of the campus.
It can also be Discrete, Qualitative, Ordinal if one uses terms like “very near”, “near”,
“far”, “very far” to describe the distance from the center of the campus.

(l) Density of a substance in grams per cubic centimeter.


Answer: Continuous, Quantitative, Ratio
Reason: An absolute measurement of density can be taken to arbitrary precision, even
when expressed in grams per cubic centimeter, so it is continuous.

(m) Coat check number. (When you attend an event, you can often give
your coat to someone who, in turn, gives you a number that you can
use to claim your coat when you leave.)
Answer: Discrete, Qualitative, Nominal
Reason: It is nominal if a distinct token is chosen from a set of tokens and attached to
the coat while its counterpart is given to the person. It could be ordinal if coat
numbers were monotonically increasing in the order the coats were checked in.

3) This question uses the data here. Download it to your computer.

a) Read in the data in R using data<-read.csv("myfirstdata.csv",header=FALSE). Note,
you first need to specify your working directory using the setwd() command. Determine
whether each of the two attributes (columns) is treated as qualitative (categorical) or
quantitative (numeric) using R. Explain how you can tell using R.
Answer:

> setwd("G:/Shared/Stats202/Homework1");
> getwd();
[1] "G:/Shared/Stats202/Homework1"
> data <- read.csv("myfirstdata.csv",header=FALSE);
> data[1:5,]
V1 V2
1 0 0
2 0 3
3 0 1
4 1 2
5 0 0
> is.factor(data[,1])
[1] FALSE
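> is.factor(data[,2])
[1] FALSE

Both calls return FALSE, so R does not store either column as a factor (categorical).
Both columns are read as numbers, i.e. R treats both attributes as quantitative (numeric).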
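
As a minimal sketch of other ways to check how R stores each column, the example below uses a small made-up data frame. The name toy and its values are hypothetical and are not the homework data.

```r
# Hypothetical stand-in for the homework data: one numeric column, one categorical column.
toy <- data.frame(V1 = c(0, 0, 1), V2 = c("a", "b", "a"), stringsAsFactors = TRUE)

# class() reports how each column is stored.
col_classes <- sapply(toy, class)
print(col_classes)

# is.numeric() is TRUE for quantitative columns, is.factor() for categorical ones.
print(is.numeric(toy$V1))
print(is.factor(toy$V2))
```

Applied to the homework data, sapply(data, class) would report the storage type of V1 and V2 in a single call.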

b) Use the command plot() in R to make a plot for each column by entering plot(data[,1])
and plot(data[,2]). Explain exactly what is being plotted in each of the two cases. Include
these two plots in your homework.

Answer:
> plot(data[,1])

Below is the plot with the index on the X axis and column 1 of data on the Y axis.

[Plot: data[, 1] (Y axis, 0 to 25) against Index (X axis, 0 to 2000)]

> plot(data[,2])

Below is the plot with the index on the X axis and column 2 of data on the Y axis.

[Plot: data[, 2] (Y axis, 0 to 50) against Index (X axis, 0 to 2000)]

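
As a sketch of what plot() does with a single numeric vector: the values go on the Y axis and their positions 1, 2, ..., n go on the X axis, so the call is equivalent to plotting against seq_along(x). The vector x below is made up for illustration.

```r
# Hypothetical values standing in for one column of the data.
x <- c(0, 3, 1, 2, 0)

# The X coordinates that plot(x) uses implicitly: 1, 2, ..., length(x).
idx <- seq_along(x)

# Draw to a null PDF device so the sketch runs without a display.
pdf(NULL)
plot(x)                                   # index on X, values on Y
plot(idx, x, xlab = "Index", ylab = "x")  # same picture, made explicit
invisible(dev.off())
```
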
c) Use the R functions mean(), max(), var() and quantile(,.25) to compute the mean,
maximum, variance and 1st quartile respectively of the data in the first column. Show
your R code and the resulting values.

Answer:

> mean(data[,1])
[1] 1.593
> max(data[,1])
[1] 27
> var(data[,1]);
[1] 4.526614
> quantile(data[,1], probs = seq(0, 1, 0.25), na.rm =
FALSE, names = TRUE, type = 7)
0% 25% 50% 75% 100%
0 0 1 2 27
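
If only the 1st quartile is wanted, as the question asks, quantile() can be called with a single probability. Below is a small sketch on made-up values (x is hypothetical, and the default type = 7 interpolation is assumed).

```r
# Hypothetical values, not the homework data.
x <- c(4, 8, 15, 16, 23, 42)

# Only the 1st quartile; type = 7 interpolates between the 2nd and 3rd sorted values here.
q1 <- quantile(x, 0.25)
print(q1)

# The other summary statistics requested in the question.
print(mean(x))
print(max(x))
print(var(x))
```
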

4) Chapter 3 textbook problem #2 on page 142: Identify at least two advantages and two
disadvantages of using color to visually represent information.

Answer:

Advantages:

(1) Because the symbolic meaning of some colors is universally accepted (e.g.,
“traffic signal” colors: red for warning, orange/yellow for moderate, green for go),
color in an information display can be used to draw the user’s attention to feature
values that require attention (e.g., red for out-of-bound features) or to feature values
that are acceptable (e.g., green for well-performing features).

(2) Color is also used to easily represent “order” in numeric values. Typically, the range
of the displayed variable is mapped to a color map (similar to the wavelength / temperature
of visible light). “Warm” colors near red are used to represent larger numeric values,
whereas “cool” colors near blue are used to represent smaller numeric values. For
example, in radiotherapy treatment planning a color-based contour map is shown to
highlight low-dose and high-dose regions within the patient anatomy.

Reference: the website http://colorusage.arc.nasa.gov/
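
The value-to-color mapping described in advantage (2) can be sketched in R as follows. The values, the number of color levels, and the blue-to-red ramp are illustrative assumptions and are not taken from any particular application.

```r
# Hypothetical numeric values to display (for example, dose levels).
values <- c(2, 10, 25, 40, 50)

# A cool-to-warm ramp: blue for small values, red for large ones.
n_levels <- 5
ramp <- colorRampPalette(c("blue", "red"))(n_levels)

# Bin each value into one of n_levels equal-width intervals over the data range.
level <- cut(values, breaks = n_levels, labels = FALSE, include.lowest = TRUE)
value_colors <- ramp[level]
print(data.frame(values, level, value_colors))
```

The smallest value gets the pure blue end of the ramp and the largest the pure red end, which is the “cool to warm” ordering described above.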

Disadvantages:

(1) If color is not used judiciously, it may clutter a display or even convey wrong
information.

(2) In some instances, for example in 2D X-ray projection images and 3D computed
tomography images, gray scale values convey information such as anatomy boundaries
better, because they resemble the X-ray film images on which most physicians are
trained.

(3) Color-blind people may miss information if it is displayed only via color.

5) This question uses a sample of 1500 California house prices at:

http://sites.google.com/site/stats202/homework-1/CA_house_prices.csv
and a sample of 10,000 Ohio house prices at
http://sites.google.com/site/stats202/homework-1/OH_house_prices.csv

Download both data sets to your computer. Note that the house prices are in thousands of
dollars.

Answer: Completed

a) Use R to produce a frequency histogram for only the California house prices. Use
intervals of width $500,000 beginning at 0 and ending at $3.5 million. Include the R
commands and the plot. Put your name in the title of the plot.

> caHousePrice <- read.csv("CA_house_prices.csv",header=FALSE);
> caHousePrice[1:5,]
[1] 500 1242 1001 897 720
> caHousePrice <- caHousePrice * 1000
> caHousePrice[1:5,]
[1] 500000 1242000 1001000 897000 720000
> hist(caHousePrice[,1], br = c(500000*0:7), labels = TRUE, xlab = "Price in $", main = "Histogram of CA House Prices: Supratik Bose")

[Figure: Histogram of CA House Prices: Supratik Bose. X axis: Price in $ (0 to 3,500,000); Y axis: Frequency (0 to 600). Bar count labels shown in the plot, in descending order: 593, 526, 246, 64, 57, 10, 4.]
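
A minimal sketch of how hist() assigns values to the fixed-width intervals, using made-up prices (the vector prices is hypothetical). With plot = FALSE the function returns the bin counts and midpoints without drawing anything.

```r
# Hypothetical prices in dollars, not the CA data.
prices <- c(120000, 480000, 500000, 750000, 1200000, 3400000)

# Seven bins of width $500,000 from 0 to $3.5 million.
h <- hist(prices, breaks = 500000 * 0:7, plot = FALSE)

# Houses per bin; bins are right-closed by default, so 500000 falls in the first bin.
print(h$counts)

# Bin midpoints: 250000, 750000, ..., 3250000.
print(h$mids)
```
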

b) Use R to produce a plot showing relative frequency polygons for both the California
prices and the Ohio prices on the same graph. Include a legend. Use the midpoints of the
intervals from the previous exercise. (The first point should be at $250,000 and the last at
$3.25 million). Include the R commands and the plot. Put your name in the title of the
plot.

Answer:

# Read both data sets and convert prices from thousands of dollars to dollars.
caHousePrice <- read.csv("CA_house_prices.csv",header=FALSE);
caHousePrice <- caHousePrice * 1000;
caHousePrice[1:5,];
ohHousePrice <- read.csv("OH_house_prices.csv",header=FALSE);
ohHousePrice <- ohHousePrice * 1000;
ohHousePrice[1:5,];

# Bin each sample without plotting, then convert counts to relative frequencies.
ohHist <- hist(ohHousePrice[,1], breaks = c(500000*0:7), plot=FALSE);
ohHist$counts <- ohHist$counts / sum(ohHist$counts);
ohColor <- "red"
caHist <- hist(caHousePrice[,1], breaks = c(500000*0:7), plot=FALSE);
caHist$counts <- caHist$counts / sum(caHist$counts);
caColor <- "blue"

# Plot the OH polygon at the interval midpoints, then overlay the CA polygon.
# ylim covers both curves so that no CA point falls outside the plot region.
plot(ohHist$mids, ohHist$counts, col = ohColor, pch=21, ylim = range(0, ohHist$counts, caHist$counts), xlab = "Price in $", ylab = "relative frequency", main = "Relative Frequency Polygon of CA and OH House Prices: Supratik Bose")
lines(ohHist$mids, ohHist$counts, col = ohColor, lty=1)
points(caHist$mids, caHist$counts, col = caColor, pch=22)
lines(caHist$mids, caHist$counts, col = caColor, lty=2)
legend('topright', c('OH','CA'), col = c(ohColor, caColor), pch=21:22, lty=1:2)

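
The key step in a relative frequency polygon is dividing each bin count by the sample size, so that samples of different sizes can be compared on one axis. A minimal sketch with two made-up samples follows (a, b and their values are hypothetical).

```r
# Hypothetical samples of different sizes (prices in dollars).
a <- c(100000, 200000, 300000, 600000)
b <- c(400000, 700000, 800000, 900000, 1200000, 1300000, 1400000, 2600000)
breaks <- 500000 * 0:7

# Relative frequency = bin count / sample size, so each curve sums to 1.
hist_a <- hist(a, breaks = breaks, plot = FALSE)
hist_b <- hist(b, breaks = breaks, plot = FALSE)
rel_a <- hist_a$counts / length(a)
rel_b <- hist_b$counts / length(b)
mids <- hist_a$mids

pdf(NULL)  # null device so the sketch runs without a display
plot(mids, rel_a, type = "b", col = "red", pch = 21, lty = 1,
     ylim = range(0, rel_a, rel_b),  # make room for both curves
     xlab = "Price in $", ylab = "relative frequency")
lines(mids, rel_b, type = "b", col = "blue", pch = 22, lty = 2)
legend("topright", c("a", "b"), col = c("red", "blue"), pch = 21:22, lty = 1:2)
invisible(dev.off())
```
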
c) Use R to plot the ECDF of the California houses and Ohio houses on the same graph.
Include a legend. Include the R commands and the plot. Put your name in the title of the
plot.

Answer:

plot(ecdf(ohHousePrice[,1]), verticals= TRUE, do.p = FALSE, col.h=ohColor, col.v=ohColor, lwd=2, xlab = "Price in $", ylab = "ECDF", main = "ECDF of CA and OH House Prices: Supratik Bose")
lines(ecdf(caHousePrice[,1]), verticals= TRUE, do.p = FALSE, col.h=caColor, col.v=caColor, lwd=4)
legend('bottomright', c('OH','CA'), col = c(ohColor, caColor), lwd=c(2,4))

[Figure: ECDF of CA and OH House Prices: Supratik Bose. X axis: Price in $ (0e+00 to 3e+06); Y axis: ECDF (0.0 to 1.0); legend: OH, CA.]
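
As a sketch of what the ECDF represents: ecdf() returns a function, and evaluating it at a price gives the fraction of houses priced at or below that value. The prices below are made up for illustration.

```r
# Hypothetical prices in dollars.
prices <- c(100000, 250000, 250000, 400000, 900000)

# ecdf() returns a step function Fn with Fn(t) = (number of values <= t) / n.
Fn <- ecdf(prices)

print(Fn(250000))  # fraction of prices at or below 250000
print(Fn(50000))   # below the smallest price
print(Fn(900000))  # at the largest price
```
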
