
STA4806/ASS2/0/2016

Department of Statistics
STA4806: Advanced Research Methods in Statistics

Assignment 2, 2016

REMEMBER: ALL YOUR ASSIGNMENTS MUST BE TYPED IN LATEX, NOT WORD

ASSIGNMENT 02
Unique Nr.: 828027
Fixed closing date: 08 July 2016

QUESTION 1

Consider each of the following research problems and answer the following question for each problem.

Which statistical technique is applicable? Clearly state which variables are the dependent and which are
the independent variables. If applicable for a specific problem, what information can be added to improve
the survey and its results?

(a) RESEARCH PROBLEM NUMBER 1

In a survey of 200 patients who suffered heart attacks, various observations were made on each patient:
age, height, weight, pulse rate, blood pressure, cholesterol level, blood sugar level and red cell count.
The medical researcher is interested in isolating the variables that contribute most to the occurrence of
heart attacks. (4)

(b) RESEARCH PROBLEM NUMBER 2

A researcher wants to compare apple trees of six different rootstocks. Data on the trunk circumference
at 4 years and at 15 years, the extension growth at 4 years and the weight of the tree at 15 years are
available for 20 trees from each of the six different rootstocks. (3)

(c) RESEARCH PROBLEM NUMBER 3

Annual financial data are collected for 25 bankrupt firms approximately two years prior to their
bankruptcy and for 25 financially sound firms at about the same time. For each firm, data on the
following four variables were obtained:
1 = (net income)/(total assets); 2 = (cash flow)/(total debt);
3 = (current assets)/(current liabilities); and 4 = (current assets)/(net sales).

An analyst wants to construct a model which can be used to predict the financial soundness of a firm
in terms of bankruptcy or non-bankruptcy. (3)

[10]

QUESTION 2

A provider of IT services would like to test perceptions of the prices that it charges for its services to
customers. A survey was completed with 148 of its customers, which among other things contained 7
questions regarding pricing:

Q3.2 Overall perception of prices charged


Q3.2.1 Perception of prices charged for computer repairs
Q3.2.2 Perception of prices charged for software upgrades
Q3.2.3 Perception of prices charged for network installations
Q3.2.4 Perception of prices charged for network maintenance
Q3.2.5 Perception of prices charged for hardware upgrades
Q3.2.6 Perception of prices charged for website hosting

Respondents were asked to rate these questions on a scale of 1 to 5, where 1 = Very high prices, 2 =
High prices, 3 = Moderate prices, 4 = Low prices and 5 = Very low prices. Specifically, the service
provider would like to determine the effect that each service’s price has on the overall price perception
that customers have. They would like to do this by using regression analysis with Q3.2 as the dependent
variable, and Q3.2.1 to Q3.2.6 as the independent variables.

On each of the 7 questions, respondents were allowed to answer “Don’t know” if they felt that they could
not provide a perception of the prices.

Customers could fall into one of 5 segments according to the type of service they receive from the service
provider.

The following tables contain results for the missing value analysis on these variables.

Table 2.1: Missing values per variable

               Valid             Missing            Total
Variable     N     Percent     N     Percent     N     Percent
Q3.2        139     93.9%       9      6.1%     148    100.0%
Q3.2.1      120     81.1%      28     18.9%     148    100.0%
Q3.2.2      135     91.2%      13      8.8%     148    100.0%
Q3.2.3      121     81.8%      27     18.2%     148    100.0%
Q3.2.4       92     62.2%      56     37.8%     148    100.0%
Q3.2.5      100     67.6%      48     32.4%     148    100.0%
Q3.2.6       92     62.2%      56     37.8%     148    100.0%
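
The percentages in Table 2.1 follow from the counts and the sample size of 148. The following is a minimal Python sketch of that arithmetic, with the valid counts copied from the table; the names `valid`, `TOTAL` and `missing_summary` are illustrative and not part of the assignment.

```python
# Valid counts per question, copied from Table 2.1; the sample size is 148.
TOTAL = 148
valid = {
    "Q3.2": 139, "Q3.2.1": 120, "Q3.2.2": 135, "Q3.2.3": 121,
    "Q3.2.4": 92, "Q3.2.5": 100, "Q3.2.6": 92,
}


def missing_summary(valid_counts, total):
    """Return {question: (missing count, valid %, missing %)}, rounded to 1 decimal."""
    summary = {}
    for name, n_valid in valid_counts.items():
        n_missing = total - n_valid
        summary[name] = (
            n_missing,
            round(100 * n_valid / total, 1),
            round(100 * n_missing / total, 1),
        )
    return summary


summary = missing_summary(valid, TOTAL)
for name, (n_missing, pct_valid, pct_missing) in summary.items():
    print(f"{name:8s} missing={n_missing:3d} valid={pct_valid:5.1f}% missing={pct_missing:5.1f}%")
```

The sketch only reproduces the table; interpreting the extent of the missing data is still left to the discussion asked for below.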

Table 2.2: Number of missing values per case

Missing values   Frequency   Percent   Valid Percent   Cumulative Percent
0                    56       37.8%        37.8%             37.8%
1                    30       20.3%        20.3%             58.1%
2                    31       20.9%        20.9%             79.1%
3                    12        8.1%         8.1%             87.2%
4                     5        3.4%         3.4%             90.5%
5                     4        2.7%         2.7%             93.2%
6                     1        0.7%         0.7%             93.9%
7                     9        6.1%         6.1%            100.0%
Total               148      100.0%       100.0%
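
Tables 2.1 and 2.2 describe the same missing values from two angles (per variable and per case), so their totals should agree. The sketch below is a hedged consistency check in Python using only the counts printed in the two tables; the variable names are illustrative.

```python
# Table 2.2: number of missing values per case -> number of cases.
cases_by_missing = {0: 56, 1: 30, 2: 31, 3: 12, 4: 5, 5: 4, 6: 1, 7: 9}
# Table 2.1: missing counts per variable (Q3.2, Q3.2.1, ..., Q3.2.6).
missing_by_variable = [9, 28, 13, 27, 56, 48, 56]

n_cases = sum(cases_by_missing.values())
# Total missing values counted case by case and variable by variable.
total_missing_by_case = sum(k * n for k, n in cases_by_missing.items())
total_missing_by_variable = sum(missing_by_variable)

# Cumulative percentage of cases with at most k missing values.
cumulative = {}
running = 0
for k in sorted(cases_by_missing):
    running += cases_by_missing[k]
    cumulative[k] = round(100 * running / n_cases, 1)

print(n_cases, total_missing_by_case, total_missing_by_variable)
print(cumulative)
```

If the two totals differed, one of the tables would have been run on a different set of cases.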

Table 2.3: Customer segment cross-tabled by missing values in variables

                   Q_3_2_1   Q_3_2_2   Q_3_2_3   Q_3_2_4   Q_3_2_5   Q_3_2_6   Total
Segment A  Count      6         3         8        14        20        15        66
Segment B  Count      1         0         0         2         2         0         5
Segment C  Count      5         0         0        11         5        16        37
Segment D  Count      3         1         4        10         7         4        29
Segment E  Count      4         0         6        10         5        12        37
Total      Count     19         4        18        47        39        47       174

Using the above tables, discuss the missing values in the dataset with specific reference to:

(a) The type of missing data (3)

(b) The extent of the missing data and the effect it might have on the regression analysis. (10)

(c) The possible reason why 9 cases were deleted between running the first two tables (Tables 2.1 and 2.2)
and running the third table (Table 2.3). (5)

(d) The randomness of the missing data (specifically referring to whether customer segment influences the
missing data process). (5)

(e) Any other analyses that might have added valuable information to the missing value
analysis. (5)

(f) If you can deduce an imputation method to be used from the results provided, which method would
you use? (2)

[30]

QUESTION 3

Holzinger and Swineford (1939) gave 24 psychological tests to 145 seventh and eighth grade students in
a Chicago suburb. The data are typical of the ability tests that have been used throughout the history of
factor analysis. The factor-analytic problem itself is concerned with the number and kind of dimensions
that can be used to describe the ability. The data are on myunisa under additional resources, sub-directory
assignment 2. Import the data into SPSS and answer the following questions.

(a) Calculate the mean and standard deviation of the 24 variables. (15)

(b) Perform a reliability analysis and give the reliability coefficient for each variable. (20)

(c) Compute the correlation matrix of the data and comment on its key features. (10)

(d) Perform an exploratory factor analysis of the data, label the factors, and ensure that you:
(i) find the number of factors,
(ii) identify poor factor indicators,
(iii) identify poorly measured factors, and
(iv) label your factors and interpret your final factor solution.
(25)

(e) Calculate the composite variable for each factor (by taking the average) and test whether its mean
differs by:

(i) Gender. (15)

(ii) Group. (15)

[100]
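
Cronbach's alpha is a widely used reliability coefficient, and "alpha if item deleted" is a common per-variable summary. The assignment does not state which coefficient is intended and asks for the analysis in SPSS, so the Python sketch below only illustrates the formula under that assumption; the function names and the tiny data set are hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows is a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Sample variance of each item and of the respondents' total scores.
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def alpha_if_item_deleted(rows):
    """Alpha recomputed with each item left out in turn (one value per item)."""
    k = len(rows[0])
    return [cronbach_alpha([row[:j] + row[j + 1:] for row in rows]) for j in range(k)]


# Hypothetical scores for 5 respondents on 3 items (not the Holzinger-Swineford data).
scores = [[3, 4, 3], [5, 5, 4], [2, 3, 2], [4, 4, 5], [1, 2, 2]]
print(cronbach_alpha(scores))
print(alpha_if_item_deleted(scores))
```

An item whose removal raises alpha noticeably is a candidate for closer inspection.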

QUESTION 4

The change of water (a liquid) in the soil to water vapour (a gas) is called evaporation. Heat and wind
help water to evaporate more quickly.

Evaporation is, among other things, a function of air and soil temperatures, relative humidity and wind. Since
these factors vary considerably throughout the day, it is not clear which variables are important.

The data set Soil evaporation data set.sav, found on myunisa under additional resources, sub-directory
assignment 2, consists of 46 observations. We want to find the effect of the variables described below on
daily water evaporation in the soil.

MAXST: Maximum daily soil temperature, measured in degrees Fahrenheit


MINST: Minimum daily soil temperature, measured in degrees Fahrenheit
MAXAT: Maximum daily air temperature, measured in degrees Fahrenheit
MINAT: Minimum daily air temperature, measured in degrees Fahrenheit
DAT: The difference between the maximum and minimum daily air temperature, measured in degrees
Fahrenheit
DH: The difference between the maximum and minimum daily relative humidity, measured as the
percentage of water vapour per cubic metre of air
Relative humidity is defined as the proportion of water vapour present in the air at a certain stage, relative
to the maximum proportion of vapour that can be held by the air. Relative humidity rises when air
temperature falls.
WIND: A measure of the cumulative strength of wind throughout the day in miles per day.
EVAP: Daily soil evaporation, measured as the difference between the percentage of water per cubic
metre of soil measured at dusk and also at dawn.
We would like you to perform a multiple regression analysis using EVAP as the dependent variable.
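
When summarising variables such as these, the coefficient of variation expresses the standard deviation as a percentage of the mean, which makes variables measured on different scales comparable. The following is a minimal Python sketch, assuming the sample standard deviation (divisor n - 1) is intended; the function name and the example values are hypothetical and are not taken from the data set.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return 100 * statistics.stdev(values) / mean


# Hypothetical daily maximum temperatures in degrees Fahrenheit.
example = [84, 86, 90, 88, 92]
print(round(coefficient_of_variation(example), 2))
```

The measure is only meaningful for variables on a ratio scale with a mean well away from zero.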

(a) Find the descriptive statistics of the variables (minimum, maximum, mean, standard deviation and
coefficient of variation) and discuss the statistics. (10)

(b) Calculate the correlation coefficient matrix and comment on the key features. (8)

(c) Is the sample size adequate for a multiple regression model involving all the independent variables?
Explain. (3)

(d) Consider the correlation matrix you obtain in part (b). Which of the seven independent variables do
you think might be possible predictors of EVAP? Why? (3)

(e) Fit a multiple regression model containing all the seven independent variables. Ensure that you include
the collinearity statistics. Give the statistics which indicate that the regression model containing all
seven independent variables gives a significant fit and explain why. (8)

(f) Looking at the table "coefficients" of the output you generated in (e), comment on the t-values of the
variables. (3)

(g) What can you deduce from the "tolerance values" of the output obtained in (e)? Compare your answer
with the conclusion that can be drawn from the values of the
"condition indexes". (4)

(h) Fit a backward elimination model and ensure that your output includes the collinearity statistics, collinearity
diagnostics and residual analysis statistics. Which of the two models would you recommend and why?
(8)

(i) Write down the estimated regression equation for the chosen model. (1)

(j) Using the backward elimination model, comment on the statistical significance of the parameter
estimates. How do these conclusions compare with the conclusions drawn
in (a)? (3)

(k) Is there any multicollinearity present in the backward elimination regression model? Justify your
answer. (2)

(l) Are there outliers or influential observations in the data set? Justify. (3)

(m) What can be done to improve the results of the regression analysis? (4)

[60]
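
The tolerance of a predictor is 1 - R^2 from regressing that predictor on all the other predictors, and the variance inflation factor is its reciprocal; values of tolerance near 0 signal multicollinearity. The Python sketch below illustrates the definition with a small hand-written least-squares solver. It is not SPSS output, and the column names `a`, `b` and `d` and their values are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def r_squared(y, predictors):
    """R^2 of a least-squares regression of y on the predictor columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    p = len(design[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def tolerances(columns):
    """Tolerance of each predictor: 1 - R^2 from regressing it on the others."""
    return {
        name: 1 - r_squared(col, [c for other, c in columns.items() if other != name])
        for name, col in columns.items()
    }


# Hypothetical predictors: a and b are uncorrelated; d is exactly a - b.
a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, -1.0, -1.0, 1.0]
d = [x - y for x, y in zip(a, b)]
print(tolerances({"a": a, "b": b}))
print(tolerances({"a": a, "b": b, "d": d}))
```

In the second call every predictor is an exact linear combination of the other two, so each tolerance is numerically zero and the corresponding variance inflation factor is unbounded.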

[200]
