Kyle Peterson Data Analysis 2

Kyle Peterson
Data Analysis CW2: Statistical Testing

Various statistical tests were used to interpret information from two data sets of mean sea level
heights. The height of mean sea level is a function of the height of the satellite above the sea
surface, the height of the satellite above a coordinate reference ellipsoid, and a tidal correction.
Primary data set
The primary data consists of 140 radar measurements taken from a satellite over several months,
each giving the observed height of mean sea level above the same coordinate reference ellipsoid.
The mean, the standard deviation, the standard error of the mean and the normal distribution
confidence intervals of this primary data, before outliers were removed, were calculated as follows:
Mean: 35.827m
Standard deviation: 0.533m
Standard error of the mean: 0.089m
95% confidence interval: (34.781m, 36.873m)
99% confidence interval: (34.453m, 37.201m)
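The reported intervals are consistent with limits of the form mean ± z × standard deviation, where z is the two-sided standard normal quantile. The following is a minimal sketch of that calculation, assuming this is how the limits were obtained; the function name `normal_limits` is illustrative and not taken from the coursework.

```python
from statistics import NormalDist

def normal_limits(mean, sd, confidence):
    """Return (lower, upper) limits of mean +/- z*sd for a two-sided confidence level."""
    # z is the standard normal quantile that leaves (1 - confidence)/2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return mean - z * sd, mean + z * sd

mean, sd = 35.827, 0.533  # mean and standard deviation reported above, in metres
for level in (0.95, 0.99):
    lower, upper = normal_limits(mean, sd, level)
    print(f"{level:.0%}: ({lower:.3f}m, {upper:.3f}m)")
```

Because the inputs are rounded to three decimals, the printed limits may differ from the reported ones in the last decimal place.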
Can a satellite in space really measure to within 10cm accuracy? Maybe with overdetermined
least-squares adjusted differential GPS tied to control points, but with radar? I doubt it.
Furthermore, errors are typically greater in the height direction and the sea surface is turbulent
and semi-reflective. Two outliers were identified in the primary data at both the 95% and 99%
confidence levels.
Table showing the two outliers of the primary data at both 95% and 99% confidence
Pass number    Mean sea level
119            40.098m
130            40.108m
These are clearly noticeable on inspection of the data, as the maximum value in the primary data
set below these two outliers is 36.107m. Perhaps they represent ships or a flock of birds passing
over the target area and reflecting the radar signal a few metres above the sea surface.
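Screening observations against the confidence limits can be sketched as follows. Only passes 119 and 130 and the height 36.107m come from the text; the other pass numbers and heights are hypothetical placeholders, and `flag_outliers` is an illustrative name.

```python
def flag_outliers(heights_by_pass, lower, upper):
    """Return {pass_number: height} for observations outside [lower, upper]."""
    return {p: h for p, h in heights_by_pass.items() if not lower <= h <= upper}

# Passes 119 and 130 are the reported outliers; every other entry is hypothetical.
sample = {118: 35.71, 119: 40.098, 120: 35.80, 129: 36.107, 130: 40.108}
limits_99 = (34.453, 37.201)  # wider of the two reported intervals, in metres
print(flag_outliers(sample, *limits_99))
```

An observation flagged against the wider interval is necessarily flagged against the narrower one as well, which is why both outliers appear at both confidence levels.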
Graph of the primary data set of radar measurements of mean sea level heights and clearly
showing the two outliers outside of both the 95% and 99% normal confidence limits
The mean, the standard deviation and the standard error of the mean of the primary data after
outliers were removed were calculated as follows:
Mean: 35.765m
Standard deviation: 0.134m
Standard error of the mean: 0.022m
As expected, the standard deviation and the standard error have been reduced by removing the
outliers. The mean of the primary data has also been reduced by about 6cm. This gives an indication
of the precision and reliability of the primary data set: it is precise for radar measurements from
space, but it could still be biased.
A Chi-squared Test can be used to test for normality of the primary data set, i.e. to ascertain whether
the primary data has a good fit with the normal distribution or not. Instead of arbitrarily defining
class intervals in which to bin observed and expected frequencies, a simple method of classification
was contrived to satisfy the following statistical constraint: use as many bins as possible while
giving each bin an expected frequency of at least 5. There are 138 observations after outliers were
removed, so the classing method can quite elegantly be defined as 23 bins, each with an expected
frequency of 6. Using the normal distribution to compute the bin intervals, and merging bins where
the number of observations is less than 5, the chi-squared value and the critical values of
chi-squared can be calculated.
Chi-squared value after outliers removed: 21.000
Critical value of chi-squared for 95% confidence: 23.685
Critical value of chi-squared for 99% confidence: 29.141
Chi-squared value before outliers removed: 265.500
There are 14 degrees of freedom in this calculation. The chi-squared value after outliers were
removed is less than both critical values, so the null hypothesis that the primary data set (after
outliers are removed) is normally distributed is not rejected: there is not enough evidence to say
that the primary data set is not normally distributed. By comparison, the chi-squared value before
outliers were removed far exceeds both critical values, so there is enough evidence to say that the
primary data before outliers were removed is not normally distributed according to the Chi-squared
Test.
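The equal-expected-frequency classing described above can be sketched as follows. This is an illustration under stated assumptions, not the coursework's actual calculation: the heights are synthetic normal draws standing in for the 138 real observations, the bin-merging step is omitted, and the name `chi_squared_equal_bins` is hypothetical.

```python
import random
from bisect import bisect_right
from statistics import NormalDist, mean, stdev

def chi_squared_equal_bins(data, n_bins):
    """Chi-squared statistic against a fitted normal, using equal-probability bins."""
    fitted = NormalDist(mean(data), stdev(data))
    # Interior bin edges at quantiles k/n_bins, so every bin expects len(data)/n_bins.
    edges = [fitted.inv_cdf(k / n_bins) for k in range(1, n_bins)]
    observed = [0] * n_bins
    for x in data:
        observed[bisect_right(edges, x)] += 1
    expected = len(data) / n_bins
    chi2 = sum((o - expected) ** 2 / expected for o in observed)
    return chi2, observed

random.seed(0)  # synthetic stand-in for the 138 heights, NOT the real data
heights = [random.gauss(35.765, 0.134) for _ in range(138)]
chi2, observed = chi_squared_equal_bins(heights, 23)
print(f"expected per bin = {138 / 23:.0f}, chi-squared = {chi2:.3f}")
```

The resulting statistic would then be compared with the critical value for the appropriate degrees of freedom, for example 23.685 at 95% confidence with the 14 degrees of freedom quoted in the text.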
Secondary data set
The secondary data set was taken over a later period. Solar activity (or other factors) may have
affected the 22 radar measurements taken. The mean, the standard deviation, the standard error of
the mean and the normal distribution confidence intervals of this secondary data (before any
potential outliers are removed) were calculated as follows:
Mean: 35.787m
Standard deviation: 0.215m
Standard error of the mean: 0.036m
95% confidence interval: (35.365m, 36.209m)
99% confidence interval: (35.232m, 36.342m)
The mean of the secondary data is greater than the mean of the primary data after the two outliers
are removed. There were no observations outside the 99% confidence limits. The observation of pass
number 3, with a height of 36.220m, lies between the 95% and 99% confidence limits; it is the
maximum value of the data, the next highest being 36.064m. This observation is strictly not an
outlier. Good data should not be thrown away, especially when there are only 22 observations. The
variance of the secondary data is greater than that of the primary data after outliers were removed,
where variance is the square of the standard deviation. These variances will be compared.
Comparison of data sets

Despite having only about 16% as many observations as the primary data, the secondary data set of
22 radar measurements can be compared to the primary data set of 138 observations (after outliers
removed).
Difference in the means: 0.022m
Difference in the variances: 0.029m²
Pooled variance: 0.022m²
The difference in the variances of the primary and secondary data sets is greater than their pooled
variance, where the pooled variance is expressed as a function of each variance and the respective
number of observations of each data set. An F Test can be used to compare the primary data set and
secondary data set. The value of F is the ratio of the variances of the 138 primary data observations
(after outliers are removed) and the 22 secondary data observations.
Value of F after outliers removed: 2.595
Critical value of F for 95% confidence: 1.859
Critical value of F for 99% confidence: 2.105
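The variance ratio can be checked from the reported standard deviations. This is a minimal sketch, assuming the larger variance is placed in the numerator; `variance_ratio` is an illustrative name. Because the standard deviations are rounded to three decimals, the result (about 2.57) differs slightly from the reported 2.595.

```python
def variance_ratio(sd_a, sd_b):
    """F statistic with the larger variance in the numerator."""
    var_a, var_b = sd_a ** 2, sd_b ** 2
    return max(var_a, var_b) / min(var_a, var_b)

f_value = variance_ratio(0.215, 0.134)  # reported SDs: secondary, primary (metres)
f_critical_95 = 1.859                   # critical value quoted in the text
print(f"F = {f_value:.3f}, exceeds 95% critical value: {f_value > f_critical_95}")
```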
The value of F is greater than both critical values, thus there is a significant difference between
the variances (but the cause cannot be determined from the F Test). Because the variance of the
secondary data is greater than that of the primary data, it can be said that the secondary data is
significantly noisier than the primary data. The difference in the means of about 2cm seems
somewhat negligible, but this is not a rigorous comparison. A T Test was used to compare the means
of the primary and secondary data.
Value of T after outliers removed: 0.645
Critical value of T for 95% confidence: 1.975
Critical value of T for 99% confidence: 2.607
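The text does not state which form of T Test was used. The sketch below assumes the two-sample form with a pooled variance, which gives values close to the reported pooled variance (0.022) and T (0.645) when fed the rounded summary statistics; `pooled_t` is an illustrative name.

```python
from math import sqrt

def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sample t statistic with pooled variance; returns (t, pooled variance, dof)."""
    dof = n_a + n_b - 2
    # Pooled variance: each sample variance weighted by its own degrees of freedom.
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / dof
    t = abs(mean_a - mean_b) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, pooled_var, dof

# Primary data (after outliers removed) against secondary data, values in metres.
t_value, pooled_var, dof = pooled_t(35.765, 0.134, 138, 35.787, 0.215, 22)
print(f"t = {t_value:.3f}, pooled variance = {pooled_var:.3f}, dof = {dof}")
```

With 158 degrees of freedom, the two-sided critical values quoted above (1.975 and 2.607) are the ones to compare against.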
The value of T is less than both critical values, therefore the means are not significantly
different. This is what was expected: even at the 99% confidence level there is not enough evidence
to say that the means differ. The difference in the variances and the means can be visualised when
the primary and secondary data are graphed as a histogram. The following histogram uses equal bin
intervals of 0.1m, centred on observations of the mean sea level above the coordinate reference
ellipsoid in metres rounded to the first decimal place, i.e. regular 10cm intervals.
Histogram comparing the primary data set and secondary data set of mean sea level as heights (m)
above the ellipsoid from satellite radar measurements after outliers have been removed
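The binning used for the histogram, 0.1m intervals centred on heights rounded to one decimal place, can be sketched as follows. The heights are hypothetical and for illustration only, and `bin_to_decimetre` is an illustrative name.

```python
from collections import Counter

def bin_to_decimetre(heights):
    """Count observations per 0.1m bin centred on heights rounded to one decimal."""
    return Counter(round(h, 1) for h in heights)

# Hypothetical heights in metres, NOT taken from either data set.
sample_heights = [35.71, 35.76, 35.78, 35.83, 35.92]
for centre, count in sorted(bin_to_decimetre(sample_heights).items()):
    print(f"{centre:.1f}m: {'#' * count}")
```

Values lying exactly on a bin boundary (such as 35.75) depend on floating-point rounding behaviour, so a real implementation may prefer explicit bin edges.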
Although there is enough evidence to say that the secondary data is significantly noisier than the
primary data, there is not enough evidence to say that this noise is because of high solar activity.
There is also not enough evidence to say that it is not caused by solar activity. The noise may be
caused by high solar activity or by other factors such as an instrumental bias in the satellite or
atmospheric factors. The variance of the primary data may also be less than that of the secondary
data due to the central limit theorem, because there may simply not be enough observations in the
secondary data to compare it meaningfully with the primary data set.
Error propagation
The standard error of mean sea level can be predicted from the given parameters of the satellite, two
equations to model the relationship between the parameters, and their standard errors. The first
given equation is the following expression for the height of mean sea level (hMSL):
$$h_{MSL} = h_E - A - T$$
Where hE is the height of the satellite above the coordinate reference ellipsoid, A is the height of the
satellite above the sea surface, and T is a tidal correction. This can be partially differentiated with
respect to each term and error propagation can be applied to determine the standard error of hMSL.
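$$\frac{\partial h_{MSL}}{\partial h_E} = 1, \qquad \frac{\partial h_{MSL}}{\partial A} = -1, \qquad \frac{\partial h_{MSL}}{\partial T} = -1$$
$$\sigma_{h_{MSL}}^2 = \left(\frac{\partial h_{MSL}}{\partial h_E}\right)^2 \sigma_{h_E}^2 + \left(\frac{\partial h_{MSL}}{\partial A}\right)^2 \sigma_A^2 + \left(\frac{\partial h_{MSL}}{\partial T}\right)^2 \sigma_T^2$$
$$\sigma_{h_{MSL}}^2 = \sigma_{h_E}^2 + \sigma_A^2 + \sigma_T^2$$
...(1)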
This is expected. Although the terms are not arithmetically independent (for example, an increase in
the height of the satellite above the ellipsoid will, ceteris paribus, increase the height of the
satellite above the sea surface), they are algebraically linearly independent. Geometrically, though,
this assumes collinearity, i.e. that the position of the satellite, the coordinate position of the sea
surface on the ellipsoid, the target of the radar signal on the sea surface and the primary
orientation axis of the tidal correction are all in a straight line. The standard error of the height
of the satellite above the ellipsoid is not given and is derived from the following given equation:
$$h_E = \sqrt{X^2 + Y^2 + Z^2} - R$$
Where X, Y and Z are the geocentric coordinates of the satellite and R is the Earth's radius, or
rather the radius of the ellipsoid at the coordinate position of the target point on the sea surface.
The radius of an ellipsoid varies with coordinate position! Partial differentiation with the chain
rule:
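let $u = X^2 + Y^2 + Z^2$
$$\frac{\partial h_E}{\partial u} = \frac{1}{2\sqrt{u}}$$
$$\frac{\partial h_E}{\partial X} = \frac{\partial h_E}{\partial u} \cdot \frac{\partial u}{\partial X}$$
$$\frac{\partial h_E}{\partial X} = \frac{X}{\sqrt{X^2 + Y^2 + Z^2}}$$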
Likewise for Y and Z and assuming R is constant (i.e. spherical), error propagation is as follows:
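$$\sigma_{h_E}^2 = \left(\frac{\partial h_E}{\partial X}\right)^2 \sigma_X^2 + \left(\frac{\partial h_E}{\partial Y}\right)^2 \sigma_Y^2 + \left(\frac{\partial h_E}{\partial Z}\right)^2 \sigma_Z^2$$
$$\sigma_{h_E}^2 = \frac{X^2}{X^2 + Y^2 + Z^2}\,\sigma_X^2 + \frac{Y^2}{X^2 + Y^2 + Z^2}\,\sigma_Y^2 + \frac{Z^2}{X^2 + Y^2 + Z^2}\,\sigma_Z^2$$
...(2)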
Therefore a predicted standard error of mean sea level can be calculated by substituting the given
satellite parameters into Equation 2 and substituting this derived standard error of ellipsoidal height
of the satellite into Equation 1 since the standard errors of A and T are also given. Three standard
errors of MSL (MSL is the population not the mean of the sample data) have now been calculated:
Standard error of MSL from primary data: 0.022m
Standard error of MSL from secondary data: 0.036m
Predicted standard error of MSL: 0.15m
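The substitution of Equation 2 into Equation 1 can be sketched as follows. The satellite coordinates and standard errors used here are hypothetical placeholders, since the coursework's given parameters are not reproduced in the text, so the output is not the 0.15m prediction. The function names are illustrative, and R is assumed constant as in the derivation.

```python
from math import sqrt

def sigma_h_e(x, y, z, sigma_x, sigma_y, sigma_z):
    """Equation 2: standard error of h_E = sqrt(X^2 + Y^2 + Z^2) - R, with R constant."""
    r_squared = x * x + y * y + z * z
    variance = (x * x * sigma_x ** 2 + y * y * sigma_y ** 2 + z * z * sigma_z ** 2) / r_squared
    return sqrt(variance)

def sigma_h_msl(sigma_he, sigma_a, sigma_t):
    """Equation 1: standard error of h_MSL from the three terms' standard errors."""
    return sqrt(sigma_he ** 2 + sigma_a ** 2 + sigma_t ** 2)

# Hypothetical geocentric coordinates and standard errors, all in metres.
x, y, z = 4.0e6, 3.0e6, 5.0e6
s_he = sigma_h_e(x, y, z, 0.10, 0.10, 0.10)
s_msl = sigma_h_msl(s_he, 0.05, 0.02)
print(f"sigma_hE = {s_he:.3f}m, sigma_hMSL = {s_msl:.3f}m")
```

Two sanity checks follow from Equation 2: when the three coordinate standard errors are equal, the standard error of hE equals that common value, and when the satellite lies on one axis only that axis's standard error contributes.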
There are limits to how meaningfully these standard errors can be compared because a) radar
measurements of the sea surface from a satellite in space cannot be accurate to the millimetre, b) an
approximation of the geocentric coordinates of the satellite is made, and c) wave heights on the sea
surface vary with wind speed or perhaps tsunamis. Furthermore, an F Test cannot be used to compare
the variance of the primary data with the predicted variance because there were no degrees of
freedom in calculating the predicted standard error of mean sea level by error propagation.
Regardless, the difference between this predicted standard error of MSL and the standard deviation
of the primary data (0.134m) is less than 2cm, so a pragmatic judgement can be made that there is no
significant difference between the two, because satellite radar measurements simply cannot be that
accurate. As a corollary to the statement that the secondary data is significantly noisier than the
primary data, the secondary data is also significantly noisier (more erroneous) than the predicted
standard error of mean sea level suggests.
Conclusion
A primary and a secondary data set of satellite radar measurements of mean sea level heights above
an ellipsoid were analysed. Two outliers were removed from the primary data set. The mean and
standard errors of the primary data set were calculated before and after the two outliers were
removed. A Chi-squared Test was used to compare the primary data with the normal distribution.
There is not enough evidence to say that the primary data set is not normally distributed. The mean
and standard errors of the secondary data set were calculated and compared to the primary data set.
An F Test showed that the secondary data is significantly noisier than the primary data and a T Test
showed that the mean of the secondary data is not significantly different from the mean of the
primary data (after outliers were removed). By error propagation and geometric assumptions, the
standard error of mean sea level was predicted. This prediction was not significantly different from
the standard deviation of the primary data, but was significantly different from the standard
deviation of the secondary data set.
