INTRODUCTION
Statistics enables you to collect, organize, analyze, and interpret data in order to draw conclusions.
DEFINITIONS
1. Statistics: the study of how to collect, organize, analyze, and
interpret data.
2. Data: the values recorded in an experiment or observation.
3. Population: refers to any collection of individual items or units
that are the subject of investigation.
4. Sample: a small representative part of a population is called a
sample.
5. Observation: each unit in the sample provides a record, such as a
measurement, which is called an observation.
6. Sampling: the process of selecting a sample from a population.
7. Variable: a characteristic of an item or individual that can take different values.
8. Raw Data: Data collected in original form.
9. Frequency: The number of times a certain value or class of
values occurs.
10. Tabulation: can be defined as the logical and systematic
arrangement of statistical data in rows and columns.
11. Frequency Distribution: The organization of raw data in table
form with classes and frequencies.
12. Class Limits: Separate one class in a grouped frequency
distribution from another. The limits could actually appear in
the data and have gaps between the upper limit of one class and
the lower limit of the next.
13. Class Boundaries: Separate one class in a grouped frequency
distribution from another. Unlike class limits, the boundaries do
not appear in the data, and there are no gaps between the upper
boundary of one class and the lower boundary of the next.
14. Cumulative Frequency: The number of values less than the
upper class boundary for the current class. This is a running
total of the frequencies.
15. Histogram: A graph which displays the data by using vertical
bars of various heights to represent frequencies.
16. Frequency Polygon: a line graph. The frequency is placed
along the vertical axis and the class midpoints are placed along
the horizontal axis. These points are connected with lines.
17. Pie Chart: Graphical depiction of data as slices of a pie. The
frequency determines the size of the slice. The number of
degrees in any slice is the relative frequency times 360 degrees.
18. Central tendency - a typical or representative value for a
dataset.
VARIABLES
Variables can be continuous or discrete.
Variables can also be:
- Independent
- Dependent
Methods of sampling:
Random sampling:
- Lottery method (simple random sampling)
- Cluster sampling
Non-random sampling:
- Convenience sampling
- Purposive sampling
- Quota sampling
SCALES OF MEASUREMENT
The four scales of measurement are nominal, ordinal, interval, and ratio.
The interval scale, for example:
- It is quantitative in nature.
- The individual units are equidistant from one point to the other.
PROCESSING OF DATA
Classification is the process of arranging data on the basis of some common characteristics possessed by
them.
The two broad categories of analysis are:
- Descriptive statistics
- Inferential statistics
Descriptive statistics are concerned with describing the characteristics of frequency distributions. The
common methods in descriptive analyses are:
- Measures of central tendency
- Measures of dispersion
- Scatterplot
Inferential statistics help to decide whether the outcome of a study is a result of factors planned
within the design of the study or determined by chance. Common inferential statistical tests are:
- T-tests
- Chi-square test
- Pearson correlation
Frequency Distribution
A frequency distribution is a statistical table containing groups of values according to the number of times a
value occurs.
- Raw data are usually not in order.
- The first class and the last class are fixed by finding the lowest and highest values.
- The lowest and highest numbers of each class are called the class limits (lower & upper).
Classes can be formed by two methods:
1. Inclusive method
2. Exclusive method
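As a sketch of the exclusive method, a grouped frequency distribution can be built in a few lines of Python. The data values and class width below are illustrative only, not taken from the text:

```python
# Build a grouped frequency distribution (exclusive method:
# the upper limit of each class is excluded from that class).
# Data values and class width are made up for illustration.

data = [12, 15, 21, 9, 30, 18, 25, 14, 22, 27]
class_width = 10
low = (min(data) // class_width) * class_width  # start of the first class

freq = {}
while low <= max(data):
    upper = low + class_width
    # exclusive method: count values with low <= x < upper
    freq[(low, upper)] = sum(1 for x in data if low <= x < upper)
    low = upper

for (lo, up), f in freq.items():
    print(f"{lo}-{up}: {f}")
```

Every observation falls into exactly one class, so the frequencies sum to the sample size.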
PRESENTATION OF DATA
1. Tabular presentation
2. Diagrammatic presentation
3. Graphical presentation
E.g. amount of oxygen in four water samples:

Water sample         1     2     3     4
Amount of O2 (mL)    4.5   6.9   6.2   5.3
Diagrammatic Presentation
It is a visual form of presentation of statistical data in which data are presented in the form of diagrams such
as bars, lines, circles, and maps.
Advantages:
- It is more attractive.
- It saves time.
Common Types:
1. Line diagram
2. Pie diagram
3. Bar diagram
Line diagram
E.g. a traffic survey shows the following vehicles passing a particular bus stop during one hour:

Vehicle         Frequency
Cars            45
Lorries         22
Motor Cycles    6
Buses           3
Total           76
Pie Diagram
E.g. numbers of students in four groups:

Students: 5, 20, 10, 15 (total 50)
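Per definition 17 above, the number of degrees in each slice is the relative frequency times 360. A short Python sketch using the student counts from this example; the group labels are assumed, since they were not given in the text:

```python
# Convert frequencies to pie-slice angles:
# degrees = relative frequency * 360.
# The four counts come from the example above; the group
# labels were not given, so generic names are assumed.

students = {"Group A": 5, "Group B": 20, "Group C": 10, "Group D": 15}
total = sum(students.values())  # 50

angles = {g: n / total * 360 for g, n in students.items()}
for g, deg in angles.items():
    print(f"{g}: {deg:.0f} degrees")
```

The angles always sum to 360 degrees, which is a quick sanity check on any hand-drawn pie chart.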
Bar Diagram
Example: yield of various vegetables from a garden.
Usually, the independent variable is marked on the X-axis and the dependent variable on the Y-axis.
Common Types:
1. Histogram
2. Frequency Polygon
3. Frequency curve
Histogram
- It is an area diagram.
- A histogram differs from a bar diagram: the bar diagram is one-dimensional, whereas the histogram is two-dimensional.
Uses of histogram
E.g. frequency distribution of systolic blood pressure:

Systolic BP (mmHg)   Number of Persons
100-109              7
110-119              16
120-129              19
130-139              31
140-149              41
150-159              23
160-169              10
170-179
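Using the counts in this table, the cumulative frequency (definition 14) is a running total of the class frequencies. A minimal Python sketch; the count for the last class (170-179) was not given, so that class is omitted here:

```python
# Cumulative frequency: the number of observations up to and
# including each class, i.e. a running total of the frequencies.
# Counts are taken from the systolic BP table above; the 170-179
# class is omitted because its count was not given.

classes = ["100-109", "110-119", "120-129", "130-139",
           "140-149", "150-159", "160-169"]
counts = [7, 16, 19, 31, 41, 23, 10]

cumulative = []
running = 0
for c in counts:
    running += c
    cumulative.append(running)

for cls, f, cf in zip(classes, counts, cumulative):
    print(f"{cls}: f={f}, cumulative={cf}")
```

The final cumulative value equals the total number of persons counted, which is a useful check.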
DESCRIPTIVE STATISTICS
Measures of central tendency
A measure of central tendency is a single number used to represent the centre of a set of
data.
For any symmetrical distribution, the mean, median, and mode will be identical.
Mean
The mean (or average) is found by adding all the numbers and then dividing by how many
numbers you added together.
3,4,5,6,7
3+4+5+6+7= 25
25 divided by 5 = 5
The mean is 5
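The same calculation in Python:

```python
# Mean = sum of the values divided by how many values there are.
# Uses the example data from the text.
values = [3, 4, 5, 6, 7]
mean = sum(values) / len(values)
print(mean)  # 25 / 5
```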
Advantages of mean:
- It uses every value in the data set.
- It is unique and simple to calculate.
Disadvantages of mean:
- It is affected by extreme values (outliers).
Median
When the numbers are arranged in numerical order, the middle one is the median.
50% of observations are above the Median, 50% are below it.
Formula: Median position = (n + 1) / 2.
Example:
3,6,2,5,7 → arranged in order: 2,3,5,6,7
The median is 5
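The sorting step and the (n + 1) / 2 position formula in Python:

```python
# Median: sort the values, then take the middle one.
# For an odd number of observations, the position formula
# (n + 1) / 2 gives the 1-based rank of the median.
values = [3, 6, 2, 5, 7]
ordered = sorted(values)   # [2, 3, 5, 6, 7]
n = len(ordered)
position = (n + 1) // 2    # 3rd value (1-based)
median = ordered[position - 1]
print(median)
```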
Advantages:
- It is not affected by extreme values.
Disadvantages:
- It does not take every value into account.
Mode
We usually find the mode by creating a frequency distribution in which we count how often
each value occurs.
If we find that every value occurs only once, the distribution has no mode.
If we find that two or more values are tied as the most common, the distribution has more
than one mode.
Example:
2,2,2,4,5,6,7,7,7,7,8
The mode is 7
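The frequency-count approach in Python, using collections.Counter; if two or more values tie for the top count, all of them are returned:

```python
# Mode: the most frequent value, found via a frequency count.
from collections import Counter

values = [2, 2, 2, 4, 5, 6, 7, 7, 7, 7, 8]
counts = Counter(values)          # how often each value occurs
top = max(counts.values())        # the highest frequency
modes = [v for v, c in counts.items() if c == top]
print(modes)                      # 7 occurs four times, 2 only three
```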
Advantages:
- It can be used for nominal (categorical) data.
- It is not affected by extreme values.
Disadvantages:
- A distribution may have no mode, or more than one mode.
Nominal variables
- Mode
Ordinal variables
- Median
Interval/ratio variables
- Mean
MEASURES OF VARIABILITY
If there were no variability within populations, there would be no need for statistics.
The common measures of variability are:
- range,
- variance, and
- standard deviation.
These indices answer the question: how spread out is the distribution?
Range
It refers to the difference between the highest and lowest values produced.
For continuous variables, the range is the arithmetic difference between the highest and
lowest observations in the sample. In the case of counts or measurements, 1 should be added
to the difference because the range is inclusive of the extreme observations.
Another statistic, known as the interquartile range, describes the interval of scores bounded by
the 25th and 75th percentile ranks; it covers the middle 50 percent of the distribution.
The p-th percentile is the value that has p% of the measurements below it and (100-p)%
above it.
Thus, the 20th percentile is the value such that one fifth of the data lie below it. It is higher
than 20% of the data values and lower than 80% of the data values.
E.g. if you are in the 80th percentile on a GMAT result, you scored better on that
section than 80% of the students taking the GMAT.
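A sketch combining the range and the quartile-based interquartile range. The data set is made up for illustration; statistics.quantiles with method="inclusive" interpolates percentiles from the sorted sample:

```python
# Range and interquartile range for a small sample.
# The data set is invented for illustration. quantiles(..., n=4)
# returns the 25th, 50th and 75th percentiles; method="inclusive"
# interpolates linearly between sorted observations.
import statistics

data = [2, 4, 4, 5, 7, 9, 10, 12, 13, 15]
value_range = max(data) - min(data)   # highest minus lowest
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                         # middle 50% of the scores
print(value_range, q1, q2, q3, iqr)
```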
Standard deviation
Large standard deviations suggest that scores are probably widely scattered.
Small standard deviations suggest that there is very little difference among scores.
Example: consider eight data points with a mean (or average) value of 5.
To calculate the population standard deviation, first compute the difference of each data point
from the mean, and square the result.
Next, divide the sum of these squared values by the number of values and take the square root to give
the standard deviation.
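The two steps above, in Python. The original eight data points were not reproduced in the text, so the set below is an assumed example with eight values whose mean is 5:

```python
# Population standard deviation, step by step.
# The data set is an assumed example: eight values with mean 5.
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)                     # 5.0

# Step 1: squared deviation of each point from the mean.
squared_diffs = [(x - mean) ** 2 for x in data]

# Step 2: average the squared deviations, then take the square root.
variance = sum(squared_diffs) / len(data)        # population variance
std_dev = math.sqrt(variance)
print(mean, variance, std_dev)
```

Note that the intermediate quantity computed in step 2, before the square root, is exactly the variance discussed next.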
Variance
The variance is the average of the squared deviations from the mean; it is the square of the standard deviation.
CHI-SQUARE TEST
Chi-square test is an inferential statistics technique designed to test for significant relationships
between two variables organized in a bivariate table.
Chi-square requires no assumptions about the shape of the population distribution from which a
sample is drawn. However, like all inferential techniques it assumes random sampling.
It can be applied to variables measured at a nominal and/or an ordinal level of measurement.
The research hypothesis (H1) proposes that the two variables are related in the population.
The null hypothesis (H0) states that no association exists between the two cross-tabulated variables in
the population, and therefore the variables are statistically independent.
Formula for computing the chi-square statistic:
χ² = Σ (fo − fe)² / fe
where fo = observed frequency and fe = expected frequency.
The essence of the chi-square test is to compare the observed frequencies with the frequencies
expected under independence. If the difference between observed and expected frequencies is large,
then we can reject the null hypothesis of independence.
Determining the Degrees of Freedom
df = (r − 1)(c − 1)
where, r = the number of rows and c = the number of columns
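The whole procedure for a small bivariate table can be sketched in Python. The observed counts are invented for illustration; the expected counts assume independence, fe = row total × column total / grand total:

```python
# Chi-square statistic for a 2x2 bivariate table:
# chi2 = sum over cells of (fo - fe)**2 / fe,
# where fe = row total * column total / grand total (independence).
# The observed counts below are invented for illustration.

observed = [[20, 30],
            [30, 20]]

row_totals = [sum(row) for row in observed]        # [50, 50]
col_totals = [sum(col) for col in zip(*observed)]  # [50, 50]
grand = sum(row_totals)                            # 100

chi2 = 0.0
for i, row in enumerate(observed):
    for j, fo in enumerate(row):
        fe = row_totals[i] * col_totals[j] / grand # expected count
        chi2 += (fo - fe) ** 2 / fe

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r - 1)(c - 1)
print(chi2, df)
```

The statistic would then be compared against the chi-square distribution with df degrees of freedom to decide whether to reject the null hypothesis.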