Sampling Design
Surveys can gather information through different methods of observation. However, most
surveys employ a questionnaire to measure specific characteristics of the population. There
are two main ways to gather this information.
A census survey collects complete information from all participants in the population.
General criteria of a census survey include:
Establish and maintain a complete list of the primary sampling unit (PSU)
components.
All members of the PSU must be included
Validation (used to correct for missing and misreported data)
Enforceable and enforced
Sample surveys are used when it is not possible or practical to conduct a census to count each
individual of an entire population. Everyday examples of sample surveys include political
polling, health statistics, and television ratings. Sample surveys are a proven, effective
method for gathering accurate information if they are properly designed and the sample
design is accounted for in the estimation methods.
Sampling method refers to the rules and procedures by which some elements of the
population are included in the sample.
Sampling estimator: the rule or formula by which a sample statistic is calculated is called the
estimator. Different sampling methods may use different estimators.
Objectives of sampling:
To obtain optimum results, i.e., the maximum information about the characteristics of the
population, by studying sample values only, within the resources at our disposal in terms of
time, money and manpower. A good sample should satisfy four criteria:
1. Representativeness
2. Accuracy
3. Precision
4. Size
Sampling frame: In statistics, a sampling frame is the source material or device from which a
sample is drawn. It is a list of all those within a population who can be sampled, and may include
individuals, households or institutions.
Sampling unit: A unit is the thing being studied. Usually in social research this is people. There may
also be additional selection criteria used to choose the units to study, such as 'people who have been
police officers for at least five years.'
Sample size: The larger your sample size, the more confident you can be that the sample's
answers truly reflect the population. For a given confidence level, a larger sample size gives
a smaller confidence interval.
Determining sample size is a very important issue because samples that are too large may
waste time, resources and money, while samples that are too small may lead to inaccurate
results.
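This trade-off can be made concrete with a short sketch. The z value, population standard deviation and margin of error below are hypothetical, and the formula assumes a z-based interval for a mean with known sigma:

```python
import math

def required_sample_size(z, sigma, margin_of_error):
    """Smallest n for which a z-based confidence interval for the mean
    has half-width no larger than margin_of_error (sigma assumed known)."""
    return math.ceil((z * sigma / margin_of_error) ** 2)

# Hypothetical figures: 95% confidence (z = 1.96), population SD of 15,
# desired margin of error of 2 units.
print(required_sample_size(1.96, 15, 2))  # 217
print(required_sample_size(1.96, 15, 1))  # 865: halving the margin of
                                          # error roughly quadruples n
```

Note how the required n grows with the square of the precision demanded, which is exactly why over-large samples waste resources while under-small ones give wide, uninformative intervals.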
Basis for Comparison | Questionnaire | Schedule
Meaning | A technique of data collection consisting of a series of written questions along with alternative answers | A formalized set of questions, statements and spaces for answers, provided to enumerators who ask the questions of respondents and note down the answers
Filled by | Respondents | Enumerators
Response rate | Low | High
Coverage | Large | Comparatively small
Cost | Economical | Expensive
Respondent's identity | Not known | Known
Success relies on | Quality of the questionnaire | Honesty and competence of the enumerator
Usage | Only when people are literate and cooperative | Both literate and illiterate people
Other methods of collecting data through primary data collection are observation,
experimentation, simulation, panel methods and interview methods.
Sampling Techniques:
Probability or Random Sampling Techniques:
A probability sampling method is any method of sampling that utilizes some form of random
selection. In order to have a random selection method, you must set up some process or procedure that
assures that the different units in your population have equal probabilities of being chosen.
1. Simple Random Sampling: The simplest form of random sampling is called simple random
sampling. Simple random sampling is not the most statistically efficient method of sampling and you
may, just because of the luck of the draw, not get good representation of subgroups in a population.
2. Stratified Random Sampling, also sometimes called proportional or quota random sampling,
involves dividing your population into homogeneous subgroups and then taking a simple random
sample in each subgroup.
4. Quota sampling: This is one of the most common forms of non-probability sampling.
Sampling is done until a specific number of units (quotas) for various sub-populations have
been selected. Since there are no rules as to how these quotas are to be filled, quota sampling
is really a means for satisfying sample size objectives for certain sub-populations.
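The distinction between these techniques can be sketched in code. This is a minimal illustration with a made-up population; the stratum labels and quota numbers are assumptions for demonstration:

```python
import random

population = [{"id": i, "stratum": "urban" if i % 3 else "rural"} for i in range(1, 31)]

# Simple random sampling: every unit has an equal chance of selection.
random.seed(1)
srs = random.sample(population, 6)

# Stratified random sampling: a separate simple random sample from each
# homogeneous subgroup (here, with proportional allocation).
def stratified_sample(pop, key, total_n):
    strata = {}
    for unit in pop:
        strata.setdefault(unit[key], []).append(unit)
    out = []
    for members in strata.values():
        k = round(total_n * len(members) / len(pop))
        out.extend(random.sample(members, k))
    return out

strat = stratified_sample(population, "stratum", 6)

# Quota sampling (non-probability): keep taking units until each quota is
# met, with no randomness requirement on *which* units fill it.
quotas = {"urban": 4, "rural": 2}
quota_sample = []
for unit in population:            # any convenient order will do
    if quotas.get(unit["stratum"], 0) > 0:
        quota_sample.append(unit)
        quotas[unit["stratum"]] -= 1

print(len(srs), len(strat), len(quota_sample))  # 6 6 6
```

All three produce samples of the same size; only the first two give every population unit a known probability of selection.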
Sampling error: Sampling error is the deviation of the selected sample from the true characteristics,
traits, behaviors, qualities or figures of the entire population. The errors which arise because of
studying only a part of the total population.
Non Sampling error: These are errors which arise from sources other than sampling. The
errors of observation, errors of measurement and errors of responses are some non-sampling
errors.
Data Collection is an important aspect of any type of research study. Inaccurate data
collection can impact the results of a study and ultimately lead to invalid results. Data
collection methods for impact evaluation vary along a continuum: at one end are quantitative
methods and at the other end are qualitative methods for data collection.
Advantages of primary data collection:
1) Targeted issues are addressed. The organization asking for the research has complete
control over the process, and the research is streamlined as far as its objectives and scope are
concerned. The researching company can be asked to concentrate its efforts on a specific
market rather than on the mass market.
2) Data interpretation is better. The collected data can be examined and interpreted by the
marketers depending on their needs rather than relying on the interpretation made by
collectors of secondary data.
3) Recency of data. Secondary data is usually not very recent, and it may not be specific to
the place or situation the marketer is targeting. Primary data, being collected directly for the
current purpose, is a more accurate tool because it reflects the current scenario.
4) Proprietary issues. The collector of primary data is the owner of that information and need
not share it with other companies and competitors. This gives an edge over competitors
relying on secondary data.
Disadvantages of primary data collection:
1) High cost. Collecting data through primary research is a costly proposition, as the marketer
has to be involved throughout and has to design everything.
2) Time Consuming. Because of exhaustive nature of the exercise, the time required to do
research accurately is very long as compared to secondary data, which can be collected in
much lesser time duration.
3) Inaccurate feedback. If the research involves taking feedback from the target audience,
there is a high chance that the feedback given is not correct. Feedback, by its very nature, is
often biased or given just for the sake of it.
4) More resources required. Apart from cost and time, other resources such as human
resources and materials are also needed in larger quantities to do surveys and data collection.
Advantages of secondary data:
1. The first advantage of using secondary data (SD) has always been the saving of time. New
technology has revolutionized this process: precise information can be obtained via search
engines, and many libraries have digitized their collections so that students and researchers
can perform more advanced searches.
Disadvantages of secondary data:
1. Inappropriateness of the data. Primary data is collected with a concrete idea in mind,
usually to answer a research question or to meet certain objectives. Secondary data sources
may provide vast amounts of information, but quantity is not synonymous with
appropriateness, simply because the data was collected to answer a different research
question or objective.
2. Lack of control over data quality: Government and other official institutions are often a
guarantee of quality data, but it is not always the case.
5. Mail survey
6. Panel method
Tools of data collection:
1. Questionnaire
2. Check list
3. Rating scale
4. Inventories
5. Interview schedule
6. Interview guide
7. Observation schedule
Measurement and Scaling Techniques
1. Nominal Scale: Nominal scales are used for labelling variables, without any quantitative
value. “Nominal” scales could simply be called “labels”: the categories are mutually
exclusive (no overlap) and none of them have any numerical significance. A good way to
remember this is that “nominal” sounds a lot like “name”, and nominal scales are kind of like
“names” or labels.
2. Ordinal Scale: With ordinal scales, it is the order of the values that is important and
significant; the differences between the values are not really known.
3. Interval scales are numeric scales in which we know not only the order, but also the exact
differences between the values.
The classic example of an interval scale is Celsius temperature because the difference
between each value is the same. For example, the difference between 60 and 50 degrees is a
measurable 10 degrees, as is the difference between 80 and 70 degrees. Time is another good
example of an interval scale in which the increments are known, consistent, and measurable.
Interval scales are nice because the realm of statistical analysis on these data sets opens
up. For example, central tendency can be measured by mode, median, or mean; standard
deviation can also be calculated.
4. Ratio scales are the ultimate nirvana when it comes to measurement scales: they tell us
about the order, they tell us the exact value between units, AND they also have an absolute
zero, which allows a wide range of both descriptive and inferential statistics to be applied.
Arbitrary scale, Rensis Likert Scale, Semantic Differential Scale (SD), Factor scale, Attitude
scales are some of the common scales used to collect data.
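Since interval- and ratio-level data admit the full range of descriptive statistics mentioned above, here is a quick sketch using Python's standard statistics module (the temperature readings are made up):

```python
import statistics

# Hypothetical interval-scale data: daily temperatures in degrees Celsius.
temps = [18, 21, 21, 24, 26, 30]

print(statistics.mode(temps))             # 21
print(statistics.median(temps))           # 22.5
print(round(statistics.mean(temps), 2))   # 23.33
print(round(statistics.stdev(temps), 2))  # 4.27 (sample standard deviation)
```

On a purely nominal scale only the mode would be meaningful; the median needs at least ordinal data, and the mean and standard deviation need interval or ratio data.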
Types of hypothesis
1. Simple hypothesis – one that states a relationship between two variables (cause and
effect). Ex: Smoking leads to cancer.
2. Complex hypothesis – one that states a relationship among more than two variables.
Ex: Smoking and other drugs lead to cancer and chest infection etc.
6. Logical hypothesis – one which is verified logically. We may have agreement,
disagreement, difference or residue as the outcomes.
When you do a hypothesis test, two types of errors are possible: type I and type II. The risks
of these two errors are inversely related and determined by the level of significance and the
power for the test. Therefore, you should determine which error has more severe
consequences for your situation before you define their risks.
Type I error - When the null hypothesis is true and you reject it, you make a type I error.
The probability of making a type I error is α, which is the level of significance you set for
your hypothesis test. An α of 0.05 indicates that you are willing to accept a 5% chance that
you are wrong when you reject the null hypothesis. To lower this risk, you must use a lower
value for α. However, using a lower value for alpha means that you will be less likely to
detect a true difference if one really exists.
Type II error - When the null hypothesis is false and you fail to reject it, you make a type II
error. The probability of making a type II error is β, which depends on the power of the test.
You can decrease your risk of committing a type II error by ensuring your test has enough
power. You can do this by ensuring your sample size is large enough to detect a practical
difference when one truly exists.
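The meaning of α as a long-run error rate can be checked by simulation. This is a sketch only: the population mean, SD, sample size and seed are arbitrary choices, and a z-test with known σ is used for simplicity:

```python
import random
import statistics

# Monte Carlo sketch: draw many samples from a population where H0 is TRUE
# (the mean really is 100) and count how often a two-tailed z-test at
# alpha = 0.05 rejects. The rejection rate should be close to alpha --
# that is the Type I error rate.
random.seed(42)
CRITICAL_Z = 1.96            # two-tailed critical value for alpha = 0.05
MU0, SIGMA, N = 100, 10, 50

rejections = 0
trials = 2000
for _ in range(trials):
    sample = [random.gauss(MU0, SIGMA) for _ in range(N)]
    z = (statistics.mean(sample) - MU0) / (SIGMA / N ** 0.5)
    if abs(z) > CRITICAL_Z:
        rejections += 1

rate = rejections / trials
print(rate)  # close to 0.05
```

Rerunning the same loop with a false H0 (say, a true mean of 103) would instead estimate the power of the test, 1 − β.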
Type I and Type II Errors
PROCEDURE FOR HYPOTHESIS TESTING
1. Making assumptions
2. Stating the research and null hypotheses and selecting (setting) alpha
3. Selecting the sampling distribution and specifying the test statistic
4. Computing the test statistic
5. Making a decision and interpreting the results
1. Null hypothesis - The null hypothesis is a clear statement about the relationship between
two (or more) statistical objects. These objects may be measurements, distributions, or
categories. Typically, the null hypothesis, as the name implies, states that there is no
relationship.In the case of two population means, the null hypothesis might state that the
means of the two populations are equal.
2. Alternative hypothesis - Once the null hypothesis has been stated, it is easy to construct
the alternative hypothesis. It is essentially the statement that the null hypothesis is false. In
our example, the alternative hypothesis would be that the means of the two populations are
not equal.
3. Significance level - The significance level is something that you should specify up front.
In applications, the significance level is typically one of three values: 10%, 5%, or 1%. A
1% significance level represents the strictest test of the three, since it demands stronger
evidence before the null hypothesis is rejected than a 10% level does.
4. Power - Related to significance, the power of a test measures the probability of correctly
rejecting the null hypothesis when it is false. Power is not something that you can choose. It
is determined by several factors, including the significance level you select and the size of
the difference between the things you are trying to compare.
Unfortunately, significance and power are inversely related: tightening the significance level
(say, from 5% to 1%) reduces power. This makes it difficult to design experiments that have
both a very strict significance level and high power.
5. Test statistic - The test statistic is a single measure that captures the statistical nature of
the relationship between observations you are dealing with. The test statistic depends
fundamentally on the number of observations that are being evaluated. It differs from
situation to situation.
6. Distribution of the test statistic - The whole notion of hypothesis rests on the ability to
specify (exactly or approximately) the distribution that the test statistic follows. In the case of
this example, the difference between the means will be approximately normally distributed
(assuming there are a relatively large number of observations).
7. One-tailed vs. two-tailed tests - Depending on the situation, you may want (or need) to
employ a one- or two-tailed test. These tails refer to the right and left tails of the distribution
of the test statistic. A two-tailed test allows for the possibility that the test statistic is either
very large or very small (negative is small). A one-tailed test allows for only one of these
possibilities.
8. Critical value - The critical value in a hypothesis test is based on two things: the
distribution of the test statistic and the significance level. The critical value(s) refer to the
point in the test statistic distribution that give the tails of the distribution an area (meaning
probability) exactly equal to the significance level that was chosen.
9. Decision - Your decision to reject or fail to reject the null hypothesis is based on
comparing the test statistic to the critical value. If the test statistic exceeds the critical value,
you should reject the null hypothesis; in this case, you would say that the difference between
the two population means is significant. Otherwise, you fail to reject the null hypothesis.
10. P-value - The p-value of a hypothesis test gives you another way to evaluate the null
hypothesis. The p-value is the smallest significance level at which your particular test
statistic would justify rejecting the null hypothesis. For example, if you have chosen a
significance level of 5% and the p-value turns out to be .03 (or 3%), you would be justified
in rejecting the null hypothesis, since .03 < .05.
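The whole procedure, critical value and p-value included, can be sketched for a one-sample z-test. The data and the claimed mean below are hypothetical, and the population SD is assumed known for simplicity:

```python
import math
import statistics

def z_test(sample, mu0, sigma, alpha=0.05):
    """Two-tailed one-sample z-test (population SD assumed known)."""
    n = len(sample)
    z = (statistics.mean(sample) - mu0) / (sigma / math.sqrt(n))
    # Normal CDF built from erf; the p-value is the smallest alpha at
    # which this z would lead to rejection.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    # Equivalent decision rule: reject when |z| exceeds the critical
    # value (1.96 for alpha = 0.05, two-tailed).
    return z, p_value, p_value < alpha

# Hypothetical readings tested against a claimed population mean of 50.
sample = [52.1, 49.8, 53.0, 51.2, 50.5, 52.8, 49.9, 51.7, 52.3, 50.4]
z, p, reject = z_test(sample, mu0=50, sigma=2)
print(round(z, 2), round(p, 4), reject)  # reject is True (p ≈ 0.03 < 0.05)
```

Comparing the p-value to α and comparing the test statistic to the critical value always give the same decision; they are two views of the same rule.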
STATISTICAL TECHNIQUES
Correlation – It is a statistical measure that indicates the extent to which two or more
variables fluctuate together. A positive correlation indicates the extent to which those
variables increase or decrease in parallel; a negative correlation indicates the extent to which
one variable increases as the other decreases.
Types of correlation
1. Positive correlation occurs when an increase in one variable increases the value in
another. The line corresponding to the scatter plot is an increasing line.
2. Negative correlation occurs when an increase in one variable decreases the value of
another. The line corresponding to the scatter plot is a decreasing line.
3. Perfect correlation occurs when there is a functional dependency between the variables.
In this case all the points are in a straight line.
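These cases can be verified numerically. A minimal sketch computing Pearson's r from its definition (the data vectors are arbitrary):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from its definition:
    the covariance of x and y divided by the product of their SDs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
print(round(pearson_r(x, [2, 4, 6, 8, 10]), 10))   # 1.0, perfect positive
print(round(pearson_r(x, [10, 8, 6, 4, 2]), 10))   # -1.0, perfect negative
```

When all points lie exactly on an increasing or decreasing line, r reaches +1 or -1; noisy data gives intermediate values.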
Regression
Simple linear regression is a statistical method that allows us to summarize and study
relationships between two continuous (quantitative) variables: a predictor variable (x) and a
response variable (y).
Because the other terms are used less frequently today, we'll use the "predictor" and
"response" terms to refer to the variables encountered in this course. The other terms are
mentioned only to make you aware of them should you encounter them in other arenas.
Simple linear regression gets its adjective "simple," because it concerns the study of only one
predictor variable. In contrast, multiple linear regression, which we study later in this course,
gets its adjective "multiple," because it concerns the study of two or more predictor variables.
Regression coefficient: The constant 'b' in the regression equation (Ye = a + bX) is called
the regression coefficient. It determines the slope of the line, i.e. the change in the value of
Y corresponding to a unit change in X, and it is therefore also called the "slope coefficient".
Properties of regression coefficients:
2. The value of the coefficient of correlation cannot exceed unity i.e. 1. Therefore, if one of
the regression coefficients is greater than unity, the other must be less than unity.
3. The sign of both the regression coefficients will be same, i.e. they will be either positive
or negative. Thus, it is not possible that one regression coefficient is negative while the other
is positive.
4. The coefficient of correlation will have the same sign as that of the regression
coefficients, such as if the regression coefficients have a positive sign, then “r” will be
positive and vice-versa.
5. The average value of the two regression coefficients will be greater than or equal to the
value of the correlation coefficient. Symbolically, (bxy + byx)/2 ≥ r.
6. The regression coefficients are independent of the change of origin, but not of the scale
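The properties above can be checked numerically. Here is a sketch that fits Ye = a + bX by least squares on made-up data and verifies properties 3 to 5:

```python
import math

def regression_coefficients(x, y):
    """Least-squares fit Ye = a + bX; also returns b_xy (the regression
    of X on Y) so the listed properties can be checked numerically."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    b_yx = sxy / sxx                 # slope coefficient of Y on X
    b_xy = sxy / syy                 # slope coefficient of X on Y
    a = my - b_yx * mx               # intercept of Y on X
    r = sxy / math.sqrt(sxx * syy)   # correlation coefficient
    return a, b_yx, b_xy, r

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
a, b_yx, b_xy, r = regression_coefficients(x, y)

# Property checks: both coefficients share r's sign, r is the geometric
# mean of the two coefficients, and their average is at least |r|.
assert abs(r * r - b_yx * b_xy) < 1e-9
assert (b_yx + b_xy) / 2 >= abs(r) - 1e-9
print(round(b_yx, 2), round(r, 2))  # 0.9 0.9
```

With these particular numbers the two regression coefficients coincide, so the average equals r exactly; with other data the average strictly exceeds it.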
Parametric Tests: A parametric statistical test is one that makes assumptions about the
parameters (defining properties) of the population distribution(s) from which one's data are
drawn.
Parametric tests have both advantages and disadvantages: they are generally more powerful
than their non-parametric counterparts when their distributional assumptions hold, but they
can give misleading results when those assumptions are violated.
Student t tests:
Introduced by William Gosset in 1908. A t-test is any statistical hypothesis test in which the
test statistic follows a Student's t-distribution under the null hypothesis. It can be used to
determine if two sets of data are significantly different from each other.
One sample t test: In testing the null hypothesis that the population mean is equal to a
specified value μ0, one uses the statistic t = (x̄ − μ0) / (s / √n), where x̄ is the sample mean,
s is the sample standard deviation and n is the sample size, with n − 1 degrees of freedom.
The independent samples t-test is used when two separate sets of independent and
identically distributed samples are obtained, one from each of the two populations being
compared.
Paired t test: A test of the null hypothesis that the difference between two responses
measured on the same statistical unit has a mean value of zero.
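A minimal sketch of the one-sample statistic (the readings and the claimed mean μ0 = 12.0 below are made up):

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t = (xbar - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)   # sample SD (n - 1 in the denominator)
    t = (xbar - mu0) / (s / math.sqrt(n))
    return t, n - 1

# Hypothetical readings tested against a claimed mean of 12.0.
t, df = one_sample_t([12.9, 12.1, 12.5, 11.8, 12.7, 12.4], mu0=12.0)
print(round(t, 2), df)  # 2.45 5
```

The resulting t would then be compared against the Student's t critical value for the chosen α and 5 degrees of freedom.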
The one-way analysis of variance (ANOVA) is used to determine whether there are any
statistically significant differences between the means of three or more independent
(unrelated) groups.
Assumptions: the observations are independent, each group is drawn from an approximately
normally distributed population, and the groups have equal variances (homogeneity of
variance).
The ANOVA technique applies when there are two or more independent groups.
The ANOVA procedure is used to compare the means of the comparison groups and is
conducted using the same five step approach used in the scenarios discussed in previous
sections. Because there are more than two groups, however, the computation of the test
statistic is more involved. The test statistic must take into account the sample sizes, sample
means and sample standard deviations in each of the comparison groups.
If one is examining the means observed among, say three groups, it might be tempting to
perform three separate group to group comparisons, but this approach is incorrect because
each of these comparisons fails to take into account the total data, and it increases the
likelihood of incorrectly concluding that there are statistically significant differences, since
each comparison adds to the probability of a type I error. Analysis of variance avoids these
problems by asking a more global question, i.e., whether there are significant differences
among the groups, without addressing differences between any two groups in particular
(although there are additional tests that can do this if the analysis of variance indicates that
there are differences among the groups).
EXAMPLE OF ANOVA
A study is designed to test whether there is a difference in mean daily calcium intake in
adults with normal bone density, adults with osteopenia (a low bone density which may lead
to osteoporosis) and adults with osteoporosis. Adults 60 years of age with normal bone
density, osteopenia and osteoporosis are selected at random from hospital records and invited
to participate in the study. Each participant's daily calcium intake is measured based on
reported food intake and supplements. The data are shown below.
Is there a statistically significant difference in mean calcium intake in patients with normal
bone density as compared to patients with osteopenia and osteoporosis?
In order to determine the critical value of F we need degrees of freedom, df1=k-1 and df2=N-
k. In this example, df1=k-1=3-1=2 and df2=N-k=18-3=15. The critical value is 3.68 and the
decision rule is as follows: Reject H0 if F > 3.68.
SSE requires computing the squared differences between each observation and its group
mean. We will compute SSE in parts. For the participants with normal bone density:
Normal Bone Density    (X - 938.3)    (X - 938.3)²
1200                   261.7          68,486.9
1000                   61.7           3,806.9
980                    41.7           1,738.9
900                    -38.3          1,466.9
750                    -188.3         35,456.9
800                    -138.3         19,126.9
Total                  0              130,083.4
Step 5. Conclusion.
We do not reject H0 because 1.26 < 3.68. We do not have statistically significant evidence at
α = 0.05 to show that there is a difference in mean calcium intake in patients with normal
bone density as compared to patients with osteopenia and osteoporosis.
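The F computation generalizes as sketched below. Only the normal-bone-density column of the study's data appears in these notes, so the other two groups here are made-up stand-ins; the resulting F illustrates the mechanics rather than reproducing the 1.26 of the worked example:

```python
import statistics

def one_way_anova_f(groups):
    """F = (SSB / (k - 1)) / (SSE / (N - k)) for k independent groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-groups sum of squares (SSB) and error sum of squares (SSE).
    ssb = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    sse = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)
    df1, df2 = k - 1, n_total - k
    return (ssb / df1) / (sse / df2), df1, df2

groups = [
    [1200, 1000, 980, 900, 750, 800],   # normal bone density (from the notes)
    [1000, 1100, 700, 800, 500, 700],   # hypothetical osteopenia values
    [890, 650, 1100, 900, 400, 350],    # hypothetical osteoporosis values
]
f, df1, df2 = one_way_anova_f(groups)
print(df1, df2, round(f, 2))  # compare F against the critical value 3.68
```

With k = 3 groups of 6 we get df1 = 2 and df2 = 15, matching the degrees of freedom used to find the critical value above.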
Analysis of covariance (ANCOVA) is a general linear model which blends ANOVA and regression.
ANCOVA evaluates whether population means of a dependent variable (DV) are equal
across levels of a categorical independent variable (IV) often called a treatment, while
statistically controlling for the effects of other continuous variables that are not of primary
interest, known as covariates (CV) or nuisance variables.
CONCLUSION
Compiled by
Dr. A. G. Vijayanarayanan
M.Com., MBA, PGDSBSA, Ph.D.,