
Chapter 2
Data Collection
- Data Vocabulary
- Level of Measurement
- Time Series and Cross-sectional Data
- Sampling Concepts
- Sampling Methods
- Data Sources
- Survey Research

Data Vocabulary
Data is the plural form of the Latin datum (a given fact). In scientific research, data arise from experiments whose results are recorded systematically. In business, data usually arise from accounting transactions or management processes. Important decisions may depend on data.
2007 The McGraw-Hill Companies, Inc. All rights reserved.

McGraw-Hill/Irwin

Data Vocabulary
Subjects, Variables, Data Sets
We will treat data as plural and refer to a data set as a particular collection of data as a whole.
- Observation: each data value.
- Subject (or individual): an item for study (e.g., an employee in your company).
- Variable: a characteristic of the subject or individual (e.g., an employee's income).

Data Vocabulary
Subjects, Variables, Data Sets
Three types of data sets:
- Univariate (one variable). Typical tasks: histograms, descriptive statistics, frequency tallies.
- Bivariate (two variables). Typical tasks: scatter plots, correlations, simple regression.
- Multivariate (more than two variables). Typical tasks: multiple regression, data mining, econometric modeling.

Data Vocabulary
Subjects, Variables, Data Sets
Consider a multivariate data set with 5 variables and 8 subjects: 5 x 8 = 40 observations.

Data Vocabulary
Data Types
A data set may have a mixture of data types.

Types of Data
- Attribute (qualitative)
  - Verbal label: X = economics (your major)
  - Coded: X = 3 (i.e., economics)
- Numerical (quantitative)
  - Discrete: X = 2 (your siblings)
  - Continuous: X = 3.15 (your GPA)

Data Vocabulary
Attribute Data
Also called categorical, nominal or qualitative data. Values are described by words rather than numbers. For example,
- Automobile style (e.g., X = full, midsize, compact, subcompact).
- Mutual fund (e.g., X = load, no-load).

Data Vocabulary
Data Coding
Coding refers to using numbers to represent categories to facilitate statistical analysis.

Coding an attribute as a number does not make the data numerical.


For example, 1 = Bachelor's, 2 = Master's, 3 = Doctorate.
Rankings may exist, for example, 1 = Liberal, 2 = Moderate, 3 = Conservative.

Data Vocabulary
Binary Data
A binary variable has only two values, 1 = presence, 0 = absence of a characteristic of interest (the codes themselves are arbitrary). For example,
- 1 = employed, 0 = not employed
- 1 = married, 0 = not married
- 1 = male, 0 = female
- 1 = female, 0 = male
The coding itself has no numerical value, so binary variables are attribute data.

Data Vocabulary
Numerical Data
Numerical or quantitative data arise from counting or some kind of mathematical operation. For example,
- Number of auto insurance claims filed in March (e.g., X = 114 claims).
- Ratio of profit to sales for last quarter (e.g., X = 0.0447).
Numerical data can be broken down into two types: discrete or continuous data.

Data Vocabulary
Discrete Data
A numerical variable with a countable number of values that can be represented by an integer (no fractional values). For example,
- Number of Medicaid patients (e.g., X = 2).
- Number of takeoffs at O'Hare (e.g., X = 37).

Data Vocabulary
Continuous Data
A numerical variable that can have any value within an interval (e.g., length, weight, time, sales, price/earnings ratios). Any continuous interval contains infinitely many possible values (e.g., 426 < X < 428).

Data Vocabulary
Rounding
Ambiguity is introduced when continuous data are rounded to whole numbers. The underlying measurement scale is continuous. Precision of measurement depends on the instrument. Sometimes discrete data are treated as continuous when the range is very large (e.g., SAT scores) and small differences (e.g., 604 or 605) aren't of much importance.

Level of Measurement
Four levels of measurement for data:
- Nominal. Characteristics: categories only. Example: eye color (blue, brown, green, hazel).
- Ordinal. Characteristics: rank has meaning. Example: bond ratings (Aaa, Aab, C, D, F, etc.).
- Interval. Characteristics: distance has meaning. Example: temperature (57° Celsius).
- Ratio. Characteristics: meaningful zero exists. Example: accounts payable ($21.7 million).

Level of Measurement
Nominal Measurement
Nominal data merely identify a category. Nominal data are qualitative, attribute, categorical or classification data (e.g., Apple, Compaq, Dell, HP). Nominal data are usually coded numerically, but the codes are arbitrary (e.g., 1 = Apple, 2 = Compaq, 3 = Dell, 4 = HP). The only permissible mathematical operations are counting (e.g., frequencies) and simple statistics.

Level of Measurement
Ordinal Measurement
Ordinal data codes can be ranked (e.g., 1 = Frequently, 2 = Sometimes, 3 = Rarely, 4 = Never).

Distance between codes is not meaningful (e.g., distance between 1 and 2, or between 2 and 3, or between 3 and 4 lacks meaning).
Many useful statistical tests exist for ordinal data. Especially useful in social science, marketing and human resource research.

Level of Measurement
Interval Measurement
Data can not only be ranked, but also have meaningful intervals between scale points (e.g., the difference between 60°F and 70°F is the same as the difference between 20°F and 30°F). Since intervals between numbers represent distances, mathematical operations can be performed (e.g., average). The zero point of interval scales is arbitrary, so ratios are not meaningful (e.g., 60°F is not twice as warm as 30°F).
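One way to see why ratios fail on an interval scale is to express the same two temperatures on a scale with a different zero point. The sketch below is only an illustration (the function name is our own): the ratio of the two temperatures depends on the scale chosen, while the difference changes only by the constant conversion factor 5/9.

```python
def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

warm_f, cool_f = 60.0, 30.0
warm_c = fahrenheit_to_celsius(warm_f)
cool_c = fahrenheit_to_celsius(cool_f)

# Differences are comparable across scales (they differ only by the factor 5/9).
diff_f = warm_f - cool_f
diff_c = warm_c - cool_c

# Ratios are not: they depend on where the arbitrary zero point sits.
ratio_f = warm_f / cool_f
ratio_c = warm_c / cool_c

print(f"Difference: {diff_f:.1f} degrees F = {diff_c:.2f} degrees C")
print(f"Ratio on the Fahrenheit scale: {ratio_f:.2f}")
print(f"Ratio on the Celsius scale:    {ratio_c:.2f}")
```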

Level of Measurement
Likert Scales
A special case of interval data frequently used in survey research. The coarseness of a Likert scale refers to the number of scale points (typically 5 or 7).
College-bound high school students should be required to study a foreign language. (check one)
Strongly Agree Somewhat Agree Neither Agree Nor Disagree Somewhat Disagree Strongly Disagree

Level of Measurement
Likert Scales
A neutral midpoint ("Neither Agree Nor Disagree") is available if an odd number of scale points is used; it can be omitted (by using an even number of points) to force the respondent to lean one way or the other. Likert data are coded numerically (e.g., 1 to 5), but any equally spaced values will work.
Likert coding, 1 to 5 scale: 5 = Help a lot, 4 = Help a little, 3 = No effect, 2 = Hurt a little, 1 = Hurt a lot
Likert coding, -2 to +2 scale: +2 = Help a lot, +1 = Help a little, 0 = No effect, -1 = Hurt a little, -2 = Hurt a lot

Level of Measurement
Likert Scales
Careful choice of verbal anchors results in measurable intervals (e.g., the distance from 1 to 2 is the same as the interval, say, from 3 to 4).

Ratios are not meaningful (e.g., here 4 is not twice 2). Many statistical calculations can be performed (e.g., averages, correlations, etc.).

Level of Measurement
Likert Scales
More variants of Likert scales:
How would you rate your marketing instructor? (check one) Terrible Poor Adequate Good Excellent

How would you rate your marketing instructor? (check one)

Very Bad

Very Good

Level of Measurement
Ambiguity
Grades are usually coded numerically (A = 4, B = 3, C = 2, D = 1, F = 0) and are used to calculate a mean GPA.

Is the interval from 3.0 to 4.0 really the same as the interval from 1.0 to 2.0? What is the underlying reality ranging from 0 to 4 that we are measuring? Best to be conservative and limit statistical tests to those for ordinal data.

Level of Measurement
Ratio Measurement
Ratio data have all properties of nominal, ordinal and interval data types and also possess a meaningful zero (absence of the quantity being measured). Because of this zero point, ratios of data values are meaningful (e.g., $20 million profit is twice as much as $10 million). Zero does not have to be observable in the data; it is an absolute reference point.

Level of Measurement
Use the following procedure to recognize data types. Ask the questions in order and stop at the first "yes":
- Q1. Is there a meaningful zero point? If yes: ratio data (all statistical operations are allowed).
- Q2. Are intervals between scale points meaningful? If yes: interval data (common statistics allowed, e.g., means and standard deviations).
- Q3. Do scale points represent rankings? If yes: ordinal data (restricted to certain types of nonparametric statistical tests).
- Q4. Are there discrete categories? If yes: nominal data (only counting allowed, e.g., finding the mode).
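The questions Q1 to Q4 can be read as a short decision procedure. A minimal sketch, assuming the questions are asked in order and the first "yes" decides the level (the function and argument names are our own, not part of any statistics package):

```python
def level_of_measurement(meaningful_zero, meaningful_intervals, ranked, categories=True):
    """Apply Q1 to Q4 in order and return the first level that fits."""
    if meaningful_zero:          # Q1
        return "ratio"
    if meaningful_intervals:     # Q2
        return "interval"
    if ranked:                   # Q3
        return "ordinal"
    if categories:               # Q4
        return "nominal"
    return "unclassified"

# The four examples from the levels-of-measurement table
examples = {
    "Accounts payable": level_of_measurement(True, True, True),
    "Temperature (Celsius)": level_of_measurement(False, True, True),
    "Bond ratings": level_of_measurement(False, False, True),
    "Eye color": level_of_measurement(False, False, False),
}
for variable, level in examples.items():
    print(f"{variable}: {level}")
```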

Level of Measurement
Changing Data by Recoding
In order to simplify data or when exact data magnitude is of little interest, ratio data can be recoded downward into ordinal or nominal measurements (but not conversely). For example, recode systolic blood pressure as normal (under 130), elevated (130 to 140), or high (over 140). The above recoded data are ordinal (ranking is preserved) but intervals are unequal and some information is lost.
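The blood pressure recoding can be sketched in a few lines. The cutoffs follow the example above (under 130, 130 to 140, over 140); the readings themselves are made-up values for illustration only.

```python
def recode_systolic(mm_hg):
    """Recode a ratio-level reading into an ordinal category."""
    if mm_hg < 130:
        return "normal"
    if mm_hg <= 140:
        return "elevated"
    return "high"

readings = [118, 130, 135, 140, 152]   # hypothetical systolic readings
categories = [recode_systolic(x) for x in readings]
print(categories)
```

The recoding cannot be undone: once 135 has become "elevated", the original magnitude is lost.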

Time Series and Cross-sectional Data


Time Series Data
Each observation in the sample represents a different equally spaced point in time (e.g., years, months, days). Periodicity may be annual, quarterly, monthly, weekly, daily, hourly, etc. We are interested in trends and patterns over time (e.g., annual growth in consumer debit card use from 1999 to 2006).

Time Series and Cross-sectional Data


Cross-sectional Data
Each observation represents a different individual unit (e.g., person) at the same point in time (e.g., monthly VISA balances). We are interested in
- variation among observations, or
- relationships.
We can combine the two data types to get pooled cross-sectional and time series data.

Sampling Concepts
Sample or Census?
A sample involves looking only at some items selected from the population. A census is an examination of all items in a defined population. Why can't the United States Census survey every person in the population?
- Mobility
- Illegal immigrants
- Budget constraints
- Incomplete responses or nonresponses

Sampling Concepts
Situations Where A Sample May Be Preferred:
- Infinite Population: No census is possible if the population is infinite or of indefinite size (an assembly line can keep producing bolts, a doctor can keep seeing more patients).
- Destructive Testing: The act of sampling may destroy or devalue the item (measuring battery life, testing auto crashworthiness, or testing aircraft turbofan engine life).
- Timely Results: Sampling may yield more timely results than a census (checking wheat samples for moisture and protein content, checking peanut butter for aflatoxin contamination).

Sampling Concepts
Situations Where A Sample May Be Preferred:
- Accuracy: Sample estimates can be more accurate than a census. Instead of spreading limited resources thinly to attempt a census, our budget of time and money might be better spent to hire experienced staff, improve training of field interviewers, and improve data safeguards.
- Cost: Even if it is feasible to take a census, the cost, either in time or money, may exceed our budget.
- Sensitive Information: Some kinds of information are better captured by a well-designed sample than by attempting a census. Confidentiality may also be improved in a carefully done sample.

Sampling Concepts
Situations Where A Census May Be Preferred
- Small Population: If the population is small, there is little reason to sample, for the effort of data collection may be only a small part of the total cost.
- Large Sample Size: If the required sample size approaches the population size, we might as well go ahead and take a census.
- Database Exists: If the data are on disk, we can examine 100% of the cases. But auditing or validating data against physical records may raise the cost.
- Legal Requirements: Banks must count all the cash in bank teller drawers at the end of each business day. The U.S. Congress forbade sampling in the 2000 decennial population census.

Sampling Concepts
Parameters and Statistics
Statistics are computed from a sample of n items, chosen from a population of N items. Statistics can be used as estimates of parameters found in the population. Symbols are used to represent population parameters and sample statistics.

Sampling Concepts
Parameters and Statistics
Parameter or Statistic?
- Parameter: Any measurement that describes an entire population. Usually, the parameter value is unknown since we rarely can observe the entire population. Parameters are often (but not always) represented by Greek letters.
- Statistic: Any measurement computed from a sample. Usually, the statistic is regarded as an estimate of a population parameter. Sample statistics are often (but not always) represented by Roman letters.

Sampling Concepts
Parameters and Statistics
The population must be carefully specified and the sample must be drawn scientifically so that the sample is representative.

Target Population
The target population is the population we are interested in (e.g., U.S. gasoline prices). The sampling frame is the group from which we take the sample (e.g., 115,000 stations). The frame should not differ from the target population.

Sampling Concepts
Finite or Infinite?
A population is finite if it has a definite size, even if its size is unknown. A population is infinite if it is of arbitrarily large size.
Rule of Thumb: A population may be treated as infinite when N is at least 20 times n (i.e., when N/n ≥ 20).

Sampling Methods
Probability Samples
- Simple Random Sample: Use random numbers to select items from a list (e.g., VISA cardholders).
- Systematic Sample: Select every kth item from a list or sequence (e.g., restaurant customers).
- Stratified Sample: Select randomly within defined strata (e.g., by age, occupation, gender).
- Cluster Sample: Like stratified sampling except strata are geographical areas (e.g., zip codes).

Sampling Methods
Nonprobability Samples
- Judgment Sample: Use expert knowledge to choose typical items (e.g., which employees to interview).
- Convenience Sample: Use a sample that happens to be available (e.g., ask co-worker opinions at lunch).

Sampling Methods
Simple Random Sample
Every item in the population of N items has the same chance of being chosen in the sample of n items.

We rely on random numbers to select a name. =RANDBETWEEN(1,48)

Sampling Methods
Random Number Tables
A table of random digits used to select random numbers between 1 and N. Each digit 0 through 9 is equally likely to be chosen.

Setting Up a Rule
For example, NilCo wants to award cash prizes to 10 of its 875 loyal customers. To get 10 three-digit numbers between 001 and 875, we define any consistent rule for moving through the random number table.

Sampling Methods
Setting Up a Rule
Randomly point at the table to choose a starting point. Choose the first three digits of the selected five-digit block, move to the right one column, down one row, and repeat. When we reach the end of a line, wrap around to the other side of the table and continue. Discard any number greater than 875 and any duplicates.

Start Here
82134 07139 45056 10244 12940 14458 16829 43939 19093 84434

Table of 1,000 Random Digits


66716 76768 31188 51678 50087 54269 11913 43272 63463 20189 31928 42434 11332 85568 58009 46241 91961 99494 70034 66972 03052 92934 19348 82811 05764 00260 18229 97076 23261 10421 32367 15595 95605 48794 36875 25783 02566 28010 63984 64964

84438 98681 43290 96893 75403

45828 67871 96753 85410 41227

40353 71735 18799 88233 00192

28925 64113 49713 22094 16814

11911 90139 39227 30605 47054

53502 33466 15955 79024 16814

24640 65312 46167 01791 81349

96880 90655 63853 38839 92264

93166 75444 03633 85531 01028

68409 30845 19990 94576 29071

78064 26246 98766 37895 27988

92111 71746 67312 51055 81163

51541 94019 96358 11929 52212

76563 93165 21351 44443 25102

69027 96713 86448 15995 61798

67718 03316 31828 72935 28670

06499 75912 86113 99631 01358

71938 86209 78868 18190 60354

17354 12081 67243 85877 74015

12680 57817 06763 31309 18556

19216 28078 34455 53510 30658

53008 86729 61363 90412 18894

44498 69438 93711 70438 88208

19262 24235 68038 45932 97867

12196 35208 75960 57815 30737

93947 48957 16327 75144 94985

90162 53529 95716 52472 18235

76337 76297 66964 61817 02178

12646 41741 28634 41562 39728

26838 54735 65015 42084 66398

Sampling Methods
With or Without Replacement
If we allow duplicates when sampling, then we are sampling with replacement. Duplicates are unlikely when n is much smaller than N. If we do not allow duplicates when sampling, then we are sampling without replacement.
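The difference between the two schemes can be sketched with Python's standard `random` module, reusing the NilCo setting (10 numbers between 1 and 875). The seed is an arbitrary choice made only so the run is repeatable.

```python
import random

rng = random.Random(7)   # arbitrary seed, for repeatability only
N, n = 875, 10

# With replacement: each draw is independent, so duplicates are possible.
with_replacement = [rng.randint(1, N) for _ in range(n)]

# Without replacement: every selected number is distinct.
without_replacement = rng.sample(range(1, N + 1), k=n)

print("With replacement:   ", sorted(with_replacement))
print("Without replacement:", sorted(without_replacement))
print("Duplicates in first sample:", n - len(set(with_replacement)))
```

`rng.randint(1, N)` includes both end points, so it plays the same role as a function that returns a random integer between 1 and N.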

Sampling Methods
Computer Methods
- Excel, Option A: Enter the Excel function =RANDBETWEEN(1,875) into 10 spreadsheet cells. Press F9 to get a new sample.
- Excel, Option B: Enter the function =INT(1+875*RAND()) into 10 spreadsheet cells. Press F9 to get a new sample.
- Internet: The web site www.random.org will give you many kinds of excellent random numbers (integers, decimals, etc.).
- Minitab: Use Minitab's Random Data menu with the Integer option.

These are pseudo-random generators because even the best algorithms eventually repeat themselves.

Using MINITAB to generate random numbers.

Sampling Methods
Row Column Data Arrays
When the data are arranged in a rectangular array, an item can be chosen at random by selecting a row and column. For example, in the 4 x 3 array, select a random column between 1 and 3 and a random row between 1 and 4. This way, each item has an equal chance of being selected.

Sampling Methods
Row Column Data Arrays
Use =RANDBETWEEN function to choose row 3 and column 3 (Target).

         Column 1                Column 2           Column 3
Row 1    Dillard's               K-Mart             Saks
Row 2    Dollar General          Kohl's             Sears Roebuck
Row 3    Federated Dept Stores   May Dept Stores    Target
Row 4    J. C Penney             Nordstrom          Wal-Mart Stores
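The row and column selection can be sketched as follows. The array layout is an assumption: the three groups of store names are taken to be the three columns, which is consistent with row 3, column 3 being Target.

```python
import random

# 4 x 3 array of retailers (rows x columns), layout assumed as described above
stores = [
    ["Dillard's", "K-Mart", "Saks"],
    ["Dollar General", "Kohl's", "Sears Roebuck"],
    ["Federated Dept Stores", "May Dept Stores", "Target"],
    ["J. C Penney", "Nordstrom", "Wal-Mart Stores"],
]

rng = random.Random(3)               # arbitrary seed, for repeatability only
row = rng.randint(1, len(stores))    # random row between 1 and 4
col = rng.randint(1, len(stores[0])) # random column between 1 and 3
chosen = stores[row - 1][col - 1]    # convert 1-based positions to 0-based indexes

print(f"Row {row}, column {col}: {chosen}")
```

Because every row has the same number of columns, each of the 12 cells is selected with probability 1/12.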

Sampling Methods
Randomizing a List
In Excel, use function =RAND() beside each row to create a column of random numbers between 0 and 1. Copy and paste these numbers into the same column using Paste Special | Values (to paste only the values and not the formulas).

Sort the spreadsheet on the random number column.

Sampling Methods
Randomizing a List
The first n items are a random sample of the entire list (they are as likely as any others).
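The same sort-on-a-random-key idea can be sketched outside a spreadsheet. The item names below are placeholders, and the seed is arbitrary.

```python
import random

rng = random.Random(11)                        # arbitrary seed, for repeatability only
names = [f"Item {i}" for i in range(1, 21)]    # placeholder list of 20 items

# Step 1: attach a random number between 0 and 1 to each item (like =RAND()).
keyed = [(rng.random(), name) for name in names]

# Step 2: sort the list on the random number column.
keyed.sort()

# Step 3: the first n items form the random sample.
n = 5
sample = [name for _, name in keyed[:n]]
print(sample)
```

Storing the random keys in a list before sorting mirrors the Paste Special | Values step: the keys are fixed and do not change during the sort.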

Sampling Methods
Systematic Sampling
Sample by choosing every kth item from a list, starting from a randomly chosen entry on the list. For example, starting at item 2, we sample every k = 4 items to obtain a sample of n = 20 items from a list of N = 78 items.

Note that N/n = 78/20 ≈ 4.

Sampling Methods
Systematic Sampling
A systematic sample of n items from a population of N items requires that periodicity k be approximately N/n. Systematic sampling should yield acceptable results unless patterns in the population happen to recur at periodicity k. Can be used with unlistable or infinite populations.

Systematic samples are well-suited to linearly organized physical populations.

Sampling Methods
Systematic Sampling
For example, out of 501 companies, we want to obtain a sample of 25. What should the periodicity k be? k = N/n = 501/25 ≈ 20.

So, we should choose every 20th company from a random starting point.
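A minimal sketch of this procedure, assuming the companies are numbered 1 to N and the random starting point is drawn from the first k positions (the function name is our own):

```python
import random

def systematic_sample(N, n, rng):
    """Return the periodicity k and up to n positions, taking every kth item."""
    k = max(1, round(N / n))          # periodicity, approximately N/n
    start = rng.randint(1, k)         # random starting point among the first k items
    # When N is not a multiple of k, some starting points yield fewer than n items.
    positions = list(range(start, N + 1, k))[:n]
    return k, positions

rng = random.Random(42)               # arbitrary seed, for repeatability only
k, positions = systematic_sample(501, 25, rng)
print("k =", k)
print("First five positions:", positions[:5])
print("Sample size:", len(positions))
```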

Sampling Methods
Stratified Sampling
Utilizes prior information about the population. Applicable when the population can be divided into relatively homogeneous subgroups of known size (strata). A simple random sample of the desired size is taken within each stratum. For example, from a population containing 55% males and 45% females, randomly sample 110 males and 90 females (n = 200) so that the sample matches the population proportions.

Sampling Methods
Stratified Sampling
Or, take a random sample of the entire population and then combine individual strata estimates using appropriate weights.
For a population with L strata, the population size N is the sum of the stratum sizes: N = N1 + N2 + ... + NL
The weight assigned to stratum j is its share of the population: wj = Nj / N
For example, take a random sample of n = 200 and then weight the responses for males by wM = .55 and for females by wF = .45.
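The weighting step can be sketched numerically. Here each weight is taken to be the stratum's share of the population, which is what the .55 and .45 in the example represent. The stratum sizes and the stratum sample means are invented for illustration.

```python
# Hypothetical stratum sizes chosen to give the 55% / 45% split in the example
stratum_sizes = {"male": 5500, "female": 4500}
N = sum(stratum_sizes.values())            # N = N1 + N2 + ... + NL

# Weight for each stratum: its share of the population
weights = {s: size / N for s, size in stratum_sizes.items()}

# Hypothetical sample means computed separately within each stratum
stratum_means = {"male": 3.2, "female": 3.8}

# Combined estimate: weighted sum of the stratum estimates
estimate = sum(weights[s] * stratum_means[s] for s in weights)

print("Weights:", weights)
print(f"Weighted estimate: {estimate:.2f}")
```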

Sampling Methods
Cluster Sample
Strata consist of geographical regions.
- One-stage cluster sampling: the sample consists of all elements in each of k randomly chosen subregions (clusters).
- Two-stage cluster sampling: first choose k subregions (clusters), then choose a random sample of elements within each cluster.

Sampling Methods
Cluster Sample
Here is an example of 4 elements sampled from each of 3 randomly chosen clusters (two-stage cluster sampling).
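Two-stage cluster sampling with these numbers (3 clusters, 4 elements per cluster) can be sketched as below. The six regions and their elements are invented placeholders, and the seed is arbitrary.

```python
import random

rng = random.Random(5)    # arbitrary seed, for repeatability only

# Hypothetical population: 6 regions (clusters) with 10 elements each
clusters = {
    f"Region {c}": [f"{c}{i}" for i in range(1, 11)]
    for c in "ABCDEF"
}

# Stage 1: choose k = 3 clusters at random
chosen_clusters = rng.sample(sorted(clusters), k=3)

# Stage 2: choose 4 elements at random within each chosen cluster
sample = {c: rng.sample(clusters[c], k=4) for c in chosen_clusters}

for cluster, elements in sample.items():
    print(cluster, elements)
```

For one-stage cluster sampling, stage 2 would simply keep every element of each chosen cluster.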

Sampling Methods
Cluster Sample
Cluster sampling is useful when
- Population frame and stratum characteristics are not readily available
- It is too expensive to obtain a simple or stratified sample
- The cost of obtaining data increases sharply with distance
- Some loss of reliability is acceptable

Sampling Methods
Judgment Sample
A nonprobability sampling method that relies on the expertise of the sampler to choose items that are representative of the population. Can be affected by subconscious bias (i.e., nonrandomness in the choice).

Quota sampling is a special kind of judgment sampling, in which the interviewer chooses a certain number of people in each category.

Sampling Methods
Convenience Sample
Take advantage of whatever sample is available at that moment. A quick way to sample.

Sample Size
Sample size depends on the inherent variability of the quantity being measured and on the desired precision of the estimate.

Data Sources
Useful Data Sources
Type of data and examples:
- U.S. general data: Statistical Abstract of the U.S.
- U.S. economic data: Economic Report of the President
- Almanacs: World Almanac, Time Almanac
- Periodicals: Economist, Business Week, Fortune
- Indexes: New York Times, Wall Street Journal
- Databases: CompuStat, Citibase, U.S. Census
- World data: CIA World Factbook
- Web: Google, Yahoo, msn

Survey Research
Basic Steps of Survey Research
Step 1: State the goals of the research
Step 2: Develop the budget (time, money, staff)
Step 3: Create a research design (target population, frame, sample size)
Step 4: Choose a survey type and method of administration

Survey Research
Basic Steps of Survey Research
Step 5: Design a data collection instrument (questionnaire)
Step 6: Pretest the survey instrument and revise as needed
Step 7: Administer the survey (follow up if needed)
Step 8: Code the data and analyze it

Survey Research
Survey Types
Type of survey: Mail
Characteristics: You need a well-targeted and current mailing list (people move a lot). Low response rates are typical and nonresponse bias is expected (nonrespondents differ from those who respond). Zip code lists (often costly) are an attractive option to define strata of similar income, education, and attitudes. To encourage participation, a cover letter should clearly explain the uses to which the data will be put. Plan for follow-up mailings.

Survey Research
Survey Types
Type of survey: Telephone
Characteristics: Random dialing yields very low response and is poorly targeted. Purchased phone lists help reach the target population, though a low response rate is still typical (disconnected phones, caller screening, answering machines, work hours, no-call lists). Other sources of nonresponse bias include the growing number of non-English speakers and distrust caused by scams and spam.

Survey Research
Survey Types
Type of survey: Interviews
Characteristics: Interviewing is expensive and time-consuming, yet trading sample size for high-quality results may still be worth it. Interviews must be carefully handled, so interviewers must be well trained, which is an added cost. But you can obtain information on complex or sensitive topics (e.g., gender discrimination in companies, birth control practices, diet and exercise habits).

Survey Research
Survey Types
Type of survey: Web
Characteristics: Web surveys are growing in popularity, but are subject to nonresponse bias because those who participate may differ from those who feel too busy, don't own computers or distrust your motives (scams and spam are again to blame). This type of survey works best when targeted to a well-defined interest group on a question of self-interest (e.g., views of CPAs on new proposed accounting rules, frequent flyer views on airline security).

Survey Research
Survey Types
Type of survey: Direct Observation
Characteristics: This can be done in a controlled setting (e.g., psychology lab) but requires informed consent, which can change behavior. Unobtrusive observation is possible in some nonlab settings (e.g., what percentage of airline passengers carry on more than two bags, what percentage of SUVs carry no passengers, what percentage of drivers wear seat belts).

Survey Research
Survey Guidelines
- Plan: What is the purpose of the survey? Consider staff expertise, needed skills, degree of precision, budget.
- Design: Invest time and money in designing the survey. Use books and references to avoid unnecessary errors.
- Quality: Take care in preparing a quality survey so that people will take you seriously.

Survey Research
Survey Guidelines
- Pilot Test: Pretest on friends or co-workers to make sure the survey is clear.
- Buy-in: Improve response rates by stating the purpose of the survey, offering a token of appreciation or paving the way with endorsements.
- Expertise: Work with a consultant early on.

Survey Research
Getting Advice
Consider hiring a consultant in the early stages. Many resources are available to help:
- The American Statistical Association
- The Research Industry Coalition
- The Council of American Survey Research Organizations

Survey Research
Questionnaire Design
Use a lot of white space in layout.

Begin with short, clear instructions. State the survey purpose.


Assure anonymity.

Instruct on how to submit the completed survey.

Survey Research
Questionnaire Design
Break survey into naturally occurring sections. Let respondents bypass sections that are not applicable (e.g., if you answered no to question 7, skip directly to Question 15). Pretest and revise as needed. Keep as short as possible.

Survey Research
Questionnaire Design
Type of question and example:
- Open-ended question: Briefly describe your job goals.
- Fill-in-the-blank: How many times did you attend formal religious services during the last year? ________ times
- Check boxes: Which of these statistics packages have you ever used? SAS, Visual Statistics, SPSS, MegaStat, Systat, Minitab

Survey Research
Questionnaire Design
Type of question and example:
- Ranked choices: Please evaluate your dining experience. Rate each of Food, Service, Ambiance, Cleanliness, and Overall as Excellent, Good, Fair, or Poor.

Survey Research
Questionnaire Design
Type of question and example:
- Pictograms: What do you think of the President's economic policies? (circle one)
- Likert scale: Statistics is a difficult subject. Strongly Agree, Slightly Agree, Neither Agree Nor Disagree, Slightly Disagree, Strongly Disagree

Survey Research
Question Wording
The way a question is asked has a profound influence on the response. For example,
1. Shall state taxes be cut?
2. Shall state taxes be cut, if it means reducing highway maintenance?
3. Shall state taxes be cut, if it means firing teachers and police?

Survey Research
Question Wording
Make sure you have covered all the possibilities. For example, Are you married? Yes / No
Overlapping classes or unclear categories are a problem. For example, How old is your father? 35 to 45, 45 to 55, 55 to 65, 65 or older

Survey Research
Coding and Data Screening
Responses are usually coded numerically (e.g., 1 = male, 2 = female). Missing values are typically denoted by special characters (e.g., blank, . or *). Discard questionnaires that are flawed or missing many responses. Watch for multiple responses, outrageous or inconsistent replies, or range answers. Follow up if necessary and always document your data-coding decisions.

Survey Research
Sources of Error
Source of error and characteristics:
- Nonresponse bias: Respondents differ from nonrespondents
- Selection bias: Self-selected respondents are atypical
- Response error: Respondents give false information
- Coverage error: Incorrect specification of frame or population
- Interviewer error: Responses influenced by interviewer
- Measurement error: Survey instrument wording is biased or unclear
- Sampling error: Random and unavoidable

Survey Research
Data File Format
Enter data into a spreadsheet or database as a flat file (n subjects x m variables matrix).

Survey Research
Advice on Copying Data
Using commas (,), dollar signs ($), or percents (%) as part of the values may result in your data being treated as text values.
A numerical variable may only contain the digits 0-9, a decimal point, and a minus sign. To avoid round-off errors, format the data column as plain numbers with the desired number of decimal places before you copy the data to a statistical package.
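When formatted values have already been copied as text, they can be converted back to plain numbers before analysis. A minimal sketch, assuming values use a comma as the thousands separator and a trailing % for percentages (the function name and the sample values are illustrative):

```python
def to_number(text):
    """Convert a formatted value such as '$1,250.50' or '12%' to a float."""
    cleaned = text.strip().replace(",", "").replace("$", "")
    is_percent = cleaned.endswith("%")
    if is_percent:
        cleaned = cleaned[:-1]
    value = float(cleaned)          # raises ValueError if other characters remain
    return value / 100 if is_percent else value

raw = ["$1,250.50", "12%", "-3.75", "21,700,000"]   # hypothetical text values
values = [to_number(x) for x in raw]
print(values)
```

Letting `float` raise an error on anything unexpected is deliberate: a value that cannot be converted is flagged rather than silently treated as text.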

Applied Statistics in Business and Economics


End of Chapter 2
