
Introduction

STATISTICS is the most important science in the whole world; for upon it depends the practical application of every science and of every art; the one science essential to all political and social administration, all education, all organization based on experience, for it only gives results of our experience. -Florence Nightingale-

Basic Concepts
The term statistics originated from the Latin word status, which means state. The original definition was the science dealing with data about the condition of a state or community.
Statistics is the branch of science that deals with the collection, presentation, organization, analysis, and interpretation of data.

Basic Concepts
The population is the collection of all elements under consideration in a statistical inquiry. The sample is a subset of a population. The variable is a characteristic or attribute of the elements in a collection that can assume different values for the different elements.
An observation is a realized value of a variable. Data is the collection of observations.

Basic Concepts
Example
Variable: Possible Observations
S = sex of a student: Male, Female
E = employment status of an employee: Permanent, Temporary, Contractual
I = monthly income of a person in pesos: greater than or equal to zero
N = number of children of a teacher: n = 0, 1, 2, 3, ...
H = height of a basketball player: h > 0 (in cm or in.)

Basic Concepts
Example
The Office of Admissions is studying the relationship between the score in the entrance examination during application and the general weighted average (GWA) upon graduation among graduates of the university from 2005 to 2010.
Population: collection of all graduates of the university from the years 2005 to 2010.
Variables of interest: score in the entrance examination and GWA.

Basic Concepts
The Department of Health is interested in determining the percentage of children below 12 years old infected by the Hepatitis B virus in Laguna in 2010.
Population: set of all children below 12 years old in Laguna in 2010.
Variable of interest: whether or not the child has ever been infected by the Hepatitis B virus.

Basic Concepts
The parameter is a summary measure describing a specific characteristic of the population. The statistic is a summary measure describing a specific characteristic of the sample.
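The distinction between a parameter and a statistic can be illustrated with a short Python sketch. The population below is made-up data (hypothetical entrance-exam scores), and the sample size of 50 is an arbitrary choice for illustration.

```python
import random
from statistics import mean

# Hypothetical population: entrance-exam scores of 1,000 graduates (made-up data).
rng = random.Random(2010)
population = [rng.randint(40, 99) for _ in range(1000)]

# Parameter: a summary measure computed from every element of the population.
parameter = mean(population)

# Statistic: the same summary measure computed from a sample (a subset).
sample = rng.sample(population, 50)
statistic = mean(sample)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

The two values will usually be close but not equal, since the sample leaves out most of the population.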

Fields of Statistics
Two Major Fields of Statistics
1. Applied Statistics is concerned with the procedures and techniques used in the collection, presentation, organization, analysis, and interpretation of data.
We study applied statistics in order to learn how to select and properly implement the most appropriate statistical methods that will provide answers to our research problem.

2. Theoretical or Mathematical Statistics is concerned with the development of the mathematical foundations of the methods used in applied statistics.
We study mathematical statistics in order to understand the rationale behind the statistical methods we use in analysis and to establish new theories that will validate the use of new statistical methods or modifications of existing statistical methods in solving problems that are more complex.

Two Major Areas of Interest in Applied Statistics


1. Descriptive Statistics includes all the techniques used in organizing, summarizing, and presenting the data on hand.
The data on hand may have come from all the elements of the population so that the analysis using descriptive statistics will allow us to describe the population. The data on hand may also come from the elements of a selected sample. In this case, the analysis using descriptive statistics will only allow us to describe the sample. The methods used in descriptive statistics will not allow us to generalize about the population using the sample data.

Two Major Areas of Interest in Applied Statistics


In descriptive statistics, we use tables and charts, and compute summary measures like averages, proportions, and percentages.
2. Inferential Statistics
In inferential statistics, we do not simply describe the sample data. Rather, we use the sample data to form conclusions about the population. Since the sample is only a subset of the population, we arrive at the conclusions about the population under conditions of uncertainty.

Two Major Areas of Interest in Applied Statistics


Example
1. A badminton player wants to know his average score for the past 10 games. (descriptive)
2. Joseph wants to determine the variability of his seven exam scores in Statistics. (descriptive)

Two Major Areas of Interest in Applied Statistics


3. Based on last year's electricity bills, Mrs. Mercado would like to forecast the average monthly electricity bill she will pay for the next year based on her average monthly bill in the past year. (inferential)
4. Efren Bata wants to estimate his chance of winning in the next World Championship game in Billiards based on his average scores last championship and the averages of the competing players. (inferential)
5. Dr. Escape wants to determine the proportion spent on transportation during the past four months using the daily records of expenditure that she keeps. (descriptive)
6. A politician wants to determine the total number of votes his rival obtained in the sample used in the exit poll. (descriptive)

Steps in a Statistical Inquiry


A statistical inquiry is a designed research that provides information needed to solve a research problem. Such an inquiry may aim to:

1. Describe the characteristic of the elements in the population under study through the computation or estimation of a parameter such as the proportion, total, and average.
2. Compare the characteristics of the elements in the different subgroups in the population through contrasts of their respective summary measures.
3. Justify an assertion made by the researcher about a particular characteristic of the population or subgroups in the population.
4. Determine the nature and strength of relationships among the different variables of interest.
5. Identify the different groups of interrelated variables under study.

Steps in a Statistical Inquiry


6. Reveal the natural groupings of the elements in the population based on the values of a set of variables.
7. Determine the effects of one or more variables on a response variable.
8. Clarify patterns and trends in the values of a variable over time or space.
9. Predict the value of a variable based upon its relationship with another variable.
10. Forecast future values of a variable using a sequence of observations on the same variable taken over time.

Basic Steps in Performing a Statistical Inquiry


1. Identify the problem. The researchers need to define and state the problem in a clear manner so that they can arrive at appropriate solutions and recommendations later on.
2. Plan the study. Some statistical inquiries do not reach completion or do not succeed in arriving at useful information for sound decision making because of the researchers' failure to plan the study carefully.
3. Collect the data. The investigators take extra measures to ensure the quality of the data collected. If the collected data are incomplete, outdated, inaccurate, or worse yet, fabricated, then it is useless to proceed with data analysis. There are different ways of collecting data: surveys, observation, experiments, and use of available documented data.

Basic Steps in Performing a Statistical Inquiry


4. Explore the data. Prior to data analysis, the investigators need to explore and understand the essential features of their data. This process allows them to determine if their data satisfy the assumptions made in the derivation of the statistical technique that they will use for analysis.
5. Analyze data and interpret the results. After collecting and organizing data, analysis follows. The investigators once more carry out the plans specified in the research design, but this time on data analysis. They then examine all the results on tables, charts, estimated summary measures, and tests of hypotheses. They need to check that they were able to meet all of the specific objectives. Based on the analysis carried out, the investigators must be able to answer the research problem and give recommendations on how this can be useful in decision making. The investigators must double-check the results that contradict existing theory or the earlier hypothesis made. They may have committed errors in data collection or analysis. If not, they would have to propose possible explanations for these results or suggest future statistical inquiries that could help explain the inconsistency.

Basic Steps in Performing a Statistical Inquiry


6. Present the results. After analyzing the data and interpreting the results, the investigators must present these results in a clear and concise manner to the users of the research.

Measurement
Measurement is the process of determining the value or label of the variable based on what has been observed.

Ratio level of measurement has all of the following properties:
A) the numbers in the system are used to classify a person/object into distinct, nonoverlapping, and exhaustive categories;
B) the system arranges the categories according to magnitude;
C) the system has a fixed unit of measurement representing a set size throughout the scale; and
D) the system has an absolute zero.
Examples:
1. allowance of a student (in pesos)
2. distance travelled by an airplane (in km)
3. speed of a car (in km/hr)
4. height of an adult (in cm)
5. weight of a newborn baby (in kg)

Interval level of measurement satisfies only the first three properties of the ratio level. The only difference between the interval and the ratio levels is the interpretation of the value 0 in their scales. The zero point in the interval level is not an absolute zero. Unlike in the ratio scale, the zero value in the interval scale has an arbitrary interpretation and does not mean the absence of the property we are measuring.
Examples: temperature readings measured in degrees Centigrade (°C), Intelligence Quotient (IQ)
If the temperature is 0 °C, do we say that there is no temperature? Of course not, since 0 °C is not an absolute zero.

Ordinal level of measurement satisfies only the first two properties of the ratio level.
Examples:
1. Size of shirt (small, medium, large, extra large)
2. Performance rating of a salesperson measured as follows: excellent, very good, good, satisfactory, poor
3. Faculty rank: Professor, Associate Professor, Assistant Professor, Instructor

Nominal level of measurement satisfies only the first property of the ratio level.
Examples:
1. Gender
2. Civil status
3. Type of movie (Action, Romance, Comedy, Others)
4. Major island group (Luzon, Visayas, Mindanao)

Data Collection Methods


1. Use of available documented data in published or unpublished studies
2. Surveys
3. Experiments
4. Observations

Collection of Data
Primary data are data documented by the primary source. The data collectors themselves documented the data.
Secondary data are data documented by a secondary source. An individual or agency other than the data collectors documented the data.

Collection of Data
A survey is a method of collecting data on the variable of interest by asking people questions. When the data come from asking all the people in the population, the study is called a census. On the other hand, when the data come from asking a sample of people selected from a well-defined population, the study is called a sample survey.

Collection of Data
Experiment is a method of collecting data where there is direct human intervention on the conditions that may affect the values of the variable of interest.

Collection of Data
Basic Steps in Conducting an Experiment
1. Specify the response variable and the explanatory variables.
2. Identify possible extraneous variables.
3. Determine how to control the extraneous variables that were identified in step 2.
4. Assign the treatment at random to each subject and apply the assigned treatment.
5. Measure the response variable for each subject at the end of the experiment.
6. Analyze the data.
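The random assignment in step 4 can be sketched in Python as follows. This is only an illustration: the subject labels, the two treatments "A" and "B", and the helper name `assign_treatments` are all hypothetical, and balanced group sizes are an assumption rather than a requirement stated above.

```python
import random

def assign_treatments(subjects, treatments, rng):
    """Randomly assign treatments so that group sizes are as equal as possible."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)  # random order removes any systematic pattern
    # Deal the treatments out in turn over the shuffled subjects.
    return {s: treatments[i % len(treatments)] for i, s in enumerate(shuffled)}

rng = random.Random(7)
subjects = [f"patient_{i}" for i in range(1, 11)]   # hypothetical subjects
assignment = assign_treatments(subjects, ["A", "B"], rng)
for subject in subjects:
    print(subject, "->", assignment[subject])
```

Because the order is shuffled before treatments are dealt out, no characteristic of the subjects can influence which treatment each one receives.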

Collection of Data
Observation method is a method of collecting data on the phenomenon of interest by recording the observations made about the phenomenon as it actually happens.

Collection of Data
Examples:
1) A local TV network asked voters to indicate whom they voted for as they exited the polling booth. (Survey)
2) A private hospital divides terminally ill patients into two groups, with one group receiving medication A and the other group receiving medication B. After a month, they measured each subject's improvement. (Experiment)
3) A researcher investigates the level of pollution in key points in Metro Manila by setting up pollution measuring devices at selected intersections. (Observation)

Collection of Data
A questionnaire is a measurement instrument used in various data collection methods, particularly surveys. We use a questionnaire to determine and record the measurements of characteristics of the elements in a study such as height, weight, color, size, attitude, past and present behavior, and opinions.
Two Types of Questionnaire Used in Surveys
1. Self-Administered Questionnaire
2. Interview Schedule

Collection of Data
Types of Questions
1. A closed-ended question is a type of question that includes a list of response categories from which the respondent will select his or her answer.
2. An open-ended question is a type of question that does not include response categories.

Collection of Data
Advantages
Open-ended:
1. Respondent can freely answer
2. Can elicit feelings and emotions of the respondent
3. Can reveal new ideas and views that the researcher might not have considered
4. Good for complex issues
5. Good for questions whose possible responses are unknown
6. Allows respondent to clarify answers
7. Gets detailed answers
8. Shows how respondents think
Closed-ended:
1. Facilitates tabulation of responses
2. Easy to code and analyze
3. Saves time and money
4. High response rate since it is simple and quick to answer
5. Response categories make questions easy to understand
6. Can repeat the study and easily make comparisons

Collection of Data
Disadvantages
Open-ended:
1. Difficult to tabulate and code
2. High refusal rate because it requires more time and effort from the respondent
3. Respondent needs to be articulate
4. Responses can be inappropriate or vague
5. May threaten respondent
6. Responses have different levels of detail
Closed-ended:
1. Increases respondent burden when there are too many or too limited response categories
2. Biases responses against categories excluded from the list of choices
3. Difficult to detect if respondent misinterpreted the question

Collection of Data
Pitfalls To Avoid in Wording Questions
1. Avoid vague questions
2. Avoid biased questions
3. Avoid confidential and sensitive questions
4. Avoid questions that are difficult to answer
5. Avoid questions that are confusing or perplexing to answer
6. Keep the question short and simple

Sampling and Sampling Techniques


Advantages of Sampling
1. Sampling is more economical.
2. A study based on a sample requires less time to accomplish.
3. Sampling allows for a wider scope for the study.
4. Results of studies based on a sample can even be more accurate.
5. Sampling is sometimes the only feasible method.

Sampling and Sampling Techniques


Target population is the population we want to study.
Sampled population is the population from where we actually select the sample.
Elementary unit or element is a member of the population whose measurement on the variable of interest is what we wish to examine.
Sampling unit is a unit of the population that we select in our sample.

Sampling and Sampling Techniques


Example Title of the Study: Monthly Integrated Survey of Selected Industries
This survey provides planners and policy makers in both government and private sectors with indicators on the performance of growth-oriented industries in the agro-based, metallic mining, and manufacturing sectors.

Sampling and Sampling Techniques


Target population
set of all establishments in the manufacturing, mining, and agriculture industries

Elementary unit
an establishment (which is an economic unit that engages, under a single ownership or control, in one or predominantly one kind of economic activity at a single fixed physical location) in the manufacturing, mining, and agricultural industries.

Sampling unit
an enterprise (which is an economic unit consisting of one or more establishments under a single ownership or control) in the manufacturing, mining, and agricultural industries.

Sampling and Sampling Techniques


Sampling frame or frame is a list or map showing all the sampling units in the population.
Example
Suppose a researcher is interested in getting the opinion of eligible voters on the media campaign of candidates running for top positions in the government.
Target population: set of all eligible voters
Sampling frame: Commission on Elections (COMELEC) list of registered voters
Sampled population: set of registered voters in the list of COMELEC
This sampled population excludes the eligible voters who did not register or did not revalidate their registration with the COMELEC during the registration period.

Sampling and Sampling Techniques


Sampling Error is the error attributed to the variation present among the computed values of the statistic from the different possible samples consisting of n elements. Nonsampling Error is the error from the other sources apart from sampling fluctuations.
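To make the idea of sampling error concrete, the following Python sketch enumerates every possible sample of n = 2 elements from a tiny, made-up population of N = 5 values and shows how the sample mean varies from one sample to another. The data are hypothetical and chosen only to keep the enumeration short.

```python
from itertools import combinations
from statistics import mean

population = [2, 4, 6, 8, 10]   # hypothetical population, N = 5
n = 2                           # sample size

mu = mean(population)           # the parameter (population mean)

# Every possible sample of n distinct elements, and the statistic for each.
samples = list(combinations(population, n))
sample_means = [mean(s) for s in samples]

for s, m in zip(samples, sample_means):
    print(f"sample {s}: mean = {m}, deviation from population mean = {m - mu}")
```

The deviations differ from sample to sample even though no measurement mistake was made; this variation among the possible values of the statistic is what the definition above attributes to sampling.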

Sampling and Sampling Techniques


Sampling error occurs when we collect data from a sample and not from all the elements in the population. It is an error innate in results based on a sample.
Classifications of Nonsampling Error
1. Measurement error
2. Error in the implementation of the sampling design

Sampling and Sampling Techniques


Measurement error is the difference between the true value of the variable and the observed value used in the study. This occurs when we are using a faulty measurement instrument or when we do not use the instrument properly.
Error in the implementation of the sampling design occurs when we do not adhere to the procedures and requirements as specified in the sampling design.

Sampling and Sampling Techniques


[Diagram: components of total error]
Total Error
1. Sampling Error
2. Nonsampling Error
   a. Error in the Implementation of the Sampling Design
   b. Measurement Error
Specific error types labeled in the diagram: Instrument Error, Selection Error, Frame Error, Population Specification Error, Response Error, Processing Error, Interviewer Bias, Surrogate Information Error, Response Bias, Nonresponse Bias

Sampling and Sampling Techniques


Measurement Errors
1. Interviewer bias may occur when an enumerator reacts to a respondent's reply.
2. Errors in editing and coding.
3. Bias occurs when a respondent tends to respond to items in an acceptable manner instead of truthfully.
4. Errors in conversion from one unit of measurement to another.
5. Response set occurs when a respondent agrees with all the statements without careful consideration of each one of the given statements.
6. Faulty measurement devices, such as a weighing scale that is not properly calibrated.

Sampling and Sampling Techniques


Errors in the Implementation of the Sampling Design
1. Sampling frame defines a sampled population that is too far from the target population.
2. Sampling frame is outdated.
3. Complicated sample selection procedure is done in the field by confused enumerators who incorrectly select the respondents included in the sample.
4. Lazy enumerators do not follow the specified sample selection procedure.
5. Target population is the target consumers of a particular brand but researchers incorrectly define the qualifications of the target consumers.

Sampling and Sampling Techniques


Probability Sampling is a method of selecting a sample wherein each element in the population has a known, nonzero chance of being included in the sample; otherwise, it is nonprobability sampling.

Sampling and Sampling Techniques


Probability Sampling Methods
1. Simple Random Sampling
2. Stratified Sampling
3. Systematic Sampling
4. Cluster Sampling
5. Multistage Sampling

Probability Sampling Methods


Simple random sampling is a probability sampling method wherein all possible subsets consisting of n elements selected from the N elements of the population have the same chance of selection. In simple random sampling without replacement (SRSWOR), all the n elements in the sample must be distinct from each other. In simple random sampling with replacement (SRSWR), the n elements in the sample need not be distinct; that is, an element can be selected more than once to be a part of the sample.
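A minimal Python sketch of the two variants, assuming the elements have been numbered from 1 to N. The helper names `srswor` and `srswr` are hypothetical; `random.sample` and `random.choices` from the standard library serve as the randomization mechanism.

```python
import random

def srswor(elements, n, rng):
    """Simple random sampling without replacement: n distinct elements."""
    return rng.sample(elements, n)

def srswr(elements, n, rng):
    """Simple random sampling with replacement: an element may be drawn again."""
    return rng.choices(elements, k=n)

rng = random.Random(42)
population = list(range(1, 21))   # elements numbered 1 to N, with N = 20

without_replacement = srswor(population, 5, rng)
with_replacement = srswr(population, 5, rng)

print("SRSWOR:", without_replacement)   # always 5 distinct numbers
print("SRSWR: ", with_replacement)      # 5 numbers, repeats possible
```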

Probability Sampling Methods


Stratified sampling is a probability sampling method where we divide the population into nonoverlapping subpopulations or strata, and then select one sample from each stratum. The sample consists of all the samples in the different strata.
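As a rough illustration, stratified sampling might be sketched in Python as below. The population of 100 numbered elements, the island-group strata, and the per-stratum sample sizes are all made up for the example; a simple random sample is drawn within each stratum.

```python
import random
from collections import defaultdict

def stratified_sample(elements, stratum_of, n_per_stratum, rng):
    """Draw a simple random sample from each nonoverlapping stratum."""
    strata = defaultdict(list)
    for element in elements:
        strata[stratum_of(element)].append(element)
    sample = []
    for label in sorted(strata):
        sample.extend(rng.sample(strata[label], n_per_stratum[label]))
    return sample

# Hypothetical population: (id, island group) pairs.
population = [
    (i, "Luzon" if i <= 60 else "Visayas" if i <= 85 else "Mindanao")
    for i in range(1, 101)
]
sizes = {"Luzon": 6, "Visayas": 3, "Mindanao": 2}   # assumed allocation

rng = random.Random(1)
sample = stratified_sample(population, lambda e: e[1], sizes, rng)
print(sample)
```

Every stratum is guaranteed to appear in the final sample, which a simple random sample of the same total size would not guarantee.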

Probability Sampling Methods


Systematic sampling is a probability sampling method wherein the selection of the first element is at random and the selection of the other elements in the sample is systematic, by subsequently taking every kth element from the random start, where k is the sampling interval.
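One common way to carry this out, sketched here in Python under the assumption that the elements are numbered 1 to N and that the random start is drawn from the first k elements (the function name `systematic_sample` is hypothetical):

```python
import random

def systematic_sample(N, k, rng):
    """Pick a random start r in 1..k, then take r, r + k, r + 2k, ... up to N."""
    start = rng.randint(1, k)             # randint includes both endpoints
    return list(range(start, N + 1, k))

rng = random.Random(3)
selected = systematic_sample(N=100, k=10, rng=rng)
print(selected)
```

With N = 100 and k = 10, the sample always has 10 elements spread evenly over the list, whatever the random start turns out to be.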

Probability Sampling Methods


Cluster sampling is a probability sampling method wherein we divide the population into nonoverlapping groups or clusters consisting of one or more elements, and then select a sample of clusters. The sample will consist of all the elements in the selected clusters.
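A small Python sketch of cluster sampling, assuming the clusters are selected by simple random sampling. The barangay names and household labels are invented for illustration.

```python
import random

def cluster_sample(clusters, m, rng):
    """Select m clusters at random and keep every element in them."""
    chosen = rng.sample(sorted(clusters), m)
    elements = [e for label in chosen for e in clusters[label]]
    return chosen, elements

# Hypothetical clusters: households grouped by barangay.
clusters = {
    "Barangay A": ["A1", "A2", "A3"],
    "Barangay B": ["B1", "B2"],
    "Barangay C": ["C1", "C2", "C3", "C4"],
    "Barangay D": ["D1", "D2", "D3"],
}

rng = random.Random(5)
chosen, elements = cluster_sample(clusters, 2, rng)
print("selected clusters:", chosen)
print("sample elements:  ", elements)
```

Only a list of clusters is needed to draw the sample; the elements are listed only for the clusters that were selected.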

Probability Sampling Methods


Multistage sampling is a probability sampling method where there is a hierarchical configuration of sampling units and we select a sample of these units in stages.
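As one possible illustration, a two-stage design can be sketched in Python: first select a sample of primary units, then select a sample of elements within each selected primary unit. Using simple random sampling at both stages, and the school and student labels below, are assumptions made only for this example.

```python
import random

def two_stage_sample(primary_units, m, n, rng):
    """Stage 1: select m primary units. Stage 2: select n elements in each."""
    stage_one = rng.sample(sorted(primary_units), m)
    return {unit: rng.sample(primary_units[unit], n) for unit in stage_one}

# Hypothetical hierarchy: schools (primary units) containing students.
schools = {
    f"School {s}": [f"{s}-{i}" for i in range(1, 11)]
    for s in "ABCDE"
}

rng = random.Random(11)
sample = two_stage_sample(schools, m=2, n=3, rng=rng)
for school, students in sample.items():
    print(school, students)
```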

Basic Methods of Probability Sampling


Method 1. Simple Random Sampling
Procedure: List the elements and number them from 1 to N. Select n numbers from 1 to N using a randomization mechanism. The sample will consist of the elements corresponding to the numbers selected.
Advantages: Design is simple and easy to understand. Estimation methods are simple and easy.
Disadvantages: It needs a list of all elements in the population. Sample size must be very large for heterogeneous populations in order to get reliable results. High transportation cost if elements are widely spread geographically.
When to use: If the elements are homogeneous with respect to the characteristic under study. If the elements are not so spread out geographically.

Basic Methods of Probability Sampling


Method 2. Stratified Random Sampling
Procedure: Divide the population into nonoverlapping strata. Obtain a simple random sample from each stratum. The sample consists of the selected samples in all the strata.
Advantages: Estimates are more reliable compared to SRS of the same sample size if the population has been divided into strata with homogeneous elements, but the strata are very different from each other. Estimation of the parameter for each subpopulation is easier when compared to other sampling methods. It can facilitate the administration and supervision of data collection, especially when the stratification variable is a geographic subdivision.
Disadvantages: It needs a list of all elements of the population, including their values of the stratification variable. High transportation cost if elements are widely spread geographically, unless there are field offices in each geographic area.
When to use: If the population is heterogeneous with respect to the characteristic under study. If we want to perform separate analysis for certain subpopulations. If we wish to facilitate the administration of the collection of data.

Basic Methods of Probability Sampling


Method 3. Systematic Sampling
Procedure: Assign a unique number from 1 to N to each element of the population. Determine the sampling interval, k. Obtain the first element in the sample using a randomization mechanism. Get the rest of the elements in the sample by taking every kth element from the random start.
Advantages: Identifying the units in the sample is easy. The design does not require a list of all elements in the population. The sample is distributed evenly over the entire population. It gives more reliable estimates than simple random sampling when the arrangement of the elements in the sampling frame is according to magnitude.
Disadvantages: Estimates may not be reliable when there are periodic regularities in the list. It requires information on the arrangement of the elements in the sampling frame to determine the reliability of the estimates.
When to use: If there is no available list of elements in the population. If the arrangement of the elements in the sampling frame is according to magnitude.

Basic Methods of Probability Sampling


Method 4. Cluster Sampling
Procedure: Divide the population into nonoverlapping clusters. Select a sample of clusters using simple random sampling. The sample consists of all the elements in the selected clusters.
Advantages: The design needs only a list of clusters and not a list of elements. Transportation and listing costs are usually lower.
Disadvantages: Estimates are usually less reliable when compared to other sampling designs. It is not cost-efficient if the clusters are large and the elements are homogeneous with respect to the characteristic under study.
When to use: If there is no available list of elements. If cost is more important than reliability of the estimates.

Method 5. Multistage Sampling
Procedure: Select the sample in several stages.
Advantages: Reduced transportation and listing cost.
Disadvantages: Difficult estimation procedures. The design needs thorough planning before performing sample selection.
When to use: If the geographic coverage of the population of interest is wide. If no listing of the elementary units in the population is available.

Methods of Nonprobability Sampling


Nonprobability sampling methods do not make use of any randomization mechanism in identifying the sampling units included in the sample. Rather, they allow the researcher to choose the units in the sample subjectively. Since the selection of the sample is subjective, there is consequently no objective way of assessing the reliability of the results without making assumptions that are oftentimes difficult to verify.

Methods of Nonprobability Sampling


Haphazard or Convenience Sampling
The sample consists of elements that are most accessible or easiest to contact. This usually includes friends, acquaintances, volunteers, and subjects who are available and willing to participate at the time of the study.
Example: The adviser of a student organization is conducting research on the study habits of students in the university. To select a sample, the adviser includes the members of the student organization because it is easy to reach them and get data from them.

Methods of Nonprobability Sampling


Judgement or Purposive Sampling
The researcher chooses a sample that agrees with his/her subjective judgement of a representative sample.

Methods of Nonprobability Sampling


Quota Sampling
Quota sampling is the nonprobability sampling version of stratified sampling. In quota sampling, the researcher also chooses the groupings or strata in the study, but the selection of the sampling units within each stratum does not make use of a probability sampling method. The researcher just sets a quota, or number of sampling units to be included in each grouping, and uses convenience sampling to select the units within each grouping.

Sampling and Sampling Techniques


Sample Size Determination
An important component of the sampling design is the sample size. The number of elements that you include in the sample must not be too small because this will not allow you to come up with reliable estimates. In determining the sample size, you should always consider the reliability of the results of the study and, at the same time, the cost involved in doing the study.

Presentation of Data
Textual Presentation
Textual presentation of data incorporates important figures in a paragraph of text.

Tabular Presentation
Tabular presentation of data arranges figures in a systematic manner in rows and columns.

Graphical Presentation
Graphical presentation of data portrays numerical figures or relationships among variables in pictorial form.

Organization of Data
Raw Data are data in their original form. Array is an ordered arrangement of data according to magnitude. We also refer to the array as sorted data or ordered data.

Organization of Data
Frequency distribution is a way of summarizing data by showing the number of observations that belong in the different categories or classes. We also refer to this as grouped data.

Frequency Distribution
Three frequency distributions of the same 210 final grades, using different class intervals:

Final Grade    No. of Students
40-49          8
50-59          23
60-69          42
70-79          62
80-89          58
90-99          17
Total          210

Final Grade    No. of Students
40-46          7
47-53          9
54-60          18
61-67          30
68-74          41
75-81          48
82-88          39
89-95          13
96-102         5
Total          210

Final Grade    No. of Students
40-45          7
46-51          6
52-57          10
58-63          24
64-69          26
70-75          35
76-81          45
82-87          34
88-93          13
94-99          10
Total          210

Frequency Distribution
Class interval is the range of values that belong in the class or category.
Class frequency is the number of observations that belong in a class interval.
Class limits are the end numbers used to define the class interval. The lower class limit (LCL) is the lower end number while the upper class limit (UCL) is the upper end number.
Class boundaries are the true limits. If the observations are rounded figures, then we identify the class boundaries based on the standard rules of rounding as follows: the lower class boundary (LCB) is halfway between the lower class limit of the class and the upper class limit of the preceding class, while the upper class boundary (UCB) is halfway between the upper class limit of the class and the lower class limit of the next class.

Frequency Distribution
Class Size is the size of the class interval. It is the difference between the upper class boundaries of the class and the preceding class; or the difference between the lower class boundaries of the next class and the class. We can also use the class limits in place of the class boundaries. Class Mark is the midpoint of a class interval. It is the average of the lower class limit and the upper class limit.

Frequency Distribution
Final Grade    No. of Students
40-49          8
50-59          23
60-69          42
70-79          62
80-89          58
90-99          17
Total          210

For the first class interval, 40-49:
Class frequency: 8
Class limits: 40 is the lower class limit while 49 is the upper class limit
Class boundaries: 39.5-49.5 (lower and upper class boundaries)
Class size: 50 - 40 = 10
Class mark or midpoint: (40 + 49)/2 = 44.5
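The quantities in this example can be computed mechanically. The Python sketch below is one way to do it, assuming at least two consecutive classes of equal size listed in increasing order; the helper name `describe_classes` is hypothetical.

```python
def describe_classes(class_limits):
    """Return boundaries, size, and mark for consecutive (LCL, UCL) pairs."""
    # Gap between one class's UCL and the next class's LCL (1 for whole numbers).
    gap = class_limits[1][0] - class_limits[0][1]
    # Class size: difference between the lower limits of consecutive classes.
    size = class_limits[1][0] - class_limits[0][0]
    rows = []
    for lcl, ucl in class_limits:
        rows.append({
            "limits": (lcl, ucl),
            "boundaries": (lcl - gap / 2, ucl + gap / 2),  # halfway points
            "size": size,
            "mark": (lcl + ucl) / 2,                       # midpoint of the class
        })
    return rows

limits = [(40, 49), (50, 59), (60, 69), (70, 79), (80, 89), (90, 99)]
table = describe_classes(limits)
for row in table:
    print(row)
```

For the class 40-49 this gives boundaries 39.5 and 49.5, a class size of 10, and a class mark of 44.5, in agreement with the worked values.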

Frequency Distribution
Steps in the Construction of a Frequency Distribution
1. Determine the adequate number of classes, K. This is usually between 5 and 20, or K = 1 + 3.322 log n (Sturges's rule), where log is the ordinary logarithm (base 10) and n is the number of observations.
2. Determine the range, R = highest observed value - lowest observed value.
3. Compute C' = R/K.
4. Determine the class size, C, by rounding off C' to a convenient number.
5. Choose the lower class limit of the first class. This is usually based on the lowest observed value.
6. Tally all the observed values in each class interval.
7. Sum the frequency column and check against the total number of observations.
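The construction steps can be sketched in Python as follows. The 15 grades are made-up whole-number data, the function names are hypothetical, and the choices of class size 10 and first lower class limit 40 are judgment calls of the kind described in steps 4 and 5.

```python
import math

def sturges_classes(n):
    """Suggested number of classes K = 1 + 3.322 log10(n) (Sturges's rule)."""
    return 1 + 3.322 * math.log10(n)

def frequency_distribution(data, first_lcl, class_size):
    """Tally whole-number observations into classes of equal size."""
    highest = max(data)
    counts = {}
    lcl = first_lcl
    while lcl <= highest:                      # create classes up to the maximum
        counts[(lcl, lcl + class_size - 1)] = 0
        lcl += class_size
    for x in data:                             # assumes every x >= first_lcl
        lcl = first_lcl + ((x - first_lcl) // class_size) * class_size
        counts[(lcl, lcl + class_size - 1)] += 1
    return counts

grades = [42, 47, 55, 58, 61, 63, 66, 70, 74, 75, 78, 81, 85, 88, 93]

K = sturges_classes(len(grades))               # step 1
R = max(grades) - min(grades)                  # step 2
C_prime = R / K                                # step 3
print(f"K = {K:.2f}, R = {R}, C' = {C_prime:.2f}")

table = frequency_distribution(grades, first_lcl=40, class_size=10)  # steps 4-6
for (lcl, ucl), freq in table.items():
    print(f"{lcl}-{ucl}: {freq}")
print("Total:", sum(table.values()))           # step 7: should equal len(grades)
```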

Frequency Distribution
Cumulative frequency distribution is another variation of the frequency distribution. We use this to determine how many observations have values smaller than or greater than a specified class boundary. It shows the accumulated frequencies of successive classes, either at the beginning or at the end of the distribution.
Less than cumulative frequency distribution (<CFD) shows the number of observations with values smaller than or equal to the upper class boundary.
Greater than cumulative frequency distribution (>CFD) shows the number of observations with values larger than or equal to the lower class boundary.

Frequency Distribution
Final Grade    No. of Students    <CFD    >CFD
40-49          8                  8       210
50-59          23                 31      202
60-69          42                 73      179
70-79          62                 135     137
80-89          58                 193     75
90-99          17                 210     17
Total          210
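The two cumulative columns can be obtained from the class frequencies with running totals. A short Python sketch using `itertools.accumulate` from the standard library:

```python
from itertools import accumulate

frequencies = [8, 23, 42, 62, 58, 17]   # class frequencies for 40-49 ... 90-99

# <CFD: running total from the first class downward.
less_than = list(accumulate(frequencies))

# >CFD: running total from the last class upward, then put back in class order.
greater_than = list(accumulate(reversed(frequencies)))[::-1]

print("<CFD:", less_than)
print(">CFD:", greater_than)
```

The last <CFD entry and the first >CFD entry both equal the total number of observations, which is a quick check on the arithmetic.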

Graphical Presentation of the Frequency Distribution


1. Frequency Histogram
2. Frequency Polygon
3. Less Than Ogive
4. Greater Than Ogive
5. Pie Chart

Measures of Central Tendency


Measures of Central Tendency, or "location", attempt to quantify what we mean by the "typical" or "average" score in a data set. The concept is extremely important and we encounter it frequently in daily life. For example, before purchasing a car we often want to know its average distance per litre of petrol. Before accepting a job, you might want to know the typical salary for people in that position so that you can judge whether you will be paid what you are worth. Or, if you are a smoker, you might think about how many cigarettes you smoke "on average" per day.

Statistics geared toward measuring central tendency all focus on this concept of "typical" or "average". As we will see, we often ask questions in psychological science about how groups differ from each other "on average". Answers to such questions tell us a lot about the phenomenon or process we are studying.

We also use measures of central tendency to facilitate the comparison of two or more data sets. For example, a teacher may want to answer the question, "Who performed better in the exam, the girls or the boys?" The teacher can compare the average score of the girls with the average score of the boys. If the average score of the girls is higher, the teacher can conclude that the girls generally performed better than the boys in the exam.

Measures of Central Tendency


The Arithmetic Mean
The arithmetic mean, or simply the mean, is the most common type of average. It is the sum of all the observed values divided by the number of observations.

Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$
Sample mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$

Measures of Central Tendency


The Weighted Mean
When all the individual observed values have equal importance, we compute the arithmetic mean. If, on the other hand, the individual observed values vary in their degree of importance, it is advisable to use a modification of the mean called the weighted mean. The Weighted Mean assigns weights to the observations according to their relative importance.
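As an illustration, the weighted mean is the sum of each value times its weight, divided by the sum of the weights. The sketch below uses made-up course grades weighted by made-up course units, in the spirit of a general weighted average; the function name is illustrative.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


grades = [1.5, 2.0, 1.75]   # hypothetical course grades
units = [3, 5, 2]           # hypothetical weights: units per course

print(weighted_mean(grades, units))        # weighted by units
print(sum(grades) / len(grades))           # ordinary mean, equal weights
```

The five-unit course pulls the weighted mean toward its grade of 2.0, whereas the ordinary mean treats all three courses alike.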

Measures of Central Tendency


The Trimmed Mean
We have noted that the mean may not be a good measure of central tendency whenever there are outliers. An outlier pulls the value of the mean in its direction and away from the bulk of the observations. A modification of the arithmetic mean, called the Trimmed Mean, addresses this problem: it is the mean of the observations that remain after a fixed percentage of the smallest and largest values has been removed.
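A minimal sketch of one common way to trim: sort the data, drop the same proportion of observations from each end, and average what remains. The function name, the 10% proportion, and the data are assumptions chosen for illustration.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the observations from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)     # observations to drop per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


data = [2, 4, 5, 5, 6, 6, 7, 8, 9, 48]     # 48 is an outlier

print(sum(data) / len(data))               # ordinary mean
print(trimmed_mean(data, 0.10))            # 10% trimmed mean
```

Here the ordinary mean is 10.0, larger than nine of the ten observations, while the 10% trimmed mean is 6.25, much closer to where most of the data lie.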

Measures of Central Tendency


The Median
The median divides an ordered set of observations into two equal parts. In other words, it is the measure occupying the positional center of the array. If an observation is smaller than the median, it belongs to the lower half of the array; if an observation is larger than the median, it belongs to the upper half of the array.
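The positional definition translates directly into code. The sketch below assumes the usual convention for an even number of observations, namely averaging the two middle values; the function name is illustrative.

```python
def median_of(values):
    """Middle value of the sorted data (average of the two middle values
    when the number of observations is even, by the usual convention)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_of([7, 3, 9, 1, 5]))   # odd number of observations
print(median_of([7, 3, 9, 1]))      # even number of observations
```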

Measures of Central Tendency


The Mode
The mode is the most frequently observed value in the data set, that is, the observed value that occurs the greatest number of times. If the data set is small, we can easily see the mode by inspection. As the data set becomes large, however, finding the mode is quite tedious without a computer. Generally, the mode is a less popular measure of central tendency than the mean and the median.
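Counting occurrences is easy to automate. The sketch below returns every value tied for the highest frequency, and treats data in which all distinct values occur equally often as having no mode; that last rule is one common convention, assumed here rather than taken from the slides. The function name is illustrative.

```python
from collections import Counter


def modes(values):
    """Return all values that occur most often, or [] if there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # Assumed convention: if several distinct values all occur equally
    # often, no value stands out, so the data set has no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return [value for value, c in counts.items() if c == highest]


print(modes(["A", "B", "A", "O", "A"]))   # works for nominal data too
print(modes([1, 2, 2, 3, 3]))             # two modes
print(modes([1, 2, 3]))                   # no mode
```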

Summary of the Different Measures of Central Tendency


Mean (center of mass)
  Definition: sum of all the values in the collection divided by the total number of elements in the collection
  Data requirement: at least interval scale, and values that are close to each other
  Existence/Uniqueness: always exists / always unique
  Takes into account every value? Yes
  Affected by outliers? Yes
  Can treat formula algebraically? Yes

Median (center of the array)
  Definition: divides the array into two equal parts
  Data requirement: at least ordinal scale
  Existence/Uniqueness: always exists / always unique
  Takes into account every value? No
  Affected by outliers? No
  Can treat formula algebraically? No

Mode (typical value)
  Definition: most frequent value
  Data requirement: even if nominal scale only
  Existence/Uniqueness: might not exist / not always unique
  Takes into account every value? No
  Affected by outliers? No
  Can treat formula algebraically? No

Measures of Location
A measure of location gives us the percentage of observations in the collection whose values are less than or equal to it. We also commonly refer to these measures of location as Quantiles or Fractiles. The three measures of location are Percentiles, Quartiles, and Deciles.

Measures of Location
Percentiles divide the ordered observations into 100 equal parts. Quartiles divide the ordered observations into 4 equal parts. Deciles divide the ordered observations into 10 equal parts.
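In Python, the standard library function `statistics.quantiles` returns the cut points that divide the data into `n` equal parts. Note that several interpolation conventions exist for quantiles; the sketch below uses the function's default ("exclusive") method, so other textbooks or software may give slightly different values. The scores are made up.

```python
from statistics import quantiles

scores = [55, 60, 62, 68, 70, 75, 80, 85, 90]

quartiles = quantiles(scores, n=4)       # 3 cut points: Q1, Q2, Q3
deciles = quantiles(scores, n=10)        # 9 cut points: D1 ... D9
percentiles = quantiles(scores, n=100)   # 99 cut points: P1 ... P99

print(quartiles)
print(deciles[4], percentiles[49])       # D5 and P50
```

The second quartile, the fifth decile, and the fiftieth percentile all coincide with the median, since each of them has half of the ordered observations below it.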

Measures of Dispersion
A Measure of Dispersion is a descriptive summary measure that helps us characterize the data set in terms of how varied the observations are from each other. It allows us to determine the degree of dispersion of the observations about the center of the distribution. If its value is small, the observations are not too different from each other, so there is a concentration of observations about the center. If its value is large, the observations are very different from each other, so they are widely spread out from the center.

Measures of Dispersion
The Range
The range is the simplest and easiest-to-use measure of dispersion. It is the difference between the highest and the lowest observation in a data set. It is also common practice to present the range by stating the smallest and the largest values in the collection.

Measures of Dispersion
The Variance and Standard Deviation
The variance is a measure of dispersion, so we can use it to describe the variation of the measurements in the collection. It is defined as the average squared deviation, or difference, of each observation from the mean. The squared difference of an observation from the mean gives us an idea of how close this observation is to the mean. A large squared difference indicates that the observation and the mean are far from each other, while a small squared difference indicates that they are close to each other. In fact, when the squared difference is zero (0), the observation and the mean are equal.
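The definition can be sketched as follows. The code divides by the number of observations, which matches "average squared deviation" and is the population variance; when the data are a sample, the sum of squared deviations is usually divided by n - 1 instead. The standard deviation is the square root of the variance. Function names and data are illustrative.

```python
import math


def population_variance(values):
    """Average squared deviation of the observations from their mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


def population_sd(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(population_variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]            # mean is 5

print(population_variance(data))
print(population_sd(data))
```

For these eight values the squared deviations sum to 32, so the variance is 4 and the standard deviation is 2.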

Measures of Dispersion
We can also use the variance to determine whether the mean is a good measure of central tendency. A small variance indicates that the observations are highly concentrated about the mean, so it is appropriate to use the mean to represent all of the values in the collection. If the variance is large, this indicates that, on average, the observations are far from, or very different from, the mean. In this case we cannot consider the mean a good measure of central tendency, because it will not be a suitable representative of all the values in the collection.

Measures of Dispersion
The Coefficient of Variation
The coefficient of variation (CV) is a measure of relative variation. It is the ratio of the standard deviation to the mean, expressed as a percentage: CV = (standard deviation / mean) x 100%. We can use it to compare the variability of two or more data sets even if they have different means or different units of measurement, because the coefficient of variation has no unit.
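A short sketch of such a comparison across different units. It uses the sample standard deviation from Python's `statistics` module, which is an assumption, since the definition above does not say whether the population or the sample standard deviation is meant. The data are made up.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean (sample SD assumed)."""
    return stdev(values) / mean(values) * 100


heights_cm = [150, 160, 170, 180, 190]     # hypothetical heights
weights_kg = [50, 55, 60, 65, 70]          # hypothetical weights

print(f"heights: {coefficient_of_variation(heights_cm):.2f}%")
print(f"weights: {coefficient_of_variation(weights_kg):.2f}%")
```

The heights have the larger standard deviation in absolute terms, yet the weights have the larger coefficient of variation, so relative to their mean the weights are the more variable of the two.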

Measures of Shape: Skewness and Kurtosis


A measure of skewness is a single value that indicates the degree and direction of asymmetry. If it is possible to divide the histogram at the center into two identical halves, wherein each half is a mirror image of the other, then the distribution is called a symmetric distribution. Otherwise, it is called a skewed distribution.

Measures of Shape: Skewness and Kurtosis


If the concentration of the values is at the left end of the distribution and the upper tail of the distribution stretches out more than the lower tail, then the distribution is said to be positively skewed or skewed to the right. Conversely, if the concentration of the values is at the right end of the distribution and the lower tail stretches out more than the upper tail, then the distribution is said to be negatively skewed or skewed to the left.

Measures of Shape: Skewness and Kurtosis


Skewed to the right

Measures of Shape: Skewness and Kurtosis


Skewed to the left

Measures of Shape: Skewness and Kurtosis


Interpreting
If skewness is positive, the data are positively skewed or skewed right, meaning that the right tail of the distribution is longer than the left. If skewness is negative, the data are negatively skewed or skewed left, meaning that the left tail is longer. If skewness = 0, the data are perfectly symmetrical. But a skewness of exactly zero is quite unlikely for real-world data, so how can you interpret the skewness number? Bulmer, M. G., Principles of Statistics (Dover, 1979), a classic, suggests this rule of thumb:
If skewness is less than -1 or greater than +1, the distribution is highly skewed.
If skewness is between -1 and -1/2 or between +1/2 and +1, the distribution is moderately skewed.
If skewness is between -1/2 and +1/2, the distribution is approximately symmetric.

The relative positions of the mean, median, and mode usually reflect the skewness:
Mean = Median = Mode ---- Symmetric Distribution
Mean > Median > Mode ---- Positively Skewed
Mean < Median < Mode ---- Negatively Skewed
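The rule of thumb can be applied in code once a skewness value is available. The slides do not give a formula for skewness, so the sketch below assumes the common moment coefficient, the third central moment divided by the 1.5 power of the second central moment; other definitions exist. Function names and data are illustrative, and the data must not all be equal.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5


def classify(g1):
    """Apply the rule of thumb to a skewness value."""
    if abs(g1) > 1:
        return "highly skewed"
    if abs(g1) >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]   # one large value stretches the upper tail

for data in (symmetric, right_tailed):
    g1 = skewness(data)
    print(round(g1, 3), classify(g1))
```

The second data set has a positive skewness well above +1, consistent with its long right tail.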

Measures of Shape: Skewness and Kurtosis


Karl Pearson (1857-1936), the founder of biometrics and a major contributor to the theory of modern applied statistics, coined the term kurtosis in 1905. The term came from the Greek word kurtos, meaning convex. Pearson used it to describe the shape of the hump of a relative frequency distribution as compared to the normal distribution.

Measures of Shape: Skewness and Kurtosis


Mesokurtic: the hump is the same as that of the normal curve. It is neither too flat nor too peaked. (Section C)
Leptokurtic: the curve is more peaked and the hump is narrower or sharper than that of the normal curve. The prefix lepto came from the Greek word leptos, meaning small or thin. (Section E)
Platykurtic: the curve is less peaked and the hump is flatter than that of the normal curve. The prefix platy came from the Greek word platus, meaning wide or flat. (Section D)

Sampling Distributions
Basic Concepts
In Inferential Statistics, we come up with generalizations about the population using the information that we collect from a sample. We will require this sample to be a random sample.

Sampling Distributions
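The sampling distribution of a statistic is its probability distribution. If the statistic is discrete, the distribution is described by a probability mass function (pmf); if it is continuous, by a probability density function (pdf). The standard deviation of a statistic is called its standard error.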
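For a very small population, the sampling distribution of the sample mean can be listed exhaustively. The sketch below assumes a made-up population of three values and random samples of size 2 drawn with replacement, so that all nine ordered samples are equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
import math

population = [2, 4, 6]     # hypothetical population
sample_size = 2

# Every possible ordered sample, each equally likely under the assumptions.
samples = list(product(population, repeat=sample_size))
means = [Fraction(sum(s), sample_size) for s in samples]

# pmf of the sample mean: probability of each distinct value.
counts = Counter(means)
pmf = {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

expected_value = sum(m * p for m, p in pmf.items())
variance = sum((m - expected_value) ** 2 * p for m, p in pmf.items())
standard_error = math.sqrt(variance)   # standard deviation of the statistic

for m, p in pmf.items():
    print(m, p)
print(expected_value, variance, standard_error)
```

The sample mean takes the values 2, 3, 4, 5, and 6 with probabilities 1/9, 2/9, 3/9, 2/9, and 1/9. Its expected value equals the population mean of 4, and its variance of 4/3 is the population variance of 8/3 divided by the sample size.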

Tests of Hypotheses
Basic Concepts in Testing Statistical Hypotheses
The first step in hypothesis testing is to identify and state the statistical hypotheses to be tested. A statistical hypothesis is a conjecture concerning one or more populations whose veracity can be established using sample data. The Null Hypothesis, denoted H0, is a statistical hypothesis that the researcher doubts to be true. The Alternative Hypothesis, denoted Ha, is the operational statement of the theory that the researcher believes to be true and wishes to prove; it is the contradiction of the null hypothesis.