Statistics Basics UNIT-1

History of Statistics
The word statistics is derived from the Latin word Status or the Italian word Statista, both of which mean a political state or a government. In the past, statistics was used mainly by rulers. Its application was very limited, but rulers and kings needed information about the lands, agriculture, commerce and population of their states to assess their military potential, wealth, taxation and other aspects of government. Gottfried Achenwall used the word Statistik at a German university in 1749 to mean the political science of different countries. In 1771 W. Hooper (an Englishman) used the word statistics in his translation of Elements of Universal Erudition, written by Baron B.F. Bieford. In that book statistics is defined as the science that teaches us the political arrangement of all the modern states of the known world. There is a big gap between the old statistics and modern statistics, but the old statistics is still used as a part of present-day statistics.

Meanings of Statistics
The word statistics has three different meanings (senses), which are discussed below:
(1) Plural Sense (2) Singular Sense (3) Plural of the Word Statistic

(1) Plural Sense: In the plural sense, the word statistics refers to numerical facts and figures collected in a systematic manner with a definite purpose in any field of study. In this sense, statistics are aggregates of facts expressed in numerical form. For Example: Statistics on industrial production, statistics on the population growth of a country in different years etc.

(2) Singular Sense: In the singular sense, it refers to the science comprising the methods used in the collection, analysis, interpretation and presentation of numerical data. These methods are used to draw conclusions about population parameters. For Example: Suppose we want to study the distribution of weights of students in a certain college. First of all, we collect the information on the weights, which may be obtained from the records of the college or directly from the students. Such a large number of weight figures will confuse the mind. In this situation we may arrange the weights in groups such as 50 Kg to 60 Kg, 60 Kg to 70 Kg and so on, and find the number of students falling in each group. This step is called presentation of data. We may go still further and compute averages and some other measures which give a fuller description of the original data.

(3) Plural of the Word Statistic: The word statistics is also used as the plural of the word statistic, which refers to a numerical quantity like the mean, median, variance etc. calculated from sample values. For Example: If we select 15 students from a class of 80 students, measure their heights and find the average height, this average would be a statistic.

Kinds or Branches of Statistics: Statistics may be divided into two main branches:
(1) Descriptive Statistics
(2) Inferential Statistics

(1) Descriptive Statistics: Descriptive statistics deals with the collection of data, its presentation in various forms such as tables, graphs and diagrams, and the finding of averages and other measures which describe the data. For Example: Industrial statistics, population statistics, trade statistics etc. Businessmen, for instance, make use of descriptive statistics in presenting their annual reports, final accounts and bank statements.

(2) Inferential Statistics: Inferential statistics deals with the techniques used for analysing data, making estimates and drawing conclusions from limited information taken on a sample basis, and testing the reliability of the estimates. For Example: Suppose we want to have an idea about the percentage of illiterates in our country. We take a sample from the population and find the proportion of illiterates in the sample. This sample proportion, with the help of probability, enables us to make some inferences about the population proportion. This study belongs to inferential statistics.

Definition of Statistics: Statistics, like many other sciences, is a developing discipline. It is not static. It has developed gradually during the last few centuries, and at different times it has been defined in different ways. Some definitions of the past look very strange today, but those definitions had their place in their own time. Defining a subject has always been a difficult task, and a good definition of today may be discarded in future. Some of the definitions are reproduced here:

(1) The kings and rulers of ancient times were interested in their manpower. They conducted censuses to get information about their populations, and used this information to calculate their strength and ability for wars. In those days statistics was defined as the science of kings, of politics and of statecraft.

(2) A.L. Bowley defined statistics as "the science of counting". This definition places the entire stress on counting only. A common man also thinks as if statistics is nothing but counting. This used to be the situation, but a very long time ago. Statistics today is not the mere counting of people, animals, trees and fighting forces. It has now grown into a rich set of methods for data analysis and interpretation.

(3) A.L. Bowley has also defined statistics as "the science of averages". This definition is very simple but it covers only one area of statistics. Averages are simple but very important in statistics. Experts are interested in average death rates, average birth rates, the average increase in population, the average increase in per capita income, the average increase in the standard of living and cost of living, the average development rate, the average inflation rate, the average production of rice per acre, the average literacy rate and many other averages in different fields of practical life. But statistics is not limited to averages only. There are many other statistical tools like measures of variation, measures of correlation, measures of independence etc. Thus this definition is weak and incomplete and has been buried in the past.

(4) Prof. Boddington has defined statistics as "the science of estimates and probabilities". This definition covers a major part of statistics and is close to modern statistics. But it is not complete because it stresses only probability. There are some areas of statistics in which probability is not used.

(5) A definition due to W.I. King is: "The science of statistics is the method of judging collective, natural or social phenomena from the results obtained from the analysis or enumeration or collection of estimates." This definition is close to modern statistics, but it does not cover the entire scope of modern statistics.

Secrist has given a detailed definition of statistics in the plural sense. His definition is given on the previous page. He has not given any importance to statistics in the singular sense. Statistics in both the singular and the plural sense has been combined in the following definition, which is accepted as the modern definition of statistics:

"Statistics are the numerical statements of facts capable of analysis and interpretation, and the science of statistics is the study of the principles and the methods applied in collecting, presenting, analysing and interpreting numerical data in any field of inquiry."

Characteristics of Statistics: Some of its important characteristics are given below:

(1) Statistics are aggregates of facts.
(2) Statistics are numerically expressed.
(3) Statistics are affected to a marked extent by a multiplicity of causes.
(4) Statistics are enumerated or estimated according to a reasonable standard of accuracy.
(5) Statistics are collected for a predetermined purpose.
(6) Statistics are collected in a systematic manner.
(7) Statistics must be comparable to each other.
Limitations of Statistics: The important limitations of statistics are:
(1) Statistical laws are true on average. Statistics are aggregates of facts, so a single observation is not a statistic; statistics deals with groups and aggregates only.
(2) Statistical methods are best applicable to quantitative data.
(3) Statistical methods cannot be applied to heterogeneous data.
(4) If sufficient care is not exercised in collecting, analysing and interpreting the data, statistical results might be misleading.
(5) Only a person who has expert knowledge of statistics can handle statistical data efficiently.
(6) Some errors are possible in statistical decisions. Inferential statistics in particular involves certain errors, and we do not know whether an error has been committed or not.

Functions or Uses of Statistics:
(1) Statistics helps in providing a better understanding and exact description of a phenomenon of nature.
(2) Statistics helps in the proper and efficient planning of a statistical inquiry in any field of study.
(3) Statistics helps in collecting appropriate quantitative data.
(4) Statistics helps in presenting complex data in a suitable tabular, diagrammatic and graphic form for easy and clear comprehension of the data.
(5) Statistics helps in understanding the nature and pattern of variability of a phenomenon through quantitative observations.
(6) Statistics helps in drawing valid inferences about population parameters from sample data, along with a measure of their reliability.
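The last use listed above, an inference from a sample together with a measure of its reliability, can be sketched in a few lines of Python. This is only an illustration: the ten sample values are hypothetical, the function name proportion_estimate is our own, and the standard error formula sqrt(p(1 - p)/n) assumes a simple random sample.

```python
import math

def proportion_estimate(sample):
    """Return the sample proportion of True values and its estimated standard error."""
    n = len(sample)
    p_hat = sum(sample) / n                   # True counts as 1, False as 0
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # assumes simple random sampling
    return p_hat, se

# Hypothetical sample of 10 persons; True means the person is illiterate.
sample = [True, False, False, True, False, False, False, True, False, False]

p_hat, se = proportion_estimate(sample)
print(f"sample proportion = {p_hat:.2f}, standard error = {se:.3f}")
```

The sample proportion is the estimate of the population proportion, and the standard error indicates how much that estimate would be expected to vary from one sample to another.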
Importance of Statistics in Different Fields: Statistics plays a vital role in every field of human activity. Statistics has an important role in determining the existing position of per capita income, unemployment, population growth rate, housing, schooling, medical facilities etc. in a country. Statistics now holds a central position in almost every field like Industry, Commerce, Trade, Physics, Chemistry, Economics, Mathematics, Biology, Botany, Psychology, Astronomy etc., so the application of statistics is very wide. We now discuss some important fields in which statistics is commonly applied.

(1) In Business: Statistics plays an important role in business. A successful businessman must be very quick and accurate in decision making. He must know what his customers want, and should therefore know what to produce and sell and in what quantities. Statistics helps the businessman to plan production according to the tastes of the customers, and the quality of the products can also be checked more efficiently by using statistical methods. So all the activities of the businessman are based on statistical information. He can make correct decisions about the location of the business, marketing of the products, financial resources etc.

(2) In Economics: Statistics plays an important role in economics. Economics largely depends upon statistics. National income accounts are multipurpose indicators for economists and administrators, and statistical methods are used in the preparation of these accounts. In economic research, statistical methods are used for collecting and analysing data and for testing hypotheses. The relationship between supply and demand is studied by statistical methods; imports and exports, the inflation rate and per capita income are problems which require a good knowledge of statistics.

(3) In Mathematics: Statistics plays a central role in almost all natural and social sciences. The methods of the natural sciences are the most reliable, but the conclusions drawn from them are only probable, because they are based on incomplete evidence. Statistics helps in describing these measurements more precisely. Statistics is a branch of applied mathematics. A large number of statistical methods like probability, averages, dispersion, estimation etc. are used in mathematics, and different techniques of pure mathematics like integration, differentiation and algebra are used in statistics.

(4) In Banking: Statistics plays an important role in banking. Banks make use of statistics for a number of purposes. Banks work on the principle that all the people who deposit their money with the bank do not withdraw it at the same time. The bank earns profits out of these deposits by lending to others at interest. Bankers use statistical approaches based on probability to estimate the number of depositors and their claims for a certain day.

(5) In State Management (Administration): Statistics is essential for a country. Different policies of the government are based on statistics, and statistical data are now widely used in taking all administrative decisions. Suppose the government wants to revise the pay scales of employees in view of an increase in the cost of living; statistical methods will be used to determine the rise in the cost of living. The preparation of federal and provincial government budgets mainly depends upon statistics, because it helps in estimating the expected expenditures and the revenue from different sources. So statistics are the eyes of the administration of the state.

(6) In Accounting and Auditing: Accounting is impossible without exactness. But for decision-making purposes so much precision is not essential; the decision may be taken on the basis of approximation, known as statistics. The correction of the values of current assets is made on the basis of the purchasing power of money or its current value. In auditing, sampling techniques are commonly used. An auditor determines the sample size of the books to be audited on the basis of error.

(7) In Natural and Social Sciences: Statistics plays a vital role in almost all the natural and social sciences. Statistical methods are commonly used for analysing the results of experiments and testing their significance in Biology, Physics, Chemistry, Mathematics, Meteorology, Research, Chambers of Commerce, Sociology, Business, Public Administration, Communication and Information Technology etc.

(8) In Astronomy: Astronomy is one of the oldest branches of statistical study; it deals with the measurement of the distances, sizes, masses and densities of heavenly bodies by means of observations. During these measurements errors are unavoidable, so the most probable measurements are found by using statistical methods. Example: The distance of the moon from the earth is measured in this way. Since old days astronomers have been using statistical methods like the method of least squares for finding the movements of stars.
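The idea of a "most probable measurement" can be illustrated with the simplest case of least squares: when one quantity is measured several times with error, the value that makes the sum of squared deviations smallest is the arithmetic mean of the measurements. The sketch below uses hypothetical distance readings (in thousands of km) and our own helper name sum_of_squares; it is an illustration of the principle, not a description of how astronomers actually process their observations.

```python
def sum_of_squares(measurements, candidate):
    """Sum of squared deviations of the measurements from a candidate value."""
    return sum((m - candidate) ** 2 for m in measurements)

# Hypothetical repeated measurements of one distance, in thousands of km.
measurements = [384.2, 384.6, 384.3, 384.5, 384.4]

# For a single unknown quantity, the least-squares estimate is the mean.
mean = sum(measurements) / len(measurements)

# Compare the mean with a few other candidate values.
for candidate in (mean - 0.1, mean, mean + 0.1):
    print(round(candidate, 2), round(sum_of_squares(measurements, candidate), 4))
```

Among the candidates printed, the mean gives the smallest sum of squares; moving away from it in either direction increases the total.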
Application of Inferential Statistics in Managerial Decision Making: The main objective of Business Statistics is to make inferences (e.g., predictions, decisions) about certain characteristics of a population based on the information contained in a random sample from that population. The condition of randomness is essential to make sure the sample is representative of the population. Statistical inference refers to extending the knowledge obtained from a random sample to the whole population. Its main application is in testing hypotheses about a given population. Statistical inference guides the selection of appropriate statistical models. Models and data interact in statistical work. Inference from data can be thought of as the process of selecting a reasonable model, including a statement in probability language of how confident one can be about the selection.

Estimation and Hypothesis Testing: Inference in statistics is of two types. The first is estimation, which involves the determination, with a possible error due to sampling, of the unknown value of a population characteristic, such as the proportion having a specific attribute or the average value of some numerical measurement. To express the accuracy of the estimates of population characteristics, one must also compute the standard errors of the estimates. The second type of inference is hypothesis testing. It involves the definition of a hypothesis as one set of possible population values and an alternative as a different set. There are many statistical procedures for determining, on the basis of a sample, whether the true population characteristic belongs to the set of values in the hypothesis or in the alternative.

Some Basic Definitions in Statistics:

Constant: A quantity which can assume only one value is called a constant. It is usually denoted by the first letters of the alphabet: a, b, c. For Example: The value of π = 3.14159 (approximately 22/7) and the value of e = 2.71828.

Variable: A quantity which can vary from one individual or object to another is called a variable. It is usually denoted by the last letters of the alphabet: x, y, z. For Example: Heights and weights of students, income, temperature, number of children in a family etc.

Continuous Variable: A variable which can assume each and every value within a given range is called a continuous variable. It can occur in decimals. For Example: Heights and weights of students, the speed of a bus, the age of a shopkeeper, the lifetime of a T.V. etc.

Continuous Data: Data which can be described by a continuous variable are called continuous data. For Example: Weights of 50 students in a class.

Discrete Variable: A variable which can assume only some specific values within a given range is called a discrete variable. It cannot occur in decimals; it occurs in whole numbers. For Example: Number of students in a class, number of flowers on a tree, number of houses in a street, number of chairs in a room etc.

Discrete Data: Data which can be described by a discrete variable are called discrete data. For Example: Number of students in a college.

Quantitative Variable: A characteristic which varies only in magnitude from one individual to another is called a quantitative variable. It can be measured. For Example: Wages, prices, heights, weights etc.

Qualitative Variable: A characteristic which varies only in quality from one individual to another is called a qualitative variable. It cannot be measured. For Example: Beauty, marital status, rich, poor, smell etc.
Collection of Statistical Data

Statistical Data: A sequence of observations made on a set of objects included in a sample drawn from a population is known as statistical data.
(1) Ungrouped Data: Data which have not been arranged in a systematic order are called raw data or ungrouped data.
(2) Grouped Data: Data presented in the form of a frequency distribution are called grouped data.

Collection of Data: The first step in any enquiry (investigation) is the collection of data. The data may be collected for the whole population or for a sample only; it is mostly collected on a sample basis. Collection of data is a very difficult job. The enumerator or investigator is the well-trained person who collects the statistical data. The respondents (informants) are the persons from whom the information is collected.

Types of Data: There are two types (sources) for the collection of data.
(1) Primary Data (2) Secondary Data

(1) Primary Data: Primary data are first-hand information collected, compiled and published by an organization for some purpose. They are the most original data in character and have not undergone any sort of statistical treatment. Example: Population census reports are primary data because these are collected, compiled and published by the population census organization.

(2) Secondary Data: Secondary data are second-hand information which has already been collected by someone (an organization) for some purpose and is available for the present study. Secondary data are not pure in character and have undergone some treatment at least once. Example: The Economic Survey of India is secondary data because its data are collected by more than one organization, like the NSSO, the Indian Statistical Organisation, the Board of Revenue, the Banks etc.

Methods of Collecting Primary Data: Primary data are collected by the following methods:
1. Observation method
2. Interview method
3. Through questionnaires
4. Through schedules
5. Other methods, which include:
   Warranty cards
   Distributor audits
   Pantry audits
   Consumer panels
   Using mechanical devices
   Through projective techniques
   Depth interviews
   Content analysis
OBSERVATION METHOD: The observation method is the most commonly used method, especially in studies relating to the behavioural sciences. In a general way we all observe things around us, but this sort of observation is not scientific observation. Observation becomes a scientific tool and a method of data collection for the researcher when it serves a formulated research purpose, is systematically planned and recorded, and is subjected to checks and controls on validity and reliability. Under the observation method, the information is sought by way of the investigator's own direct observation, without asking the respondent.

INTERVIEW METHOD: The interview method of collecting data involves the presentation of oral-verbal stimuli and replies in terms of oral-verbal responses. This method can be used through personal interviews and, if possible, through telephone interviews.

QUESTIONNAIRES METHOD: This method of data collection is quite popular, particularly in the case of big enquiries. It is adopted by private individuals, research workers, private and public organisations and even by governments. In this method a questionnaire is sent (usually by post) to the persons concerned with a request to answer the questions and return the questionnaire. A questionnaire consists of a number of questions printed or typed in a definite order on a form or set of forms. The respondents are expected to read and understand the questions and write down their replies in the space meant for the purpose in the questionnaire itself. The respondents have to answer the questions on their own. The method of collecting data by mailing questionnaires to respondents is most extensively employed in various economic and business surveys.

SCHEDULES: This method of data collection is very much like the collection of data through questionnaires, with the small difference that schedules (proformas containing a set of questions) are filled in by enumerators who are specially appointed for the purpose. These enumerators go to the respondents with the schedules, put to them the questions from the proforma in the order in which they are listed, and record the replies in the space meant for them in the proforma. In certain situations, schedules may be handed over to the respondents and the enumerators may help them in recording their answers to the various questions.
Enumerators explain the aims and objects of the investigation and also remove the difficulties which any respondent may feel in understanding the implications of a particular question or the definition or concept of difficult terms. This method requires the selection of enumerators for filling up schedules or assisting respondents to fill up schedules, and as such the enumerators should be very carefully selected. The enumerators should be trained to perform their job well, and the nature and scope of the investigation should be explained to them thoroughly so that they understand the implications of the different questions put in the schedule. Enumerators should be intelligent and must possess the capacity for cross-examination in order to find out the truth. Above all, they should be honest, sincere and hardworking, and should have patience and perseverance. This method of data collection is very useful in extensive enquiries and can lead to fairly reliable results.

DIFFERENCE BETWEEN QUESTIONNAIRES AND SCHEDULES: Both the questionnaire and the schedule are popularly used methods of collecting data in research surveys. There is much resemblance in the nature of these two methods, and this has led many people to remark that from a practical point of view the two methods can be taken to be the same. But from the technical point of view there are differences between the two:
Questionnaire Method
1. The questionnaire is generally sent through the mail to informants to be answered.
2. Data collection is cheap.
3. Non-response is usually high, as many people do not respond.
4. It is not clear who replies.
5. The questionnaire method is likely to be very slow, since many respondents do not return the questionnaire.
6. No personal contact is possible in the case of a questionnaire.

Schedule Method
1. The schedule is generally filled in by the research worker or enumerator, who can interpret the questions when necessary.
2. Data collection is more expensive, as money is spent on enumerators.
3. Non-response is very low, because the schedule is filled in by enumerators.
4. The identity of the respondent is known.
5. Information is collected well in time.
6. Direct personal contact is established.
Methods of Collecting Secondary Data: Secondary data are collected from the following sources:
(1) Official: e.g. the publications of the Statistical Division, Ministry of Finance, the Federal Bureaus of Statistics, Ministries of Food, Agriculture, Industry, Labor etc.
(2) Semi-Official: e.g. State Bank, Railway Board, Central Cotton Committee, Boards of Economic Enquiry etc.
(3) Publications of Trade Associations, Chambers of Commerce etc.
(4) Technical and Trade Journals and Newspapers.
(5) Research Organizations such as Universities and other institutions.
Difference between Primary and Secondary Data: The difference between primary and secondary data is only a change of hand. Primary data are first-hand information collected directly from the source. They are the most original data in character and have not undergone any sort of statistical treatment, while secondary data are obtained from some other sources or agencies. They are not pure in character and have undergone some treatment at least once. For Example: Suppose we are interested in finding the average age of MS students. We can collect the ages by two methods: either directly from each student personally, or from the university records. The data collected by direct personal investigation are called primary data, and the data obtained from the university records are called secondary data.

Editing of Data: After collecting the data from either a primary or a secondary source, the next step is editing. Editing means the examination of the collected data to discover any errors and mistakes before presenting it. It has to be decided beforehand what degree of accuracy is wanted and what extent of error can be tolerated in the inquiry. The editing of secondary data is simpler than that of primary data.
Classification of Data: The process of arranging data into homogeneous groups or classes according to some common characteristics present in the data is called classification. For Example: In the process of sorting letters in a post office, the letters are classified according to cities and further arranged according to streets.

Bases of Classification: There are four important bases of classification:
(1) Qualitative Base (2) Quantitative Base (3) Geographical Base (4) Chronological or Temporal Base

(1) Qualitative Base: The data are classified according to some quality or attribute such as sex, religion, literacy, intelligence etc.
(2) Quantitative Base: The data are classified by quantitative characteristics like heights, weights, ages, income etc.
(3) Geographical Base: The data are classified by geographical region or location, like states, provinces, cities, countries etc.
(4) Chronological or Temporal Base: The data are classified or arranged by their time of occurrence, such as years, months, weeks, days etc. For Example: Time series data.

Types of Classification:
(1) One-way Classification: If we classify observed data keeping in view a single characteristic, this type of classification is known as one-way classification. For Example: The population of the world may be classified by religion as Muslims, Christians etc.
(2) Two-way Classification: If we consider two characteristics at a time in order to classify the observed data, we are doing two-way classification. For Example: The population of the world may be classified by Religion and Sex.
(3) Multi-way Classification: If we consider more than two characteristics at a time to classify the observed data, we are dealing with multi-way classification. For Example: The population of the world may be classified by Religion, Sex and Literacy.
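The three types of classification can be sketched in Python by counting the same records on one, two or three characteristics. The five records below are hypothetical and only serve to show the mechanics; the variable names are our own.

```python
from collections import Counter

# Hypothetical records: (religion, sex, literacy) for five persons.
people = [
    ("Muslim", "Male", "Literate"),
    ("Christian", "Female", "Literate"),
    ("Muslim", "Female", "Illiterate"),
    ("Hindu", "Male", "Literate"),
    ("Muslim", "Male", "Illiterate"),
]

# One-way classification: a single characteristic (religion).
one_way = Counter(religion for religion, _, _ in people)

# Two-way classification: two characteristics at a time (religion and sex).
two_way = Counter((religion, sex) for religion, sex, _ in people)

# Multi-way classification: all three characteristics together.
multi_way = Counter(people)

print(one_way)
print(two_way)
print(multi_way)
```

Each Counter maps a class (or combination of classes) to the number of persons falling in it, which is exactly the information a table would later display.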
Tabulation of Data: The process of placing classified data into tabular form is known as tabulation. A table is a systematic arrangement of statistical data in rows and columns. Rows are horizontal arrangements whereas columns are vertical arrangements. A table may be simple, double or complex depending upon the type of classification.

Types of Tabulation:
(1) Simple Tabulation or One-way Tabulation: When the data are tabulated according to one characteristic, it is said to be simple tabulation or one-way tabulation. For Example: Tabulation of data on the population of the world classified by one characteristic like Religion is an example of simple tabulation.
(2) Double Tabulation or Two-way Tabulation: When the data are tabulated according to two characteristics at a time, it is said to be double tabulation or two-way tabulation. For Example: Tabulation of data on the population of the world classified by two characteristics like Religion and Sex is an example of double tabulation.
(3) Complex Tabulation: When the data are tabulated according to many characteristics, it is said to be complex tabulation. For Example: Tabulation of data on the population of the world classified by three characteristics like Religion, Sex and Literacy is an example of complex tabulation.

Construction of Statistical Table: A statistical table has at least four major parts and some other minor parts.
(1) The Title (2) The Captions (3) The Stub (4) The Body (5) Prefatory Notes/Head Notes (6) Foot Notes (7) Source Notes

The general sketch of a table indicating its necessary parts is shown below:

----------------------------------------------
                  THE TITLE
               (Prefatory Notes)
----------------------------------------------
Box Head:   Row Captions  |  Column Captions
----------------------------------------------
            Stub Entries  |  The Body
----------------------------------------------
Foot Notes
Source Notes

(1) The Title: The title is the main heading, written in capitals and shown at the top of the table. It must explain the contents of the table and throw light on the table as a whole. Different parts of the heading can be separated by commas; no full stop is used in the title.
(2) The Captions: The vertical headings and subheadings of the columns are called column captions. The space where these column headings are written is called the box head. Only the first letter of the box head is in capitals and the remaining words must be written in small letters.
(3) The Stub: The horizontal headings and subheadings of the rows are called row captions, and the space where these row headings are written is called the stub.
(4) The Body: This is the main part of the table, which contains the numerical information classified with respect to the row and column captions.
(5) Prefatory Notes/Head Notes: A statement given below the title and enclosed in brackets, usually describing the units of measurement, is called a prefatory note.
(6) Foot Notes: These appear immediately below the body of the table and provide further explanation.
(7) Source Notes: The source note is given at the end of the table and indicates the source from which the information has been taken. It includes information about the compiling agency, publication etc.

General Rules of Tabulation:
(1) A table should be simple and attractive. There should be no need for further explanations (details).
(2) Proper and clear headings for columns and rows should be used.
(3) Suitable approximation may be adopted and figures may be rounded off.
(4) The unit of measurement should be well defined.
(5) If the observations are large in number, they can be broken into two or three tables.
(6) Thick lines should be used to separate the data under big classes and thin lines to separate the sub-classes of data.
Difference Between Classification and Tabulation:
(1) First the data are classified and then they are presented in tables, so classification and tabulation in fact go together. Classification is the basis for tabulation.
(2) Tabulation is a mechanical function of classification, because in tabulation the classified data are placed in rows and columns.
(3) Classification is a process of statistical analysis, whereas tabulation is a process of presenting the data in a suitable form.
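As a rough illustration of how classified counts are placed into rows and columns, the Python sketch below lays out a two-way table with a title, column captions, a stub and a body, and adds row and column totals. The counts are hypothetical, and format_table is our own helper name, not a standard routine.

```python
def format_table(title, row_labels, col_labels, body):
    """Lay out a two-way table as text: title, column captions, stub, body and totals."""
    width = 12
    lines = [title.upper()]  # the title is written in capitals
    # Box head: an empty cell above the stub, then the column captions.
    lines.append("".ljust(width) + "".join(c.rjust(width) for c in col_labels + ["Total"]))
    # One line per stub entry, followed by that row of the body and its total.
    for label, row in zip(row_labels, body):
        cells = row + [sum(row)]
        lines.append(label.ljust(width) + "".join(str(v).rjust(width) for v in cells))
    # Final line: column totals and the grand total.
    col_totals = [sum(col) for col in zip(*body)]
    cells = col_totals + [sum(col_totals)]
    lines.append("Total".ljust(width) + "".join(str(v).rjust(width) for v in cells))
    return "\n".join(lines)

# Hypothetical counts for 100 persons classified by sex (rows) and literacy (columns).
table = format_table(
    "Sample of 100 persons by sex and literacy",
    ["Male", "Female"],
    ["Literate", "Illiterate"],
    [[30, 20], [35, 15]],
)
print(table)
```

The classification step decides which cell each person belongs to; the tabulation step, shown here, only arranges the resulting counts.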
Frequency Distribution: A frequency distribution is a tabular arrangement of data into classes according to size or magnitude, along with the corresponding class frequencies (the number of values that fall in each class).
Ungrouped Data or Raw Data: Data which have not been arranged in a systematic order are called ungrouped or raw data.
Grouped Data: Data presented in the form of a frequency distribution are called grouped data.
Array: Numerical raw data arranged in ascending or descending order is called an array.
Example: Array the following data in ascending and descending order: 6, 4, 13, 7, 10, 16, 19.
Solution: Array in ascending order is 4, 6, 7, 10, 13, 16, 19. Array in descending order is 19, 16, 13, 10, 7, 6, 4.
Class Limits/Class Interval: The variant values of the classes or groups are called the class limits. The smaller value of the class is called the lower class limit and the larger value of the class is called the upper class limit. Class limits are also called inclusive classes. For Example: Let us take the class 10-19; the smaller value 10 is the lower class limit and the larger value 19 is the upper class limit.
Class Boundaries: The true values, which describe the actual class limits of a class, are called class boundaries. The smaller true value is called the lower class boundary and the larger true value is called the upper class boundary of the class. It is important to note that the upper class boundary of a class coincides with the lower class boundary of the next class. Class boundaries are also known as exclusive classes. For Example:
Weights in Kg    No of Students
60-65            8
65-70            12
70-75            5
Total            25
A student whose weight is between 60 kg and 64.5 kg would be included in the 60-65 class. A student whose weight is 65 kg would be included in the next class, 65-70.
Open-end Classes: A class that has either no lower class limit or no upper class limit in a frequency table is called an open-end class. We do not like to use open-end classes in practice, because they create problems in calculation. For Example:
Weights (Pounds)    No of Persons
Below 110           6
110-120             12
120-130             20
130-140             10
140 and Above       2
Class Mark or Mid Point: The class mark or mid point is the mean of the lower and upper class limits or boundaries, so it divides the class into two equal parts. It is obtained by dividing the sum of the lower and upper class limits (or class boundaries) of a class by 2. For Example: The class mark or mid point of the class 60-69 is (60 + 69)/2 = 64.5.
Size of Class Interval: The difference between the upper and lower class boundaries (not between class limits) of a class, or the difference between two successive mid points, is called the size of the class interval.
Types of Class Interval: Classes can be formed in two ways:
Exclusive Class Interval: When the class intervals are fixed in such a way that the upper limit of one class is the lower limit of the next class, it is termed the exclusive method of class interval. Example:
Marks (Percentage)    No. of Students
0-10                  15
10-20                 17
20-30                 22
30-40                 30
40-50                 39
50-60                 45
Inclusive Class Interval: In the case of inclusive class intervals, the upper limit of a class is not equal to the lower limit of the next class. Example:
Marks (Percentage)    No. of Students
0-9                   5
10-19                 8
20-29                 7
30-39                 13
40-49                 25
Construction of Frequency Distribution: The following steps are involved in the construction of a frequency distribution.
(1) Find the range of the data: The range is the difference between the largest and the smallest values.
(2) Decide the approximate number of classes into which the data are to be grouped: There are no hard and fast rules for the number of classes. In most cases we have 5 to 20 classes. H.A. Sturges has given a formula for determining the approximate number of classes:
K = 1 + 3.322 log N
Where K = Number of classes and log N = Logarithm (base 10) of the total number of observations.
For Example: If the total number of observations is 50, the number of classes would be
K = 1 + 3.322 log N
K = 1 + 3.322 log 50
K = 1 + 3.322 (1.69897)
K = 1 + 5.644
K = 6.644, or 7 classes approximately.
(3) Determine the approximate class interval size: The size of the class interval is obtained by dividing the range of the data by the number of classes, and is denoted by h:
h = Range / Number of Classes
In case of a fractional result, the next higher whole number is taken as the size of the class interval.
(4) Decide the starting point: The lower class limit or class boundary of the first class should cover the smallest value in the raw data. It is usually a multiple of the class interval size. For Example: 0, 5, 10, 15, 20 etc. are commonly used.
(5) Determine the remaining class limits (boundaries): When the lower class boundary of the lowest class has been decided, the upper class boundary is computed by adding the class interval size to the lower class boundary. The remaining lower and upper class limits are determined by adding the class interval size repeatedly until the largest value of the data is covered by a class.
(6) Distribute the data into respective classes: All the observations are marked into their respective classes by using the Tally Bars (Tally Marks) method, which is suitable for tabulating the observations into classes. The number of tally bars is counted to get the frequency of each class. The frequencies of all the classes are noted to get the grouped data or frequency distribution of the data. The total of the frequency column must be equal to the number of observations.
Example (Construction of Frequency Distribution): Construct a frequency distribution with a suitable class interval size for the marks obtained by 50 students of a class, given below:
23, 50, 38, 42, 63, 75, 12, 33, 26, 39, 35, 47, 43, 52, 56, 59, 64, 77, 15, 21, 51, 54, 72, 68, 36, 65, 52, 60, 27, 34, 47, 48, 55, 58, 59, 62, 51, 48, 50, 41, 57, 65, 54, 43, 56, 44, 30, 46, 67, 53
Solution: Arrange the marks in ascending order as
12, 15, 21, 23, 26, 27, 30, 33, 34, 35, 36, 38, 39, 41, 42, 43, 43, 44, 46, 47, 47, 48, 48, 50, 50, 51, 51, 52, 52, 53, 54, 54, 55, 56, 56, 57, 58, 59, 59, 60, 62, 63, 64, 65, 65, 67, 68, 72, 75, 77
Minimum Value = 12
Maximum Value = 77
Range = Maximum Value - Minimum Value = 77 - 12 = 65
Number of Classes = 1 + 3.322 log N = 1 + 3.322 log 50 = 1 + 3.322 (1.69897) = 1 + 5.644 = 6.644, or 7 classes approximately.
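Class Interval Size (h) = Range / Number of Classes = 65 / 7 = 9.3 or 10
The frequency distribution table has the columns: Marks (Class Limits, C.L.), Tally Marks, Number of Students (Frequency, f), Class Boundary (C.B.) and Class Marks (x).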
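The construction steps for this example can be sketched in Python. This is only an illustrative sketch: the function names are hypothetical, the starting point 10 is taken from the note on class limits 19 and 20, and the classes are assumed to be inclusive integer classes (10-19, 20-29, and so on).

```python
import math

marks = [23, 50, 38, 42, 63, 75, 12, 33, 26, 39, 35, 47, 43, 52, 56, 59, 64,
         77, 15, 21, 51, 54, 72, 68, 36, 65, 52, 60, 27, 34, 47, 48, 55, 58,
         59, 62, 51, 48, 50, 41, 57, 65, 54, 43, 56, 44, 30, 46, 67, 53]

def sturges_classes(n):
    """Number of classes by Sturges' rule, rounded up to a whole number."""
    return math.ceil(1 + 3.322 * math.log10(n))

def frequency_distribution(data, start, h):
    """Tally integer data into inclusive classes of size h (assumed layout)."""
    rows = []
    lower = start
    while lower <= max(data):
        upper = lower + h - 1                      # inclusive upper class limit
        f = sum(1 for x in data if lower <= x <= upper)
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),  # half the gap of 1
            "mark": (lower + upper) / 2,
            "f": f,
        })
        lower += h
    return rows

data_range = max(marks) - min(marks)   # 77 - 12
k = sturges_classes(len(marks))        # number of classes
h = math.ceil(data_range / k)          # next higher whole number
table = frequency_distribution(marks, start=10, h=h)

for row in table:
    print(row["limits"], row["boundaries"], row["mark"], row["f"])
```

The total of the frequency column can be compared with the number of observations as a quick check on the tally.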
Note: For finding the class boundaries, we take half of the difference between the lower class limit of the 2nd class and the upper class limit of the 1st class: (20 - 19)/2 = 0.5. This value is subtracted from each lower class limit and added to each upper class limit to get the required class boundaries.
Frequency Distribution of Discrete Data: Discrete data are generated by counting; each and every observation is exact. When an observation is repeated, it is counted; the number of times the observation is repeated is called the frequency of that observation. The class limits in discrete data are true class limits; there are no class boundaries in discrete data.
More than cumulative frequency distribution: It is obtained by finding the cumulative total of frequencies starting from the highest class and moving to the lowest class. The less than cumulative frequency distribution and more than cumulative frequency distribution for the frequency distribution given below are:
Diagrammatic & Graphical Representation of Statistical Data:
In the previous section we discussed the techniques of classification and tabulation that help us organize the collected data in a meaningful fashion. However, this way of presenting statistical data does not always prove interesting to a layman. Too many figures are often confusing and fail to convey the message effectively. One of the most effective and interesting alternative ways in which statistical data may be presented is through diagrams and graphs. There are several ways in which statistical data may be displayed pictorially, such as different types of graphs and diagrams. The commonly used diagrams and graphs to be discussed in subsequent paragraphs are given as under:
Types of Diagrams/Charts:
1. Simple Bar Chart
2. Multiple Bar Chart or Cluster Chart
3. Stacked Bar Chart or Sub-Divided Bar Chart or Component Bar Chart
   Simple Component Bar Chart
   Percentage Component Bar Chart
   Sub-Divided Rectangular Bar Chart
4. Deviation Bar Diagram
5. Pie Chart
Types of Graphs:
1. Histogram
2. Frequency Curve and Polygon
3. Ogive (cumulative frequency curves)
Simple Bar Chart: A simple bar chart is used to represent data involving only one variable classified on a spatial, quantitative or temporal basis. In a simple bar chart, we make bars of equal width but variable length, i.e. the magnitude of a quantity is represented by the height or length of the bars. The following steps are undertaken in drawing a simple bar diagram:

Draw two perpendicular lines, one horizontal and the other vertical, at an appropriate place on the paper. Take the basis of classification along the horizontal line (X-axis) and the observed variable along the vertical line (Y-axis), or vice versa. Mark bases of equal breadth for each class and leave a gap of equal breadth, or not less than half the breadth, between two classes. Finally, mark the values of the given variable to prepare the required bars.
Example: Draw a simple bar diagram to represent the profits of a bank for 5 years.
Years    Profit (million $)
1989     10
1990     12
1991     18
1992     25
1993     42
Simple bar chart showing the profit of a bank for 5 years.
Multiple Bar Chart: In a multiple bar diagram two or more sets of inter-related data are represented (a multiple bar diagram facilitates comparison between more than one phenomenon). The technique of the simple bar chart is used to draw this diagram, but the difference is that we use different shades, colors, or dots to distinguish between the different phenomena. We draw multiple bar charts when the total of the different phenomena is not meaningful.
Example: Draw a multiple bar chart to represent the imports and exports of a country (values in million $) for the years 1991 to 1995.
Years      1991    1992    1993    1994     1995
Imports    7930    8850    9780    11720    12150
Exports    4260    5225    6150    7340     8145
Multiple bar chart showing the imports and exports of a country from 1991 to 1995.
Component Bar Chart: A sub-divided or component bar chart is used to represent data in which the total magnitude is divided into different parts or components. In this diagram, we first make simple bars for each class, taking the total magnitude in that class, and then divide these simple bars into parts in the ratio of the various components. This type of diagram shows the variation in different components within each class as well as between different classes. A sub-divided bar diagram is also known as a component bar chart or stacked chart.
Example: The table below shows the quantity in hundred kgs of Wheat, Barley and Oats produced on a certain farm during the years 1991 to 1994.
Years     1991    1992    1993    1994
Wheat     34      43      43      45
Barley    18      14      16      13
Oats      27      24      27      34
Construct a component bar chart to illustrate this data.
Solution: To make the component bar chart, first of all we have to take the year-wise total production.
Years     1991    1992    1993    1994
Wheat     34      43      43      45
Barley    18      14      16      13
Oats      27      24      27      34
Total     79      81      86      92
The required diagram is given below:
Percentage Component Bar Chart: A sub-divided bar chart may be drawn on a percentage basis. To draw a sub-divided bar chart on a percentage basis, we express each component as a percentage of its respective total. In drawing a percentage bar chart, bars of length equal to 100 are drawn for each class in the first step and sub-divided in proportion to the percentages of their components in the second step. The diagram so obtained is called a percentage component bar chart or percentage stacked bar chart. This type of chart is useful for comparing components while holding the difference in totals constant.
Example: The table given below shows the quantity in hundred kgs of Wheat, Barley and Oats produced on a certain farm during the years 1991 to 1994.
Years 1991 1992 1993 1994
Wheat 34 43 43 45
Barley 18 14 16 13
Oats 27 24 27 34
Construct a percentage component bar chart to illustrate this data.
Solution: The necessary computations for the construction of the percentage bar chart are given below:
Item      1991            1992            1993            1994
          %      Cum %    %      Cum %    %      Cum %    %      Cum %
Wheat     43.0   43.0     53.1   53.1     50.0   50.0     48.9   48.9
Barley    22.8   65.8     17.3   70.4     18.6   68.6     14.1   63.0
Oats      34.2   100      29.6   100      31.4   100      37.0   100
Total     100             100             100             100
% indicates the percentage of each item. Cum % indicates the cumulative percentage.
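The percentage and cumulative percentage columns can be reproduced with a few lines of Python. This is a minimal sketch; the function name and the dictionary layout are illustrative assumptions, and values are rounded to one decimal place as in the table.

```python
# Production in hundred kgs, taken from the example table.
production = {
    1991: {"Wheat": 34, "Barley": 18, "Oats": 27},
    1992: {"Wheat": 43, "Barley": 14, "Oats": 24},
    1993: {"Wheat": 43, "Barley": 16, "Oats": 27},
    1994: {"Wheat": 45, "Barley": 13, "Oats": 34},
}

def percentage_components(components):
    """Return (item, %, cumulative %) rows for one bar, rounded to 1 decimal."""
    total = sum(components.values())
    rows = []
    cumulative = 0.0
    for name, value in components.items():
        pct = 100 * value / total
        cumulative += pct
        rows.append((name, round(pct, 1), round(cumulative, 1)))
    return rows

for year, components in production.items():
    print(year, percentage_components(components))
```

Each bar of the percentage chart is then divided at the cumulative percentages of its year.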
Deviation Bar Diagram: Deviation bar diagrams are used for representing net quantities (excess or deficit), i.e. net profit, net loss, net exports or imports etc. Such bars can have both positive and negative values. Positive values are shown above the base line and negative values below it.
Pie Chart: A pie chart can be used to compare the relation between the whole and its components. A pie chart is a circular diagram in which the areas of the sectors of a circle are used. Circles are drawn with radii proportional to the square root of the quantities, because the area of a circle is \(\pi r^2\). To construct a pie chart (sector diagram), we draw a circle with radius proportional to the square root of the total. The total angle of the circle is 360°. The angle of each component is calculated by the formula:
\[ \text{Angle of Sector} = \frac{\text{Component Part}}{\text{Total}} \times 360^\circ \]
These angles are marked in the circle by means of a protractor to show the different components. The arrangement of the sectors is usually anti-clockwise.
Example: The following table gives the details of the monthly budget of a family. Represent these figures by a suitable diagram.
Item of Expenditure    Family Budget
Food                   $600
Clothing               $100
House Rent             $400
Fuel and Lighting      $100
Miscellaneous          $300
Total                  $1500
Solution: The necessary computations are given below:
\[ \text{Angle of Sector} = \frac{\text{Component Part}}{\text{Total}} \times 360^\circ \]
Items                Expenditure ($)    Angle of Sector    Cumulative Angle
Food                 600                144                144
Clothing             100                24                 168
House Rent           400                96                 264
Fuel and Lighting    100                24                 288
Miscellaneous        300                72                 360
Total                1500               360
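As a cross-check, the sector angles for the family budget can be computed directly. This is a small sketch in which `sector_angles` is an illustrative name, not part of the original notes.

```python
from itertools import accumulate

# Monthly family budget in dollars, from the example.
budget = {
    "Food": 600,
    "Clothing": 100,
    "House Rent": 400,
    "Fuel and Lighting": 100,
    "Miscellaneous": 300,
}

def sector_angles(items):
    """Angle of each sector = component / total * 360 degrees."""
    total = sum(items.values())
    return {name: 360 * value / total for name, value in items.items()}

angles = sector_angles(budget)
cumulative = list(accumulate(angles.values()))  # where each sector ends

for (name, angle), cum in zip(angles.items(), cumulative):
    print(f"{name:18s} {angle:6.1f} {cum:6.1f}")
```

The cumulative angles are the positions marked with the protractor when the sectors are drawn one after another.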
Histogram: Histogram is a two dimensional frequency diagram. The histograms are diagrams which represent the class interval and the frequency in the form of a rectangle. There will be as many adjoining rectangles as there are class intervals. Properties of Histogram: (1) In a histogram, class intervals are shown on X-axis and frequencies on Y-axis. (2) The scales for both the axes need not be the same. (3) Class intervals must be exclusive. If the intervals are in inclusive form, we have to convert them to the exclusive form. (4) In a histogram, there are rectangles with class intervals as bases and the corresponding frequencies as heights. (5) The class limits are marked on the horizontal axis and the frequency is marked on the vertical axis. Thus, a rectangle is constructed on each class interval. (6) If the intervals are equal, then the height of each rectangle is proportional to the corresponding class frequency. (7) If the intervals are unequal, then the area of each rectangle is proportional to the corresponding class frequency.
Example:
Class Interval    Frequency
0-5               4
5-10              10
10-15             18
15-20             8
20-25             6
Frequency Polygon: A frequency polygon is a line graph drawn by joining the mid-points of the tops of the adjoining rectangles in a histogram. The mid-points of the first and the last classes are joined to the midpoints of the classes preceding and succeeding respectively at zero frequency to complete the polygon.
Example:
Class (Height in Centimetres)    Frequency
150-160                          4
160-170                          3
170-180                          8
180-190                          6
190-200                          1
Ogives (Cumulative Frequency Curves): The curves discussed earlier share the property that the frequency of each class is shown against that class. Sometimes, however, we need to cumulate the frequencies; curves based on cumulative frequencies are known as cumulative frequency curves or ogives. A series can be cumulated in two ways. In the first way, the frequencies of all the preceding class intervals are added to the frequency of a class; in the second way, the frequencies of all the succeeding classes are added to the frequency of a class. The frequencies added in the first way show the frequency of the classes less than a particular value, and the frequencies added in the second way show the frequency of the classes more than a particular value. Put simply, the first method gives a series cumulated upwards and the second a series cumulated downwards.
The method of drawing a frequency curve and a cumulative frequency curve is almost the same, with one difference: in a simple frequency curve the frequency is plotted against the midpoint of the class interval, whereas in a cumulative frequency curve it is plotted at the upper or lower limit of the class, depending on the manner in which the series has been cumulated. It is plotted against the upper limit of the class interval when the cumulation has been done upwards (less than series); if the cumulation has been done downwards, i.e. the more than series has been formed, the frequencies are plotted against the lower limit of the class intervals.
Age Interval (yrs)    Frequency    Cumulative Frequency
15-19                 13           13
20-24                 15           28
25-29                 20           48
30-34                 10           58
35-39                 8            66
40-44                 4            70
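Both cumulative series for the age data can be sketched in Python as follows. The variable names are illustrative, and the plotting positions assume class boundaries half a unit beyond the stated inclusive limits (for example 14.5 and 19.5 for the class 15-19); that boundary convention is an assumption of this sketch.

```python
from itertools import accumulate

limits = [(15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44)]
freq = [13, 15, 20, 10, 8, 4]

# "Less than" series: cumulate upwards from the first class.
less_than = list(accumulate(freq))

# "More than" series: cumulate downwards from the last class.
more_than = list(accumulate(reversed(freq)))[::-1]

# Points for the two ogives (assumed boundaries = limits -/+ 0.5).
less_than_points = [(upper + 0.5, cf) for (_, upper), cf in zip(limits, less_than)]
more_than_points = [(lower - 0.5, cf) for (lower, _), cf in zip(limits, more_than)]

print(less_than)
print(more_than)
```

The less than points are plotted at the upper end of each class and the more than points at the lower end, in line with the rule stated above.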
Measures of Central Tendency
Meaning of Central Tendency: Condensation of data is very much required in statistical analysis, because a large number of figures are confusing and very difficult to handle. To reduce the complexity of data, it is essential that the number of figures be reduced to one figure so that sets of figures can be compared easily. Such single figures are called measures of central tendency or averages. According to Simpson & Kafka: A measure of central tendency is a typical value around which other figures congregate.
Average: A single value which can represent the whole set of data is called an average. Since an average tends to lie at, or indicate, the center of the distribution, it is called a measure of central tendency; since averages also locate the general position of the data, they are also called measures of location.
Desirable Qualities of a Good Average: An average that possesses all or most of the following qualities (characteristics) is considered a good average:
(1) It should be easy to calculate and simple to understand.
(2) It should be clearly defined by a mathematical formula.
(3) It should not be affected by extreme values.
(4) It should be based on all the observations.
(5) It should be capable of further mathematical treatment.
(6) It should have sampling stability.
Types of Averages: Mathematical averages are: (1) Arithmetic Mean (2) Geometric Mean (3) Harmonic Mean. Positional averages are: (4) Median (5) Mode.
Arithmetic Mean: It is the most commonly used average or measure of central tendency, applicable only in the case of quantitative data. The arithmetic mean is also simply called the mean. It is defined as follows: the arithmetic mean is the quotient of the sum of the given values and the number of the given values.
The arithmetic mean can be computed for both ungrouped data (raw data: data without any statistical treatment) and grouped data (data arranged in tabular form containing different groups). If x is the involved variable, then the arithmetic mean of x is abbreviated as A.M. of x and denoted by \(\bar{x}\). The arithmetic mean of x can be computed by any of the following methods.
Method Name                     Ungrouped Data                                   Grouped Data
Direct Method                   \(\bar{x} = \frac{\sum x}{n}\)                   \(\bar{x} = \frac{\sum fx}{\sum f}\)
Indirect or Short-Cut Method    \(\bar{x} = A + \frac{\sum D}{n}\)               \(\bar{x} = A + \frac{\sum fD}{\sum f}\)
Method of Step-Deviation        \(\bar{x} = A + \frac{\sum u}{n} \times c\)      \(\bar{x} = A + \frac{\sum fu}{\sum f} \times h\)
Where
x : Indicates values of the variable X.
n : Indicates number of values of X.
f : Indicates frequency of different groups.
A : Indicates assumed mean.
D : Indicates deviation from A, i.e. D = (x - A).
u : Step-deviation, u = (x - A)/c for ungrouped data or u = (x - A)/h for grouped data.
c : Indicates common divisor.
h : Indicates size of class or class interval in case of grouped data.
\(\sum\) : Summation or addition.
Example (1): The one-way train fares of five selected BS students are recorded as follows ($): 10, 5, 15, 8 and 12. Calculate the arithmetic mean of these data.
Solution: Let the train fare be indicated by x, then
x ($): 10, 5, 15, 8, 12
\(\sum x = 50\)
To find the arithmetic mean of x, we use the direct method formula \(\bar{x} = \frac{\sum x}{n}\). From the given data, we have \(\sum x = 50\) and n = 5. Placing these two quantities in the formula, we get the arithmetic mean for the given data:
\[ \bar{x} = \frac{\sum x}{n} = \frac{50}{5} = 10 \]
Weighted Arithmetic Mean: In the calculation of the arithmetic mean, the importance of all the items was considered to be equal. However, there may be situations in which the items under consideration are not of equal importance. For example, suppose we want to find the average marks per subject of a student who appeared in different subjects such as Mathematics, Statistics, Physics and Biology. These subjects do not have equal importance. The arithmetic mean computed by considering the relative importance of each item is called the weighted arithmetic mean. To give due importance to each item under consideration, we assign a number called a weight to each item in proportion to its relative importance. The weighted arithmetic mean is computed by using the following formula:
\[ \bar{x}_w = \frac{\sum wx}{\sum w} \]
Where: \(\bar{x}_w\) stands for the weighted arithmetic mean, x stands for the values of the items and w stands for the weight of the item.
Example: A student obtained 40, 50, 60, 80, and 45 marks in the subjects of Math, Statistics, Physics, Chemistry and Biology respectively. Assuming weights 5, 2, 4, 3, and 1 respectively for the above-mentioned subjects, find the weighted arithmetic mean per subject.
Solution:
Subjects      Marks Obtained (x)    Weight (w)    wx
Math          40                    5             200
Statistics    50                    2             100
Physics       60                    4             240
Chemistry     80                    3             240
Biology       45                    1             45
Total                               Σw = 15       Σwx = 825
Now we will find the weighted arithmetic mean as:
\[ \bar{x}_w = \frac{\sum wx}{\sum w} = \frac{825}{15} = 55 \text{ marks/subject.} \]
Merits and Demerits of Arithmetic Mean: Merits:
It is rigidly defined. It is easy to calculate and simple to follow. It is based on all the observations. It is determined for almost every kind of data. It is finite and not indefinite. It is readily put to algebraic treatment. It is least affected by fluctuations of sampling.
Demerits:

The arithmetic mean is highly affected by extreme values. It cannot average the ratios and percentages properly. It is not an appropriate average for highly skewed distributions. It cannot be computed accurately if any item is missing. The mean sometimes does not coincide with any of the observed value.
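The direct, short-cut and step-deviation methods for ungrouped data, and the weighted arithmetic mean, can be checked numerically with the sketch below. The function names are illustrative, and the assumed mean A and common divisor c are arbitrary choices made for this illustration; any values give the same mean.

```python
import math

fares = [10, 5, 15, 8, 12]            # train fares from Example (1)

def mean_direct(xs):
    """Direct method: sum of values divided by number of values."""
    return sum(xs) / len(xs)

def mean_shortcut(xs, A):
    """Short-cut method: A plus the mean of deviations D = x - A."""
    return A + sum(x - A for x in xs) / len(xs)

def mean_step_deviation(xs, A, c):
    """Step-deviation method: A plus c times the mean of u = (x - A) / c."""
    return A + c * sum((x - A) / c for x in xs) / len(xs)

def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum of w*x divided by sum of w."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

print(mean_direct(fares))
print(mean_shortcut(fares, A=8))               # A = 8 is an arbitrary assumed mean
print(mean_step_deviation(fares, A=10, c=5))   # c = 5 is an arbitrary divisor

marks = [40, 50, 60, 80, 45]
weights = [5, 2, 4, 3, 1]
print(weighted_mean(marks, weights))
```

All three methods are algebraically identical, so they differ only in how much hand arithmetic they require.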
Mathematical Properties of Mean:
1. The sum of the deviations of the items from the mean (taking plus and minus signs) is always equal to zero.
2. The sum of squared deviations of the items from the mean is a minimum, i.e. less than the sum of squared deviations of the items from any other value.
3. If we have the arithmetic means and the numbers of observations of two or more related groups, we can compute the combined average of these groups by applying the following formula:
\[ \bar{X}_{12} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2}{N_1 + N_2} \]
Where
\(\bar{X}_{12}\) = Combined mean of two groups
\(N_1\) = No. of observations in the first group
\(N_2\) = No. of observations in the second group
\(\bar{X}_1\) = Arithmetic mean of first group
\(\bar{X}_2\) = Arithmetic mean of second group
Geometric Mean: It is another measure of central tendency based on a mathematical footing, like the arithmetic mean. The geometric mean can be defined in the following terms: the geometric mean is the nth positive root of the product of n positive given values. Hence, the geometric mean for a variable X containing n values such as \(x_1, x_2, x_3, \ldots, x_n\) is denoted by G.M. of X and given as under:
\[ G.M. = \sqrt[n]{x_1 \times x_2 \times x_3 \times \cdots \times x_n} \quad \text{(For Ungrouped Data)} \]
If we have a series of n positive values with repeated values, such that \(x_1, x_2, x_3, \ldots, x_k\) are repeated \(f_1, f_2, f_3, \ldots, f_k\) times respectively, then the geometric mean becomes:
\[ G.M. = \sqrt[n]{x_1^{f_1} \times x_2^{f_2} \times x_3^{f_3} \times \cdots \times x_k^{f_k}} \quad \text{(For Grouped Data)} \]
Where \(n = f_1 + f_2 + f_3 + \cdots + f_k = \sum f\).
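Example: Find the Geometric Mean of the values 10, 5, 15, 8, 12
Solution: Here \(x_1 = 10, x_2 = 5, x_3 = 15, x_4 = 8, x_5 = 12\) and n = 5
\[ G.M. = \sqrt[5]{10 \times 5 \times 15 \times 8 \times 12} = \sqrt[5]{72000} \approx 9.36 \]
Example: Find the Geometric Mean of the following data
X    13    14    15    16    17
f    2     5     13    7     3
Solution: We may write it as given below:
Here \(x_1 = 13, x_2 = 14, x_3 = 15, x_4 = 16, x_5 = 17\)
\(f_1 = 2, f_2 = 5, f_3 = 13, f_4 = 7, f_5 = 3\)
\(n = \sum f = f_1 + f_2 + f_3 + f_4 + f_5 = 2 + 5 + 13 + 7 + 3 = 30\)
Using the formula of the geometric mean for grouped data, the geometric mean in this case becomes:
\[ G.M. = \sqrt[30]{13^{2} \times 14^{5} \times 15^{13} \times 16^{7} \times 17^{3}} \approx 15.10 \]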
The method explained above for the calculation of the geometric mean is useful when the number of values in the given data is small and an electronic calculator is available. When a set of data contains a large number of values, we need an alternative way of computing the geometric mean. The modified or alternative way of computing the geometric mean is given as under:
\[ G.M. = \operatorname{antilog}\left(\frac{\sum \log x}{n}\right) \quad \text{(For Ungrouped Data)} \]
\[ G.M. = \operatorname{antilog}\left(\frac{\sum f \log x}{\sum f}\right) \quad \text{(For Grouped Data)} \]
Example: Find the Geometric Mean of the values 10, 5, 15, 8, 12
x        log x
10       1.0000
5        0.6990
15       1.1761
8        0.9031
12       1.0792
Total    Σ log x = 4.8573
\[ G.M. = \operatorname{antilog}\left(\frac{4.8573}{5}\right) = \operatorname{antilog}(0.9715) \approx 9.36 \]
Example: Find the Geometric Mean for the following distribution of students' marks:
Marks              0-30    30-50    50-80    80-100
No. of Students    20      30       40       10
Solution:
Marks     No. of Students (f)    Mid Points (x)    f log x
0-30      20                     15                20 log 15 = 23.5218
30-50     30                     40                30 log 40 = 48.0618
50-80     40                     65                40 log 65 = 72.5165
80-100    10                     90                10 log 90 = 19.5424
Total     Σf = 100                                 Σ f log x = 163.6425
\[ G.M. = \operatorname{antilog}\left(\frac{163.6425}{100}\right) = \operatorname{antilog}(1.6364) \approx 43.29 \]
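The logarithmic method can be sketched in Python using base-10 logarithms, mirroring the hand calculation. The function names are illustrative and not part of the original notes.

```python
import math

def geometric_mean(xs):
    """G.M. = antilog(sum(log x) / n), using base-10 logs."""
    return 10 ** (sum(math.log10(x) for x in xs) / len(xs))

def geometric_mean_grouped(xs, fs):
    """G.M. = antilog(sum(f * log x) / sum(f)) for values xs with frequencies fs."""
    total = sum(f * math.log10(x) for x, f in zip(xs, fs))
    return 10 ** (total / sum(fs))

gm_values = geometric_mean([10, 5, 15, 8, 12])
gm_ages = geometric_mean_grouped([13, 14, 15, 16, 17], [2, 5, 13, 7, 3])
gm_marks = geometric_mean_grouped([15, 40, 65, 90], [20, 30, 40, 10])  # mid points

print(round(gm_values, 2), round(gm_ages, 2), round(gm_marks, 2))
```

For the marks distribution the class mid points stand in for the individual values, so the result is an approximation to the geometric mean of the raw marks.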
Combined Geometric Mean: Just like the combined arithmetic mean, we can also calculate the combined geometric mean by using the following formula:
\[ \log G = \frac{N_1 \log G_1 + N_2 \log G_2}{N_1 + N_2} \]
Where \(G_1\) and \(G_2\) are the respective geometric means of the two groups and \(N_1\) and \(N_2\) are the numbers of observations in each group.
Properties of Geometric Mean: The main properties of the geometric mean are:

The geometric mean is less than the arithmetic mean, G.M. < A.M., when the values are not all equal. The product of the items remains unchanged if each item is replaced by the geometric mean. The geometric mean of the ratios of corresponding observations in two series is equal to the ratio of their geometric means. The geometric mean of the products of corresponding items in two series is equal to the product of their geometric means.
Applications of Geometric mean: 1. The G.M. is used to find the average percent increase in sales, production, population, or other economic or business related data. 2. G.M. is considered to be the best average for construction of index numbers. 3. It is an average that is most suitable in case when large weights are assigned to the small values and small weights are assigned to large value of observations.
Merits and Demerits of Geometric Mean: Merits:

It is rigidly defined and its value is a precise figure. It is based on all observations. It is capable of further algebraic treatment. It is not much affected by fluctuation of sampling. It is not affected by extreme values.
Demerits:

It cannot be calculated if any of the observation is zero or negative. Its calculation is rather difficult. It is not easy to understand. It may not coincide with any of the observations.
Harmonic Mean
Harmonic mean is another measure of central tendency, also based on a mathematical footing like the arithmetic mean and geometric mean. Like the arithmetic mean and geometric mean, the harmonic mean is also useful for quantitative data. The harmonic mean is defined in the following terms: the harmonic mean is the quotient of the number of the given values and the sum of the reciprocals of the given values. In mathematical terms it is defined as follows:
\[ H.M. = \frac{n}{\sum \left(\frac{1}{x}\right)} \quad \text{(For Ungrouped Data)} \]
\[ H.M. = \frac{\sum f}{\sum \left(\frac{f}{x}\right)} \quad \text{(For Grouped Data)} \]
Example: Calculate the harmonic mean of the numbers: 13.2, 14.2, 14.8, 15.2 and 16.1
Solution: The harmonic mean is calculated as below:
x        1/x
13.2     0.0758
14.2     0.0704
14.8     0.0676
15.2     0.0658
16.1     0.0621
Total    Σ(1/x) = 0.3417
\[ H.M. = \frac{n}{\sum \left(\frac{1}{x}\right)} = \frac{5}{0.3417} \approx 14.63 \]
Example: Given the following frequency distribution of ages of first year students of a particular college, calculate the Harmonic Mean.
Age (Years)           13    14    15    16    17
Number of Students    2     5     13    7     3
Solution: The given distribution is grouped data; the variable involved is the age of first year students, while the numbers of students represent the frequencies.
Ages (Years) x    Number of Students f    f/x
13                2                       0.1538
14                5                       0.3571
15                13                      0.8667
16                7                       0.4375
17                3                       0.1765
Total             Σf = 30                 Σ(f/x) = 1.9916
Now we will find the Harmonic Mean as
\[ H.M. = \frac{\sum f}{\sum \left(\frac{f}{x}\right)} = \frac{30}{1.9916} \approx 15.06 \text{ years.} \]
Merits and Demerits of Harmonic Mean: Merits:

It is based on all observations. It is not much affected by the fluctuation of sampling. It is capable of algebraic treatment. It is an appropriate average for averaging ratios and rates. It does not give much weight to the large items.
Demerits:

Its calculation is difficult. It gives high weight-age to the small items. It cannot be calculated if any one of the items is zero. It is usually a value which does not exist in the given data.
Relationship among the A.M., G.M. and H.M.: In any distribution when the observations differ in size, the value of A.M., G.M. and H.M., will be in the following order
A.M.> G.M. >H.M.
If all the items in the distribution have same value then the relationship will be
A.M. = G.M. = H.M.
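The harmonic mean examples and the ordering of the three averages can be verified with a short sketch. The function names are illustrative, and the train-fare values 10, 5, 15, 8, 12 are reused from the arithmetic mean example purely as test data.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def harmonic_mean(xs):
    """H.M. = n / sum(1/x)."""
    return len(xs) / sum(1 / x for x in xs)

def harmonic_mean_grouped(xs, fs):
    """H.M. = sum(f) / sum(f/x)."""
    return sum(fs) / sum(f / x for x, f in zip(xs, fs))

hm_values = harmonic_mean([13.2, 14.2, 14.8, 15.2, 16.1])
hm_ages = harmonic_mean_grouped([13, 14, 15, 16, 17], [2, 5, 13, 7, 3])
print(round(hm_values, 2), round(hm_ages, 2))

# Ordering of the three averages when the observations differ in size.
data = [10, 5, 15, 8, 12]
am, gm, hm = arithmetic_mean(data), geometric_mean(data), harmonic_mean(data)
print(am > gm > hm)

# When every item has the same value, the three averages coincide.
same = [7, 7, 7, 7]
print(arithmetic_mean(same), geometric_mean(same), harmonic_mean(same))
```

Small differences from the hand-computed values can arise because the tables round each reciprocal to four decimal places.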
Mode
Mode is the value which occurs the greatest number of times in the data. When each value occurs the same number of times in the data, there is no mode. If two or more values occur the same greatest number of times, then there are two or more modes and the distribution is said to be multi-modal. If the data have only one mode, the distribution is said to be uni-modal, and if the data have two modes, the distribution is said to be bi-modal.
Mode from Ungrouped Data: Mode is calculated from ungrouped data by inspecting the given data. We pick out the value which occurs the greatest number of times in the data.
Mode from Grouped Data: In a frequency distribution with equal class interval sizes, the class which has the maximum frequency is called the modal class. The mode is then given by
\[ \text{Mode} = l + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h \]
Or
\[ \text{Mode} = l + \frac{f_m - f_1}{2 f_m - f_1 - f_2} \times h \]
Where
l = Lower class boundary of the modal class
\(f_m\) = Frequency of the modal class (maximum frequency)
\(f_1\) = Frequency preceding the modal class frequency
\(f_2\) = Frequency following the modal class frequency
h = Class interval size of the modal class
Mode from Discrete Data: When the data follow a discrete set of values, the mode may be found by inspection. Mode is the value of X corresponding to the maximum frequency.
Example: Find the mode of the values 5, 7, 2, 9, 7, 10, 8, 5, 7
Solution: Mode is 7 because it occurs the greatest number of times in the data.
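Example: The weights of 50 college students are given in the following table. Find the mode of the distribution.
Weights (Kg)      60-64    65-69    70-74    75-79    80-84
No of Students    5        9        16       12       8
Solution:
Weights (Kg)    No of Students (f)    Class Boundary
60-64           5                     59.5-64.5
65-69           9                     64.5-69.5
70-74           16                    69.5-74.5
75-79           12                    74.5-79.5
80-84           8                     79.5-84.5
The maximum frequency is 16, so the modal class is 70-74, with l = 69.5, \(f_m = 16\), \(f_1 = 9\), \(f_2 = 12\) and h = 5.
\[ \text{Mode} = l + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h = 69.5 + \frac{16 - 9}{(16 - 9) + (16 - 12)} \times 5 = 69.5 + 3.18 = 72.68 \text{ kg} \]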
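The mode of the weights distribution can be computed with the sketch below. It assumes the usual interpolation formula Mode = l + (fm - f1) / ((fm - f1) + (fm - f2)) × h with the symbols as defined above; the function name `grouped_mode` is illustrative, and a missing neighbouring class is treated as having frequency 0.

```python
def grouped_mode(boundaries, freqs):
    """Mode of a grouped distribution with equal class widths.

    boundaries: list of (lower, upper) class boundaries
    freqs:      list of class frequencies
    """
    i = max(range(len(freqs)), key=freqs.__getitem__)   # index of modal class
    lower, upper = boundaries[i]
    h = upper - lower                                   # class interval size
    fm = freqs[i]
    f1 = freqs[i - 1] if i > 0 else 0                   # preceding frequency
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0      # following frequency
    return lower + (fm - f1) / ((fm - f1) + (fm - f2)) * h

boundaries = [(59.5, 64.5), (64.5, 69.5), (69.5, 74.5), (74.5, 79.5), (79.5, 84.5)]
freqs = [5, 9, 16, 12, 8]

print(round(grouped_mode(boundaries, freqs), 2))
```

If two classes share the maximum frequency, this sketch simply uses the first of them; the notes treat such data as having more than one mode.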
Example: The following frequency distribution shows the number of children in each family in a locality. Find the mode.
No of Children    0    1     2     3     4     5     6
No of Families    6    30    42    55    25    18    5
Solution: The data follow a discrete set of values, so Mode = 3 (corresponding to the maximum frequency, 55).
Graphic Location of Mode: Mode, being a positional average, can be located graphically by the following process.
(1) A histogram of the frequency distribution is drawn.
(2) In the histogram, the highest rectangle represents the modal class.
(3) The top left corner of the highest rectangle is joined with the top left corner of the following rectangle, and the top right corner of the highest rectangle is joined with the top right corner of the preceding rectangle.
(4) From the point of intersection of the two lines a perpendicular is drawn to the X-axis. The point where it meets the X-axis is the required value of the mode.
Important Notes: 1. If there is a series with unequal class intervals, then the series must be regrouped to make the class interval equal in magnitude. 2. In case of inclusive class interval, the formula for calculation of Mode will be the same but the class intervals have to be converted in exclusive type. Advantages and Disadvantages of Mode: Advantages:

It is easy to understand and simple to calculate. It is not affected by extreme large or small values. It can be located only by inspection in ungrouped data and discrete frequency distribution. It can be useful for qualitative data. It can be computed in open-end frequency table. It can be located graphically.
Disadvantages:

It is not well defined. It is not based on all the values. It is stable for a large number of values, but it will not be well defined if the data consist of a small number of values. It is not capable of further mathematical treatment. Sometimes the data have more than one mode, and sometimes the data have no mode at all.
Median
Median is the middle value of the arrayed data. When the data are arranged in order, the median is the middle value if the number of values is odd, and the mean of the two middle values if the number of values is even. A value which divides the arrayed set of data into two equal parts is called the median: the number of values greater than the median is equal to the number of values smaller than the median. It is also known as a positional average.
Median from Ungrouped Data (Individual & Discrete Series):

Median = Value of $\left(\frac{n+1}{2}\right)$th item

Example: Find the median of the values 4, 1, 8, 13, 11
Solution: Arrange the data 1, 4, 8, 11, 13
Median = Value of $\left(\frac{5+1}{2}\right)$th item = Value of 3rd item
Median = 8

Example: Find the median of the values 5, 7, 10, 20, 16, 12
Solution: Arrange the data 5, 7, 10, 12, 16, 20
Median = Value of $\left(\frac{6+1}{2}\right)$th item = Value of 3.5th item
Median = $\frac{10 + 12}{2} = 11$
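The position rule above can be sketched in Python as follows. This is only an illustration: the function name `median_by_position` is a hypothetical helper, and the code assumes a non-empty list of numbers.

```python
def median_by_position(values):
    """Median as the value of the ((n + 1) / 2)th item of the sorted data."""
    data = sorted(values)
    n = len(data)
    position = (n + 1) / 2            # 1-based position; ends in .5 when n is even
    lower = data[int(position) - 1]   # item at the integer part of the position
    if position.is_integer():
        return lower                  # odd n: a single middle item
    upper = data[int(position)]       # even n: average the two middle items
    return (lower + upper) / 2


odd_median = median_by_position([4, 1, 8, 13, 11])         # 3rd item
even_median = median_by_position([5, 7, 10, 20, 16, 12])   # 3.5th item
print(odd_median, even_median)
```

For everyday work the standard library function `statistics.median` gives the same values for these two data sets.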
Median from Grouped Data: To find the median for grouped data, we form the cumulative frequencies and then calculate the median number n/2. The median lies in the group (class) whose cumulative frequency is the first to contain the (n/2)th item. We use the following formula to find the median.

$\text{Median} = l + \frac{h}{f}\left(\frac{n}{2} - c\right)$

Where
l = Lower class boundary of the median class
f = Frequency of the median class
n = $\sum f$ = Number of values or total frequency
c = Cumulative frequency of the class preceding the median class
h = Class interval size of the median class

Example: Calculate the median from the following data.

Group        60 - 64   65 - 69   70 - 74   75 - 79   80 - 84   85 - 89
Frequency    1         5         9         12        7         2

Solution:

Group      f     Class Boundary    Cumulative Frequency
60 - 64    1     59.5 - 64.5       1
65 - 69    5     64.5 - 69.5       6
70 - 74    9     69.5 - 74.5       15
75 - 79    12    74.5 - 79.5       27
80 - 84    7     79.5 - 84.5       34
85 - 89    2     84.5 - 89.5       36

Median = Value of $\left(\frac{n}{2}\right)$th item = Value of $\left(\frac{36}{2}\right)$th item = 18th item

The 18th item lies in the class 74.5 - 79.5, so $l = 74.5$, $h = 5$, $f = 12$ and $c = 15$.

$\text{Median} = 74.5 + \frac{5}{12}(18 - 15) = 74.5 + 1.25 = 75.75$
Median from Discrete Data: When the data follow a discrete set of values grouped by size, we use the $\left(\frac{n+1}{2}\right)$th item to find the median. First we form a cumulative frequency distribution; the median is the value corresponding to the cumulative frequency in which the $\left(\frac{n+1}{2}\right)$th item lies.

Example: The following frequency distribution is classified according to the number of leaves on different branches. Calculate the median number of leaves per branch.

No. of Leaves      1    2     3     4     5     6     7
No. of Branches    2    11    15    20    25    18    10

Solution:

No. of Leaves (X)    No. of Branches (f)    Cumulative Frequency (C.F)
1                    2                      2
2                    11                     13
3                    15                     28
4                    20                     48
5                    25                     73
6                    18                     91
7                    10                     101
Total                101

Median = Size of $\left(\frac{n+1}{2}\right)$th item = Size of $\left(\frac{101+1}{2}\right)$th item = Size of 51st item
Median = 5, because the 51st item lies in the cumulative frequency 73, which corresponds to X = 5.
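A minimal sketch of the cumulative-frequency lookup described above, assuming the distinct values are listed in ascending order. The names `discrete_median` and `item_at` are hypothetical helpers introduced only for this illustration.

```python
from itertools import accumulate
from math import ceil, floor


def discrete_median(values, freqs):
    """Median of a discrete frequency distribution via cumulative frequencies."""
    cum_freqs = list(accumulate(freqs))
    n = cum_freqs[-1]
    position = (n + 1) / 2            # the ((n + 1) / 2)th item

    def item_at(k):
        # Value of the k-th item (1-based): first value whose C.F. reaches k.
        return next(v for v, cf in zip(values, cum_freqs) if cf >= k)

    # When the position ends in .5, average the two neighbouring items.
    return (item_at(floor(position)) + item_at(ceil(position))) / 2


leaves = [1, 2, 3, 4, 5, 6, 7]
branches = [2, 11, 15, 20, 25, 18, 10]
median_leaves = discrete_median(leaves, branches)
print(median_leaves)
```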
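Quartiles: There are three quartiles, called the first quartile, second quartile and third quartile. These quartiles divide the set of observations into four equal parts. The second quartile is equal to the median. The first quartile is also called the lower quartile and is denoted by Q1. The third quartile is also called the upper quartile and is denoted by Q3. The lower quartile Q1 is a point which has 25% of the observations below it and 75% of the observations above it. The upper quartile Q3 is a point with 75% of the observations below it and 25% of the observations above it.

Quartile for Individual Observations (Ungrouped Data):
$Q_1$ = Value of $\left(\frac{n+1}{4}\right)$th item
$Q_2$ = Value of $\left(\frac{2(n+1)}{4}\right)$th item
$Q_3$ = Value of $\left(\frac{3(n+1)}{4}\right)$th item

Quartile for a Frequency Distribution (Discrete Data):
The same positions are used, and each quartile is the value corresponding to the cumulative frequency in which the required item lies.

Quartile for Grouped Frequency Distribution:
$Q_1 = l + \frac{h}{f}\left(\frac{n}{4} - c\right)$, $Q_2 = l + \frac{h}{f}\left(\frac{2n}{4} - c\right)$, $Q_3 = l + \frac{h}{f}\left(\frac{3n}{4} - c\right)$
where l, h, f and c refer to the class containing the required quartile.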
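Example: Calculate the quartile deviation and coefficient of quartile deviation from the data given below:

Maximum Load (short-tons)    Number of Cables
9.3 - 9.7                    2
9.8 - 10.2                   5
10.3 - 10.7                  12
10.8 - 11.2                  17
11.3 - 11.7                  14
11.8 - 12.2                  6
12.3 - 12.7                  3
12.8 - 13.2                  1

Solution: The necessary calculations are given below:

Maximum Load (short-tons)    Number of Cables (f)    Class Boundaries    Cumulative Frequencies
9.3 - 9.7                    2                       9.25 - 9.75         2
9.8 - 10.2                   5                       9.75 - 10.25        2 + 5 = 7
10.3 - 10.7                  12                      10.25 - 10.75       7 + 12 = 19
10.8 - 11.2                  17                      10.75 - 11.25       19 + 17 = 36
11.3 - 11.7                  14                      11.25 - 11.75       36 + 14 = 50
11.8 - 12.2                  6                       11.75 - 12.25       50 + 6 = 56
12.3 - 12.7                  3                       12.25 - 12.75       56 + 3 = 59
12.8 - 13.2                  1                       12.75 - 13.25       59 + 1 = 60

$Q_1$ = Value of $\left(\frac{n}{4}\right)$th item = Value of $\left(\frac{60}{4}\right)$th item = 15th item
$Q_1$ lies in the class 10.25 - 10.75
$Q_1 = l + \frac{h}{f}\left(\frac{n}{4} - c\right)$
Where $l = 10.25$, $h = 0.5$, $f = 12$, $\frac{n}{4} = 15$ and $c = 7$
$Q_1 = 10.25 + \frac{0.5}{12}(15 - 7) = 10.25 + 0.33 = 10.58$

$Q_3$ = Value of $\left(\frac{3n}{4}\right)$th item = Value of $\left(\frac{3 \times 60}{4}\right)$th item = 45th item
$Q_3$ lies in the class 11.25 - 11.75
$Q_3 = l + \frac{h}{f}\left(\frac{3n}{4} - c\right)$
Where $l = 11.25$, $h = 0.5$, $f = 14$, $\frac{3n}{4} = 45$ and $c = 36$
$Q_3 = 11.25 + \frac{0.5}{14}(45 - 36) = 11.25 + 0.32 = 11.57$

Quartile Deviation (Q.D) $= \frac{Q_3 - Q_1}{2} = \frac{11.57 - 10.58}{2} = \frac{0.99}{2} \approx 0.49$ short-tons
Coefficient of Quartile Deviation $= \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{11.57 - 10.58}{11.57 + 10.58} = \frac{0.99}{22.15} \approx 0.045$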
Graphic Location of Median: Median and other partition values can be located from the graph of the cumulative frequency polygon (Ogive Polygon). Suppose we have a graph of the cumulative frequency polygon as shown in the figure below.

On the Y-axis, we mark the height equal to n/2, and from this point we draw a straight line parallel to the X-axis which intersects the polygon at the point m. From the point m, we draw a perpendicular which touches the X-axis at M. This point on the X-axis is the median. Similarly, for the lower quartile we take the height equal to n/4 on the Y-axis. From this point we draw a line parallel to the X-axis which intersects the polygon at the point q. From this point we draw a perpendicular to the X-axis which touches it at the point Q1, which is the first quartile. For the upper quartile, take the height on the Y-axis equal to 3n/4 and proceed in the same way.

Deciles and Percentiles:
Deciles: The deciles are the partition values which divide the set of observations into ten equal parts. There are nine deciles, namely D1, D2, D3, . . . , D9. The first decile D1 is a point which has 10% of the observations below it.

Deciles for Individual Observations (Ungrouped Data):
$D_k$ = Value of $\left(\frac{k(n+1)}{10}\right)$th item, $k = 1, 2, \ldots, 9$

Decile for a Frequency Distribution (Discrete Data):
The same positions are used, and each decile is the value corresponding to the cumulative frequency in which the required item lies.

Decile for Grouped Frequency Distribution:
$D_k = l + \frac{h}{f}\left(\frac{kn}{10} - c\right)$, $k = 1, 2, \ldots, 9$
where l, h, f and c refer to the class containing the required decile.
Percentiles: The percentiles are the points which divide the set of observations into one hundred equal parts. These points are denoted by P1, P2, P3, . . . , P99, and are called the first, second, third, . . . , ninety-ninth percentiles. The percentiles are calculated for a very large number of observations, such as workers in factories or the population in provinces or countries. The percentiles are usually calculated for grouped data. The first percentile, denoted by P1, is calculated as P1 = Value of the $\left(\frac{n}{100}\right)$th item. We find the group in which the $\left(\frac{n}{100}\right)$th item lies and then P1 is interpolated from the formula

$P_1 = l + \frac{h}{f}\left(\frac{n}{100} - c\right)$
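The median, quartiles, deciles and percentiles of grouped data all use the same interpolation, $l + \frac{h}{f}(\text{target} - c)$, where the target is n/2, n/4, n/10, n/100 and so on. A minimal Python sketch of that shared step follows. The function name `grouped_partition_value` and its arguments are assumptions made for this illustration, and the data are the grouped median example from this unit.

```python
def grouped_partition_value(boundaries, freqs, fraction):
    """Interpolate a partition value from a grouped frequency distribution.

    boundaries: (lower, upper) class boundaries in ascending order
    freqs:      class frequencies
    fraction:   0.5 for the median, 0.25 for Q1, 0.1 for D1, 0.01 for P1, ...
    """
    n = sum(freqs)
    target = fraction * n             # e.g. n/2 for the median
    cumulative = 0                    # c: cumulative frequency before the class
    for (lower, upper), f in zip(boundaries, freqs):
        if cumulative + f >= target:  # the target item lies in this class
            h = upper - lower         # class interval size
            return lower + (h / f) * (target - cumulative)
        cumulative += f
    raise ValueError("fraction must lie between 0 and 1")


classes = [(59.5, 64.5), (64.5, 69.5), (69.5, 74.5),
           (74.5, 79.5), (79.5, 84.5), (84.5, 89.5)]
freqs = [1, 5, 9, 12, 7, 2]
median = grouped_partition_value(classes, freqs, 0.5)
first_quartile = grouped_partition_value(classes, freqs, 0.25)
print(round(median, 2), round(first_quartile, 2))
```

Passing the fraction as a parameter keeps one function for every partition value instead of a separate formula per measure.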
Introduction to Measure of Dispersion: A modern student of statistics is mainly interested in the study of variability and uncertainty. In this section we shall discuss variability and its measures; uncertainty will be discussed under probability. We live in a changing world. Changes are taking place in every sphere of life. A statistician does not show much interest in things which are constant. The total area of the earth may not be very important to a research-minded person, but the area under different crops, the area covered by forests, and the area covered by residential and commercial buildings are figures of great importance because these figures keep changing from time to time and from place to place. A very large number of experts are engaged in the study of changing phenomena. Experts working in different countries of the world keep a watch on the forces which are responsible for bringing changes in the fields of human interest. Agricultural, industrial and mineral production, and their transportation from one part of the world to another, are matters of great interest to economists, statisticians and other experts. Changes in human population, in the standard of living, in the literacy rate and in prices attract the experts to make detailed studies of them and then correlate these changes with human life. Thus variability or variation is something connected with human life, and its study is very important for mankind.

Dispersion: The word dispersion has a technical meaning in statistics. The average measures the center of the data. It is one aspect of the observations. Another feature of the observations is how they are spread about the center. The observations may be close to the center or they may be spread away from the center. If the observations are close to the center (usually the arithmetic mean or median), we say that the dispersion, scatter or variation is small. If the observations are spread away from the center, we say the dispersion is large. Suppose we have three groups of students who have obtained the following marks in a test. The arithmetic means of the three groups are also given below:

Group A: 46, 48, 50, 52, 54    $\bar{X}_A = 50$
Group B: 30, 40, 50, 60, 70    $\bar{X}_B = 50$
Group C: 40, 50, 60, 70, 80    $\bar{X}_C = 60$

In groups A and B the arithmetic means are equal, i.e. $\bar{X}_A = \bar{X}_B = 50$. But in group A the observations are concentrated at the center. All students of group A have almost the same level of performance. We say that there is consistency in the observations in group A. In group B the mean is 50 but the observations are not close to the center. One observation is as small as 30 and one observation is as large as 70. Thus there is greater dispersion in group B. In group C the mean is 60, but the spread of the observations with respect to the center 60 is the same as the spread of the observations in group B with respect to their own center, which is 50. Thus in groups B and C the means are different but their dispersion is the same. In groups A and C the means are different and their dispersions are also different. Dispersion is an important feature of the observations and it is measured with the help of the measures of dispersion, scatter or variation. The word variability is also used for this idea of dispersion.

The study of dispersion is very important in statistical data. If in a certain factory there is consistency in the wages of workers, the workers will be satisfied. But if some workers have high wages and some have low wages, there will be unrest among the low paid workers and they might go on strike and arrange demonstrations. If in a certain country some people are very poor and some are very rich, we say there is economic disparity. It means that dispersion is large. The idea of dispersion is important in the study of wages of workers, prices of commodities, standard of living of different people, distribution of wealth, distribution of land among farmers and various other fields of life. Some brief definitions of dispersion are:
1. The degree to which numerical data tend to spread about an average value is called the dispersion or variation of the data.
2. Dispersion or variation may be defined as a statistic signifying the extent of the scatteredness of items around a measure of central tendency.
3. Dispersion or variation is the measurement of the scatter of the size of the items of a series about the average.

Measures of Dispersion:
For the study of dispersion, we need some measures which show whether the dispersion is small or large. There are two types of measure of dispersion, which are:
(a) Absolute Measure of Dispersion
(b) Relative Measure of Dispersion

Absolute Measures of Dispersion: These measures give us an idea about the amount of dispersion in a set of observations. They give the answers in the same units as the units of the original observations. When the observations are in kilograms, the absolute measure is also in kilograms. If we have two sets of observations, we cannot always use the absolute measures to compare their dispersion. We shall explain later when the absolute measures can be used for comparison of dispersion in two or more sets of data. The absolute measures which are commonly used are:
1. The Range
2. The Quartile Deviation
3. The Mean Deviation
4. The Standard Deviation and Variance

Relative Measure of Dispersion: These measures are calculated for the comparison of dispersion in two or more sets of observations. These measures are free of the units in which the original data is measured. If the original data is in dollars or kilometers, we do not use these units with a relative measure of dispersion. These measures are a sort of ratio and are called coefficients. Each absolute measure of dispersion can be converted into its relative measure. Thus the relative measures of dispersion are:
1. Coefficient of Range or Coefficient of Dispersion.
2. Coefficient of Quartile Deviation or Quartile Coefficient of Dispersion.
3. Coefficient of Mean Deviation or Mean Deviation of Dispersion.
4. Coefficient of Standard Deviation or Standard Coefficient of Dispersion.
5. Coefficient of Variation (a special case of Standard Coefficient of Dispersion).
Range and Coefficient of Range:
The Range: Range is defined as the difference between the maximum and the minimum observation of the given data. If $x_m$ denotes the maximum observation and $x_o$ denotes the minimum observation, then the range is defined as

Range $= x_m - x_o$
In case of grouped data, the range is the difference between the upper boundary of the highest class and the lower boundary of the lowest class. It is also calculated by using the difference between the mid points of the highest class and the lowest class. It is the simplest measure of dispersion. It gives a general idea about the total spread of the observations. It does not enjoy any prominent place in statistical theory. But it has its application and utility in quality control methods, which are used to maintain the quality of the products produced in factories. The quality of products is to be kept within a certain range of values.

The range is based on the two extreme observations. It gives no weight to the central values of the data. It is a poor measure of dispersion and does not give a good picture of the overall spread of the observations with respect to the center of the observations. Let us consider three groups of data which have the same range:
Group A: 30, 40, 40, 40, 40, 40, 50
Group B: 30, 30, 30, 40, 50, 50, 50
Group C: 30, 35, 40, 40, 40, 45, 50
In all three groups the range is 50 - 30 = 20. In group A the observations are concentrated in the center. In group B the observations are concentrated at the two extremes, and in group C the observations are almost equally distributed in the interval from 30 to 50. The range fails to explain these differences among the three groups of data. This defect in the range cannot be removed even if we calculate the coefficient of range, which is a relative measure of dispersion. If we calculate the range of a sample, we cannot draw any inferences about the range of the population.

Coefficient of Range: It is a relative measure of dispersion and is based on the value of the range. It is also called the range coefficient of dispersion. It is defined as:

Coefficient of Range $= \frac{x_m - x_o}{x_m + x_o}$

The range $x_m - x_o$ is standardized by the total $x_m + x_o$.
Let us take two sets of observations. Set A contains marks of five students in Mathematics out of 25 marks and set B contains marks of the same students in English out of 100 marks.
Set A: 10, 15, 18, 20, 20
Set B: 30, 35, 40, 45, 50
The values of range and coefficient of range are calculated as:

                        Range           Coefficient of Range
Set A: (Mathematics)    20 - 10 = 10    $\frac{20 - 10}{20 + 10} = 0.33$
Set B: (English)        50 - 30 = 20    $\frac{50 - 30}{50 + 30} = 0.25$

In set A the range is 10 and in set B the range is 20. Apparently it seems as if there is greater dispersion in set B. But this is not true. The range of 20 in set B is for large observations and the range of 10 in set A is for small observations. Thus 20 and 10 cannot be compared directly. Their base is not the same. Marks in Mathematics are out of 25 and marks in English are out of 100. Thus, it makes no sense to compare 10 with 20. When we convert these two values into coefficients of range, we see that the coefficient of range for set A is greater than that of set B. Thus there is greater dispersion or variation in set A. The marks of students in English are more stable than their marks in Mathematics.

Example: Following are the wages of 8 workers of a factory. Find the range and the coefficient of range. Wages in ($): 1400, 1450, 1520, 1380, 1485, 1495, 1575, 1440.
Solution: Here Largest value $x_m = 1575$ and Smallest value $x_o = 1380$
Range $= x_m - x_o = 1575 - 1380 = 195$
Coefficient of Range $= \frac{x_m - x_o}{x_m + x_o} = \frac{1575 - 1380}{1575 + 1380} = \frac{195}{2955} = 0.066$

Example: The following distribution gives the numbers of houses and the number of persons per house.

Number of Persons    1     2      3      4     5     6     7     8     9    10
Number of Houses     26    113    120    95    60    42    21    14    5    4

Calculate the range and coefficient of range.
Solution: Here Largest value $x_m = 10$ and Smallest value $x_o = 1$
Range $= x_m - x_o = 10 - 1 = 9$
Coefficient of Range $= \frac{x_m - x_o}{x_m + x_o} = \frac{10 - 1}{10 + 1} = \frac{9}{11} = 0.818$

Example: Find the range of the weight of the students of a university.

Weights (Kg)          60-62    63-65    66-68    69-71    72-74
Number of Students    5        18       42       27       8

Calculate the range and coefficient of range.
Solution:

Weights (Kg)    Class Boundaries    Mid Value    No. of Students
60-62           59.5-62.5           61           5
63-65           62.5-65.5           64           18
66-68           65.5-68.5           67           42
69-71           68.5-71.5           70           27
72-74           71.5-74.5           73           8

Method 1:
Here $x_m$ = Upper class boundary of the highest class = 74.5
$x_o$ = Lower class boundary of the lowest class = 59.5
Range $= x_m - x_o = 74.5 - 59.5 = 15$ Kilogram
Coefficient of Range $= \frac{74.5 - 59.5}{74.5 + 59.5} = \frac{15}{134} = 0.1119$

Method 2:
Here $x_m$ = Mid value of the highest class = 73
$x_o$ = Mid value of the lowest class = 61
Range $= x_m - x_o = 73 - 61 = 12$ Kilogram
Coefficient of Range $= \frac{73 - 61}{73 + 61} = \frac{12}{134} = 0.0896$
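As a quick check of the ungrouped case, the range and its coefficient can be computed in a few lines of Python. The helper name `range_and_coefficient` is hypothetical, and the data are the wages of the 8 workers from the example above.

```python
def range_and_coefficient(values):
    """Return (range, coefficient of range) for ungrouped observations."""
    largest, smallest = max(values), min(values)   # x_m and x_o
    data_range = largest - smallest
    coefficient = data_range / (largest + smallest)
    return data_range, coefficient


wages = [1400, 1450, 1520, 1380, 1485, 1495, 1575, 1440]
wage_range, wage_coefficient = range_and_coefficient(wages)
print(wage_range, round(wage_coefficient, 3))
```

The coefficient is a pure ratio, so it stays the same if all wages are converted to another currency, while the range itself does not.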
Quartile Deviation and its Coefficient:
Quartile Deviation: It is based on the lower quartile Q1 and the upper quartile Q3. The difference Q3 - Q1 is called the inter quartile range. The difference Q3 - Q1 divided by 2 is called the semi-inter-quartile range or the quartile deviation. Thus

Quartile Deviation (Q.D) $= \frac{Q_3 - Q_1}{2}$

The quartile deviation is a slightly better measure of absolute dispersion than the range. But it ignores the observations on the tails. If we take different samples from a population and calculate their quartile deviations, their values are quite likely to be sufficiently different. This is called sampling fluctuation. It is not a popular measure of dispersion. The quartile deviation calculated from the sample data does not help us to draw any conclusion (inference) about the quartile deviation in the population.

Coefficient of Quartile Deviation: A relative measure of dispersion based on the quartile deviation is called the coefficient of quartile deviation. It is defined as

Coefficient of Quartile Deviation $= \frac{Q_3 - Q_1}{Q_3 + Q_1}$

It is a pure number free of any units of measurement. It can be used for comparing the dispersion in two or more sets of data.

Example: The wheat production (in Kg) of 20 acres is given as: 1120, 1240, 1320, 1040, 1080, 1200, 1440, 1360, 1680, 1730, 1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470, 1750, and 1885. Find the quartile deviation and coefficient of quartile deviation.
Solution: After arranging the observations in ascending order, we get 1040, 1080, 1120, 1200, 1240, 1320, 1342, 1360, 1440, 1470, 1600, 1680, 1720, 1730, 1750, 1755, 1785, 1880, 1885, 1960.

$Q_1$ = Value of $\left(\frac{n+1}{4}\right)$th item = Value of $\left(\frac{20+1}{4}\right)$th item = Value of 5.25th item
$Q_1$ = 5th item + 0.25 (6th item - 5th item) = 1240 + 0.25 (1320 - 1240) = 1260

$Q_3$ = Value of $\left(\frac{3(n+1)}{4}\right)$th item = Value of $\left(\frac{3(20+1)}{4}\right)$th item = Value of 15.75th item
$Q_3$ = 15th item + 0.75 (16th item - 15th item) = 1750 + 0.75 (1755 - 1750) = 1753.75

Quartile Deviation (Q.D) $= \frac{Q_3 - Q_1}{2} = \frac{1753.75 - 1260}{2} = \frac{493.75}{2} = 246.875$
Coefficient of Quartile Deviation $= \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{1753.75 - 1260}{1753.75 + 1260} = \frac{493.75}{3013.75} = 0.164$
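The wheat example can be reproduced with a short Python sketch that uses the $(n+1)/4$ and $3(n+1)/4$ position rule with linear interpolation between neighbouring items. The helper names are hypothetical, and the code assumes at least three observations so that both positions fall inside the data. Library routines that compute quantiles may use different position conventions and give slightly different quartiles.

```python
def item_value(sorted_data, position):
    """Value of the item at a 1-based, possibly fractional, position."""
    whole = int(position)
    fraction = position - whole
    value = sorted_data[whole - 1]
    if fraction:
        # Move part of the way towards the next item.
        value += fraction * (sorted_data[whole] - sorted_data[whole - 1])
    return value


def quartile_deviation(values):
    """Return (Q1, Q3, quartile deviation, coefficient of Q.D.)."""
    data = sorted(values)
    n = len(data)
    q1 = item_value(data, (n + 1) / 4)
    q3 = item_value(data, 3 * (n + 1) / 4)
    return q1, q3, (q3 - q1) / 2, (q3 - q1) / (q3 + q1)


wheat = [1120, 1240, 1320, 1040, 1080, 1200, 1440, 1360, 1680, 1730,
         1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470, 1750, 1885]
q1, q3, qd, qd_coefficient = quartile_deviation(wheat)
print(q1, q3, qd, round(qd_coefficient, 3))
```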
Mean Deviation and its Coefficient:
The Mean Deviation: The mean deviation or the average deviation is defined as the mean of the absolute deviations of observations from some suitable average, which may be the arithmetic mean, the median or the mode. The difference (X - Average) is called the deviation, and when we ignore the negative sign, this deviation is written as |X - Average| and is read as mod deviation. The mean of these mod or absolute deviations is called the mean deviation or the mean absolute deviation. Thus for sample data in which the suitable average is the $\bar{X}$, the mean deviation (M.D.) is given by the relation:

$\text{M.D.} = \frac{\sum |X - \bar{X}|}{n}$

For a frequency distribution, the mean deviation is given by

$\text{M.D.} = \frac{\sum f|X - \bar{X}|}{\sum f}$

When the mean deviation is calculated about the median, the formula becomes

$\text{M.D. (about median)} = \frac{\sum f|X - \text{Median}|}{\sum f}$

The mean deviation about the mode is

$\text{M.D. (about mode)} = \frac{\sum f|X - \text{Mode}|}{\sum f}$

For population data the mean deviation about the population mean $\mu$ is

$\text{M.D.} = \frac{\sum |X - \mu|}{N}$

The mean deviation is a better measure of absolute dispersion than the range and the quartile deviation. A drawback in the mean deviation is that we use the absolute deviations |X - Average|, which does not seem logical. The reason for this is that $\sum (X - \bar{X})$ is always equal to zero. Even if we use the median or mode in place of $\bar{X}$, the summation $\sum (X - \text{Median})$ or $\sum (X - \text{Mode})$ will be zero or approximately zero, with the result that the mean deviation would always be either zero or close to zero. Thus the very definition of the mean deviation is possible only on the absolute deviations.

The mean deviation is based on all the observations, a property which is not possessed by the range and the quartile deviation. The formula of the mean deviation gives a mathematical impression that it is a better way of measuring the variation in the data. Any suitable average among the mean, median or mode can be used in its calculation, but the value of the mean deviation is minimum if the deviations are taken from the median. A serious drawback of the mean deviation is that it cannot be used in statistical inference.

Coefficient of the Mean Deviation: A relative measure of dispersion based on the mean deviation is called the coefficient of the mean deviation or the coefficient of dispersion. It is defined as the ratio of the mean deviation to the average used in the calculation of the mean deviation. Thus

Coefficient of M.D. (about mean) $= \frac{\text{M.D. about mean}}{\text{Mean}}$
Coefficient of M.D. (about median) $= \frac{\text{M.D. about median}}{\text{Median}}$
Coefficient of M.D. (about mode) $= \frac{\text{M.D. about mode}}{\text{Mode}}$
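Example: Calculate the mean deviation from (1) the arithmetic mean, (2) the median and (3) the mode for the marks obtained by nine students given below, and show that the mean deviation from the median is minimum.
Marks (out of 25): 7, 4, 10, 9, 15, 12, 7, 9, 7
Solution: After arranging the observations in ascending order, we get
Marks: 4, 7, 7, 7, 9, 9, 10, 12, 15

Mean $= \frac{\sum X}{n} = \frac{80}{9} = 8.89$
Median = Value of $\left(\frac{9+1}{2}\right)$th item = Value of 5th item = 9
Mode = 7 (Since 7 is repeated the maximum number of times)

Marks (X)    $|X - \bar{X}|$    $|X - \text{Median}|$    $|X - \text{Mode}|$
4            4.89               5                        3
7            1.89               2                        0
7            1.89               2                        0
7            1.89               2                        0
9            0.11               0                        2
9            0.11               0                        2
10           1.11               1                        3
12           3.11               3                        5
15           6.11               6                        8
Total        21.11              21                       23

M.D. from mean $= \frac{21.11}{9} = 2.35$
M.D. from median $= \frac{21}{9} = 2.33$
M.D. from mode $= \frac{23}{9} = 2.56$

From the above calculations, it is clear that the mean deviation from the median has the least value.

Example: Calculate the mean deviation from the mean and its coefficient from the following data.

Size of Items    3-4    4-5    5-6    6-7    7-8    8-9    9-10
Frequency        3      7      22     60     85     32     8

Solution: The necessary calculation is given below:

Size of Items    X      f      fX        $|X - \bar{X}|$    $f|X - \bar{X}|$
3-4              3.5    3      10.5      3.59               10.77
4-5              4.5    7      31.5      2.59               18.13
5-6              5.5    22     121       1.59               34.98
6-7              6.5    60     390       0.59               35.40
7-8              7.5    85     637.5     0.41               34.85
8-9              8.5    32     272       1.41               45.12
9-10             9.5    8      76        2.41               19.28
Total                   217    1538.5                       198.53

Mean $= \bar{X} = \frac{\sum fX}{\sum f} = \frac{1538.5}{217} = 7.09$
M.D. from mean $= \frac{\sum f|X - \bar{X}|}{\sum f} = \frac{198.53}{217} = 0.915$
Coefficient of M.D. (about mean) $= \frac{0.915}{7.09} = 0.129$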
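The marks example can be checked with a small Python sketch. The helper `mean_deviation` is a hypothetical name, while `statistics.mean`, `statistics.median` and `statistics.mode` are standard library functions.

```python
import statistics


def mean_deviation(values, center):
    """Mean of the absolute deviations of the values from a chosen average."""
    return sum(abs(x - center) for x in values) / len(values)


marks = [7, 4, 10, 9, 15, 12, 7, 9, 7]
md_mean = mean_deviation(marks, statistics.mean(marks))
md_median = mean_deviation(marks, statistics.median(marks))
md_mode = mean_deviation(marks, statistics.mode(marks))
print(round(md_mean, 2), round(md_median, 2), round(md_mode, 2))
```

For these marks the deviation about the median is the smallest of the three, in line with the property stated in the text.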
Standard Deviation: The standard deviation is defined as the positive square root of the mean of the squared deviations taken from the arithmetic mean of the data.

For sample data the standard deviation is denoted by $S$ and is defined as:

$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n}}$

For population data the standard deviation is denoted by $\sigma$ (sigma) and is defined as:

$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$

For a frequency distribution the formulas become

$S = \sqrt{\frac{\sum f(X - \bar{X})^2}{\sum f}}$ or $\sigma = \sqrt{\frac{\sum f(X - \mu)^2}{\sum f}}$

The standard deviation is in the same units as the units of the original observations. If the original observations are in grams, the value of the standard deviation will also be in grams. The standard deviation plays a dominating role in the study of variation in the data. It is a very widely used measure of dispersion. It stands like a tower among the measures of dispersion. As far as the important statistical tools are concerned, the first important tool is the mean $\bar{X}$ and the second important tool is the standard deviation $S$. It is based on all the observations and is subject to mathematical treatment. It is of great importance for the analysis of data and for the various statistical inferences. However, some alternative methods are also available to compute the standard deviation. The alternative methods simplify the computation. In discussing these methods we will confine ourselves to sample data, because a statistician mostly works with sample data rather than the whole population.

Actual Mean Method:
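In applying this method, first of all we compute the arithmetic mean of the given data, either ungrouped or grouped. Then we take the deviations from the actual mean. This method is already defined above. The following formulas are applied:

For Ungrouped Data: $S = \sqrt{\frac{\sum (X - \bar{X})^2}{n}}$
For Grouped Data: $S = \sqrt{\frac{\sum f(X - \bar{X})^2}{\sum f}}$

This method is also known as the direct method.

Assumed Mean Method:
(a) We use the following formulas to calculate the standard deviation:

For Ungrouped Data: $S = \sqrt{\frac{\sum D^2}{n} - \left(\frac{\sum D}{n}\right)^2}$
For Grouped Data: $S = \sqrt{\frac{\sum fD^2}{\sum f} - \left(\frac{\sum fD}{\sum f}\right)^2}$

Where D = X - A and A is any assumed mean other than zero. This method is also known as the short-cut method.

(b) If A is considered to be zero, then the above formulas reduce to the following formulas:

For Ungrouped Data: $S = \sqrt{\frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2}$
For Grouped Data: $S = \sqrt{\frac{\sum fX^2}{\sum f} - \left(\frac{\sum fX}{\sum f}\right)^2}$

(c) If we are in a position to simplify the calculation by taking some common factor or divisor from the given data, the formulas for computing the standard deviation are:

For Ungrouped Data: $S = c \times \sqrt{\frac{\sum u^2}{n} - \left(\frac{\sum u}{n}\right)^2}$
For Grouped Data: $S = h \times \sqrt{\frac{\sum fu^2}{\sum f} - \left(\frac{\sum fu}{\sum f}\right)^2}$

Where $u = \frac{X - A}{c}$ (ungrouped data) or $u = \frac{X - A}{h}$ (grouped data), h = Class Interval and c = Common Divisor. This method is also called the method of step-deviation.

Examples of Standard Deviation:
Example: Calculate the standard deviation for the following sample data using all methods: 2, 4, 8, 6, 10, and 12.
Solution:
Method-I: Actual Mean Method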
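X            $(X - \bar{X})^2$
2            $(2-7)^2 = 25$
4            $(4-7)^2 = 9$
8            $(8-7)^2 = 1$
6            $(6-7)^2 = 1$
10           $(10-7)^2 = 9$
12           $(12-7)^2 = 25$
$\sum X = 42$    $\sum (X - \bar{X})^2 = 70$

$\bar{X} = \frac{\sum X}{n} = \frac{42}{6} = 7$
$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n}} = \sqrt{\frac{70}{6}} = \sqrt{11.67} = 3.42$

Method-II: Taking Assumed Mean as 6

X        D = (X - 6)     $D^2$
2        -4              16
4        -2              4
8        2               4
6        0               0
10       4               16
12       6               36
Total    $\sum D = 6$    $\sum D^2 = 76$

$S = \sqrt{\frac{\sum D^2}{n} - \left(\frac{\sum D}{n}\right)^2} = \sqrt{\frac{76}{6} - \left(\frac{6}{6}\right)^2} = \sqrt{12.67 - 1} = \sqrt{11.67} = 3.42$

Method-III: Taking Assumed Mean as Zero

X                $X^2$
2                4
4                16
8                64
6                36
10               100
12               144
$\sum X = 42$    $\sum X^2 = 364$

$S = \sqrt{\frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2} = \sqrt{\frac{364}{6} - \left(\frac{42}{6}\right)^2} = \sqrt{60.67 - 49} = \sqrt{11.67} = 3.42$

Method-IV: Taking 2 as common divisor or factor

X        u = (X - 4)/2    $u^2$
2        -1               1
4        0                0
8        2                4
6        1                1
10       3                9
12       4                16
Total    $\sum u = 9$     $\sum u^2 = 31$

$S = c \times \sqrt{\frac{\sum u^2}{n} - \left(\frac{\sum u}{n}\right)^2} = 2 \times \sqrt{\frac{31}{6} - \left(\frac{9}{6}\right)^2} = 2 \times \sqrt{5.17 - 2.25} = 2 \times 1.71 = 3.42$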
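The four methods are algebraically equivalent, which a short Python sketch can confirm for the sample 2, 4, 8, 6, 10, 12. The variable names are illustrative only. The divisor is n, following the formulas used in this unit.

```python
from math import sqrt

data = [2, 4, 8, 6, 10, 12]
n = len(data)

# Method I: deviations from the actual mean
mean = sum(data) / n
sd_actual = sqrt(sum((x - mean) ** 2 for x in data) / n)

# Method II: deviations D = X - A from an assumed mean A = 6
A = 6
d = [x - A for x in data]
sd_assumed = sqrt(sum(v * v for v in d) / n - (sum(d) / n) ** 2)

# Method III: assumed mean taken as zero
sd_zero = sqrt(sum(x * x for x in data) / n - (sum(data) / n) ** 2)

# Method IV: step deviations u = (X - 4) / 2 with common divisor c = 2
c = 2
u = [(x - 4) / c for x in data]
sd_step = c * sqrt(sum(v * v for v in u) / n - (sum(u) / n) ** 2)

print(round(sd_actual, 2), round(sd_assumed, 2),
      round(sd_zero, 2), round(sd_step, 2))
```

The shortcut methods mainly reduce hand arithmetic. In floating-point code, the direct method is usually the safer choice when the values are large compared with their spread, because subtracting two nearly equal quantities loses precision.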
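Example: Calculate the standard deviation from the following distribution of marks by using all the methods.

Marks              1-3    3-5    5-7    7-9
No. of Students    40     30     20     10

Solution:
Method-I: Actual Mean Method

Marks    f      X    fX     $(X - \bar{X})^2$    $f(X - \bar{X})^2$
1-3      40     2    80     4                    160
3-5      30     4    120    0                    0
5-7      20     6    120    4                    80
7-9      10     8    80     16                   160
Total    100         400                         400

$\bar{X} = \frac{\sum fX}{\sum f} = \frac{400}{100} = 4$
$S = \sqrt{\frac{\sum f(X - \bar{X})^2}{\sum f}} = \sqrt{\frac{400}{100}} = \sqrt{4} = 2$ Marks

Method-II: Taking assumed mean as 2

Marks    f      X    D = (X - 2)    fD     $fD^2$
1-3      40     2    0              0      0
3-5      30     4    2              60     120
5-7      20     6    4              80     320
7-9      10     8    6              60     360
Total    100                        200    800

$S = \sqrt{\frac{\sum fD^2}{\sum f} - \left(\frac{\sum fD}{\sum f}\right)^2} = \sqrt{\frac{800}{100} - \left(\frac{200}{100}\right)^2} = \sqrt{8 - 4} = 2$ Marks

Method-III: Using assumed mean as Zero

Marks    f      X    fX     $fX^2$
1-3      40     2    80     160
3-5      30     4    120    480
5-7      20     6    120    720
7-9      10     8    80     640
Total    100         400    2000

$S = \sqrt{\frac{\sum fX^2}{\sum f} - \left(\frac{\sum fX}{\sum f}\right)^2} = \sqrt{\frac{2000}{100} - \left(\frac{400}{100}\right)^2} = \sqrt{20 - 16} = 2$ Marks

Method-IV: By taking 2 as the common divisor (with assumed mean 6, as the tabulated values of u show)

Marks    f      X    u = (X - 6)/2    fu      $fu^2$
1-3      40     2    -2               -80     160
3-5      30     4    -1               -30     30
5-7      20     6    0                0       0
7-9      10     8    1                10      10
Total    100                          -100    200

$S = h \times \sqrt{\frac{\sum fu^2}{\sum f} - \left(\frac{\sum fu}{\sum f}\right)^2} = 2 \times \sqrt{\frac{200}{100} - \left(\frac{-100}{100}\right)^2} = 2 \times \sqrt{2 - 1} = 2$ Marks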
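For the grouped marks distribution, the direct method and the step-deviation method can be compared with the Python sketch below. The function names are hypothetical, the class mid-points stand in for the observations, and the step deviations use assumed mean 6 and class interval 2, which matches the tabulated u values -2, -1, 0, 1.

```python
from math import sqrt


def grouped_sd(midpoints, freqs):
    """Mean and standard deviation of grouped data by the direct method."""
    total = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / total
    variance = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / total
    return mean, sqrt(variance)


def grouped_sd_step(midpoints, freqs, assumed_mean, h):
    """Standard deviation of grouped data by the step-deviation method."""
    total = sum(freqs)
    u = [(x - assumed_mean) / h for x in midpoints]
    sum_fu = sum(f * v for v, f in zip(u, freqs))
    sum_fu2 = sum(f * v * v for v, f in zip(u, freqs))
    return h * sqrt(sum_fu2 / total - (sum_fu / total) ** 2)


midpoints = [2, 4, 6, 8]
students = [40, 30, 20, 10]
marks_mean, sd_direct = grouped_sd(midpoints, students)
sd_by_steps = grouped_sd_step(midpoints, students, assumed_mean=6, h=2)
print(marks_mean, sd_direct, sd_by_steps)
```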
Mathematical Properties of Standard Deviation:
1. Like the combined mean, the combined standard deviation can also be found from its component parts. The following formula can be used to calculate the combined standard deviation:

$\sigma_{12} = \sqrt{\frac{N_1(\sigma_1^2 + d_1^2) + N_2(\sigma_2^2 + d_2^2)}{N_1 + N_2}}$

Here
$\sigma_{12}$ = Combined standard deviation
$\sigma_1$ = Standard deviation of the first group of values
$\sigma_2$ = Standard deviation of the second group of values
$d_1 = \bar{X}_1 - \bar{X}_{12}$ (deviation of the mean of the first group from the combined mean)
$d_2 = \bar{X}_2 - \bar{X}_{12}$ (deviation of the mean of the second group from the combined mean)
$\bar{X}_1$ = Mean of the first group of values
$\bar{X}_2$ = Mean of the second group of values
$\bar{X}_{12}$ = Combined mean

2. In a normal distribution, the following relationships hold good:
Mean $\pm 1\sigma$ covers 68.27 % of the items
Mean $\pm 2\sigma$ covers 95.45 % of the items
Mean $\pm 3\sigma$ covers 99.73 % of the items
Mean $\pm 1.96\sigma$ covers 95 % of the items

Coefficient of Standard Deviation:
The standard deviation is the absolute measure of dispersion. Its relative measure is called the standard coefficient of dispersion or the coefficient of standard deviation. It is defined as:

Coefficient of Standard Deviation $= \frac{S}{\bar{X}}$

Coefficient of Variation: The most important of all the relative measures of dispersion is the coefficient of variation. This word is variation, not variance. There is no such thing as a coefficient of variance. The coefficient of variation (C.V.) is defined as:

Coefficient of Variation $\text{C.V.} = \frac{S}{\bar{X}} \times 100$
Thus C.V. is the value of $S$ when $\bar{X}$ is assumed equal to 100. It is a pure number and the unit of observations is not mentioned with its value. It is written in percentage form like 20% or 25%. When its value is 20%, it means that when the mean of the observations is assumed equal to 100, their standard deviation will be 20. The C.V. is used to compare the dispersion in different sets of data, particularly data which differ in their means or in their units of measurement. The wages of workers may be in dollars and the consumption of meat in their families may be in kilograms. The standard deviation of wages in dollars cannot be compared with the standard deviation of amounts of meat in kilograms. Both standard deviations need to be converted into coefficients of variation for comparison. Suppose the value of C.V. for wages is 10% and the value of C.V. for kilograms of meat is 25%. This means that the wages of workers are consistent. Their wages are close to the overall average of their wages. But the families consume meat in quite different quantities. Some families use very small quantities of meat and some others use large quantities of meat. We say that there is greater variation in their consumption of meat. The observations about the quantity of meat are more dispersed or more variable.

Example: Calculate the coefficient of standard deviation and coefficient of variation for the following sample data: 2, 4, 8, 6, 10, and 12.
Solution:

X                $(X - \bar{X})^2$
2                $(2-7)^2 = 25$
4                $(4-7)^2 = 9$
8                $(8-7)^2 = 1$
6                $(6-7)^2 = 1$
10               $(10-7)^2 = 9$
12               $(12-7)^2 = 25$
$\sum X = 42$    $\sum (X - \bar{X})^2 = 70$

$\bar{X} = \frac{42}{6} = 7$, $S = \sqrt{\frac{70}{6}} = 3.42$
Coefficient of Standard Deviation $= \frac{S}{\bar{X}} = \frac{3.42}{7} = 0.49$
Coefficient of Variation $= \frac{S}{\bar{X}} \times 100 = \frac{3.42}{7} \times 100 = 48.8\%$

Example: Calculate the coefficient of standard deviation and coefficient of variation from the following distribution of marks:

Marks              1-3    3-5    5-7    7-9
No. of Students    40     30     20     10

Solution:

Marks    f      X    fX     $(X - \bar{X})^2$    $f(X - \bar{X})^2$
1-3      40     2    80     4                    160
3-5      30     4    120    0                    0
5-7      20     6    120    4                    80
7-9      10     8    80     16                   160
Total    100         400                         400

$\bar{X} = \frac{400}{100} = 4$, $S = \sqrt{\frac{400}{100}} = 2$ Marks
Coefficient of Standard Deviation $= \frac{S}{\bar{X}} = \frac{2}{4} = 0.5$
Coefficient of Variation $= \frac{S}{\bar{X}} \times 100 = \frac{2}{4} \times 100 = 50\%$
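A short Python sketch of both coefficients for ungrouped data follows. The function name `coefficient_of_variation` is hypothetical. It uses `statistics.pstdev`, which divides by n as the formulas in this unit do, and `statistics.fmean`, which assumes Python 3.8 or later.

```python
import statistics


def coefficient_of_variation(values):
    """Return (coefficient of standard deviation, C.V. in percent)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)    # divisor n, matching S in the text
    return sd / mean, 100 * sd / mean


sample = [2, 4, 8, 6, 10, 12]
sd_coefficient, cv_percent = coefficient_of_variation(sample)
print(round(sd_coefficient, 2), round(cv_percent, 1))
```

Because the result is a ratio, rescaling every observation (for example grams to kilograms) leaves both coefficients unchanged.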
Uses of Coefficient of Variation:
The coefficient of variation is used to judge the consistency of the data. By consistency we mean the uniformity of the values of the data/distribution about the arithmetic mean of the data/distribution. A distribution with a smaller C.V. than another is taken as more consistent than the other. C.V. is also very useful when comparing two or more sets of data that are measured in different units of measurement.
The Variance: Variance is another absolute measure of dispersion. It is defined as the average of the squared differences between each of the observations in a set of data and the mean. For sample data the variance is denoted by $S^2$ and the population variance is denoted by $\sigma^2$ (sigma square). The sample variance $S^2$ has the formula:

$S^2 = \frac{\sum (X - \bar{X})^2}{n}$

Where $\bar{X}$ is the sample mean and n is the number of observations in the sample. The population variance $\sigma^2$ is defined as:

$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$

Where $\mu$ is the mean of the population and N is the number of observations in the data. It may be remembered that the population variance $\sigma^2$ is usually not calculated. The sample variance $S^2$ is calculated and, if need be, this $S^2$ is used to make inferences about the population variance.

The term $\sum (X - \bar{X})^2$ cannot be negative, therefore $S^2$ is never negative. If the original observations are in centimetres, the value of the variance will be in (centimetres)$^2$. Thus the unit of $S^2$ is the square of the units of the original measurement. For a frequency distribution the sample variance $S^2$ is defined as:

$S^2 = \frac{\sum f(X - \bar{X})^2}{\sum f}$

For a frequency distribution the population variance $\sigma^2$ is defined as:

$\sigma^2 = \frac{\sum f(X - \mu)^2}{\sum f}$

In simple words we can say that variance is the square of the standard deviation.
Variance = (Standard Deviation)$^2$

Example: Calculate the variance for the following sample data: 2, 4, 8, 6, 10, and 12.
Solution:

X                $(X - \bar{X})^2$
2                $(2-7)^2 = 25$
4                $(4-7)^2 = 9$
8                $(8-7)^2 = 1$
6                $(6-7)^2 = 1$
10               $(10-7)^2 = 9$
12               $(12-7)^2 = 25$
$\sum X = 42$    $\sum (X - \bar{X})^2 = 70$

$\bar{X} = \frac{42}{6} = 7$
$S^2 = \frac{\sum (X - \bar{X})^2}{n} = \frac{70}{6} = 11.67$

Example: Calculate the variance from the following distribution of marks:

Marks              1-3    3-5    5-7    7-9
No. of Students    40     30     20     10

Solution:

Marks    f      X    fX     $(X - \bar{X})^2$    $f(X - \bar{X})^2$
1-3      40     2    80     4                    160
3-5      30     4    120    0                    0
5-7      20     6    120    4                    80
7-9      10     8    80     16                   160
Total    100         400                         400

$\bar{X} = \frac{\sum fX}{\sum f} = \frac{400}{100} = 4$
$S^2 = \frac{\sum f(X - \bar{X})^2}{\sum f} = \frac{400}{100} = 4$
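When checking these variances in Python, the choice of library function matters. In the standard `statistics` module, `pvariance` divides by n, which matches the formula used in this unit, while `variance` divides by n - 1 and gives a larger value for the same data. The sketch below shows both for the ungrouped example and computes the grouped variance directly; the variable names are illustrative.

```python
import statistics

data = [2, 4, 8, 6, 10, 12]
variance_n = statistics.pvariance(data)          # divisor n:     70 / 6
variance_n_minus_1 = statistics.variance(data)   # divisor n - 1: 70 / 5

# Grouped data: class mid-points weighted by their frequencies
midpoints = [2, 4, 6, 8]
freqs = [40, 30, 20, 10]
total = sum(freqs)
grouped_mean = sum(f * x for x, f in zip(midpoints, freqs)) / total
grouped_variance = sum(
    f * (x - grouped_mean) ** 2 for x, f in zip(midpoints, freqs)
) / total

print(round(variance_n, 2), variance_n_minus_1, grouped_variance)
```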
Sheppard Corrections and Corrected Coefficient of Variation:
Sheppard Corrections: In grouped data the different observations are put into the same class. In the calculation of the variance or standard deviation for grouped data, the frequency f is multiplied with X, which is the mid-point of the respective class. Thus it is assumed that all the observations in a class are centered at X. But this is not true, because the observations are spread over the said class. This assumption introduces some error in the calculation of $S^2$ and $S$. The values of $S^2$ and $S$ can be corrected to some extent by applying the Sheppard correction. Thus

$S^2(\text{corrected}) = S^2 - \frac{h^2}{12}$ and $S(\text{corrected}) = \sqrt{S^2 - \frac{h^2}{12}}$

Where h is the uniform class interval. This correction is applied to grouped data which have almost equal tails at the start and at the end of the data. If the data have a longer tail on either side, this correction is not applied. If the size of the class interval h is not the same in all classes, the correction is not applicable.

Corrected Coefficient of Variation: When the corrected standard deviation is used in the calculation of the coefficient of variation, we get what is called the corrected coefficient of variation. Thus

Corrected Coefficient of Variation $= \frac{S(\text{corrected})}{\bar{X}} \times 100$
Example: Calculate Sheppard's correction and the corrected coefficient of variation from the following distribution of marks by using all the methods.
Marks            1-3  3-5  5-7  7-9
No. of Students  40   30   20   10
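Solution:
Marks   f     X   fX    u = (X - 6)/2   fu     fu^2
1-3     40    2   80    -2              -80    160
3-5     30    4   120   -1              -30    30
5-7     20    6   120   0               0      0
7-9     10    8   80    1               10     10
Total   100       400                   -100   200
Here \(h = 2\) and \(\bar{X} = \sum fX / \sum f = 400/100 = 4\). By the step-deviation method,
\[ S^2 = h^2 \left[ \frac{\sum fu^2}{\sum f} - \left( \frac{\sum fu}{\sum f} \right)^2 \right] = 4 \left[ \frac{200}{100} - \left( \frac{-100}{100} \right)^2 \right] = 4 \]
\[ S^2(\text{corrected}) = 4 - \frac{2^2}{12} = 3.667, \qquad S(\text{corrected}) = \sqrt{3.667} = 1.915 \]
\[ \text{Corrected Coefficient of Variation} = \frac{1.915}{4} \times 100 = 47.87\,\% \]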
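The Sheppard correction for the marks distribution can be sketched in Python as follows. This is a minimal illustration, assuming a uniform class width h and a variance computed with the total frequency as divisor. The function name `sheppard_corrected_cv` is hypothetical.

```python
import math


def sheppard_corrected_cv(midpoints, freqs, h):
    """Return corrected variance, corrected S.D. and corrected C.V. (in %)."""
    total = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / total
    var = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / total
    corrected_var = var - h ** 2 / 12      # Sheppard's correction
    corrected_sd = math.sqrt(corrected_var)
    corrected_cv = 100 * corrected_sd / mean
    return corrected_var, corrected_sd, corrected_cv


# Marks distribution: mid-points 2, 4, 6, 8 with class width h = 2
corrected_var, corrected_sd, corrected_cv = sheppard_corrected_cv(
    [2, 4, 6, 8], [40, 30, 20, 10], h=2
)
print(round(corrected_var, 3), round(corrected_sd, 3), round(corrected_cv, 2))
```

The correction is only meaningful when the conditions stated above hold (equal class widths and roughly equal tails). The code does not check them.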
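Combined Variance:
Like the combined mean, the combined variance or standard deviation can be calculated for different sets of data. Suppose we have two sets of data containing \(n_1\) and \(n_2\) observations with means \(\bar{X}_1\) and \(\bar{X}_2\), and variances \(S_1^2\) and \(S_2^2\). If \(\bar{X}_c\) is the combined mean and \(S_c^2\) is the combined variance of the \(n_1 + n_2\) observations, then the combined variance is given by
\[ S_c^2 = \frac{n_1 \left[ S_1^2 + (\bar{X}_1 - \bar{X}_c)^2 \right] + n_2 \left[ S_2^2 + (\bar{X}_2 - \bar{X}_c)^2 \right]}{n_1 + n_2} \]
It can be written as
\[ S_c^2 = \frac{n_1 S_1^2 + n_2 S_2^2 + n_1 (\bar{X}_1 - \bar{X}_c)^2 + n_2 (\bar{X}_2 - \bar{X}_c)^2}{n_1 + n_2} \]
Where \(\bar{X}_c = \dfrac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}\). The combined standard deviation can be calculated by taking the square root of \(S_c^2\).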
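Example: For a group of 50 male workers the mean and standard deviation of their daily wages are $ 63 and $ 9 respectively. For a group of 40 female workers these values are $ 54 and $ 6 respectively. Find the mean and variance of the combined group of 90 workers.
Solution: Here \(n_1 = 50\), \(\bar{X}_1 = 63\), \(S_1^2 = 81\), \(n_2 = 40\), \(\bar{X}_2 = 54\), \(S_2^2 = 36\).
Combined Arithmetic Mean
\[ \bar{X}_c = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2} = \frac{50(63) + 40(54)}{50 + 40} = \frac{5310}{90} = 59 \]
Combined Variance
\[ S_c^2 = \frac{50(81) + 40(36) + 50(63 - 59)^2 + 40(54 - 59)^2}{90} = \frac{4050 + 1440 + 800 + 1000}{90} = \frac{7290}{90} = 81 \]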
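The combined mean and combined variance of the wage example can be reproduced with a small Python sketch. The function name `combined_mean_variance` is illustrative. It takes group variances (not standard deviations) as input.

```python
def combined_mean_variance(n1, mean1, var1, n2, mean2, var2):
    """Combined mean and variance of two groups with known sizes, means and variances."""
    n = n1 + n2
    mean_c = (n1 * mean1 + n2 * mean2) / n
    var_c = (
        n1 * (var1 + (mean1 - mean_c) ** 2)
        + n2 * (var2 + (mean2 - mean_c) ** 2)
    ) / n
    return mean_c, var_c


# 50 male workers (mean 63, S.D. 9) and 40 female workers (mean 54, S.D. 6)
mean_c, var_c = combined_mean_variance(50, 63, 9 ** 2, 40, 54, 6 ** 2)
print(mean_c, var_c, var_c ** 0.5)
```

The last printed value is the combined standard deviation, obtained as the square root of the combined variance.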
Relationship Between the Measures of Dispersion: For a normal (or approximately normal) distribution:
1. Quartile deviation is 0.6745 (almost 2/3) of the standard deviation.
2. Mean deviation is 0.7979 (almost 4/5) of the standard deviation.
3. Arithmetic Mean ± Quartile Deviation covers 50 % of the items.
4. Arithmetic Mean ± Mean Deviation covers 57.51 % of the items.
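These four figures can be checked numerically, assuming the data follow a normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) and the known result that the mean absolute deviation of a standard normal variable is the square root of 2/pi. The variable names are illustrative.

```python
import math
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Quartile deviation = (Q3 - Q1) / 2, which equals Q3 for a distribution symmetric about 0
qd_ratio = z.inv_cdf(0.75)

# Mean deviation of a standard normal variable
md_ratio = math.sqrt(2 / math.pi)

# Proportion of items within mean ± Q.D. and mean ± M.D.
coverage_qd = z.cdf(qd_ratio) - z.cdf(-qd_ratio)
coverage_md = z.cdf(md_ratio) - z.cdf(-md_ratio)

print(round(qd_ratio, 4), round(md_ratio, 4))
print(round(100 * coverage_qd, 2), round(100 * coverage_md, 2))
```

For distributions that are far from normal, these ratios and coverages need not hold.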
Skewness: Lack of symmetry is called skewness. If a distribution is not symmetrical, it is called a skewed distribution. In a skewed distribution the mean, median and mode differ in value and one tail is longer than the other. The skewness may be positive or negative.
Symmetrical Distribution: In a symmetrical distribution the mean, median and mode are equal.
Positively skewed distribution: If the frequency curve has a longer tail to the right, the distribution is known as positively skewed, and Mean > Median > Mode.
Negatively skewed distribution: If the frequency curve has a longer tail to the left, the distribution is known as negatively skewed, and Mean < Median < Mode.
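Measure of Skewness: The difference between the mean and the mode gives an absolute measure of skewness. If we divide this difference by the standard deviation, we obtain a relative measure of skewness known as the coefficient of skewness, denoted by SK.
(i) Karl Pearson's coefficient of skewness
\[ SK = \frac{\text{Mean} - \text{Mode}}{\text{S.D.}} \]
Sometimes the mode is difficult to find, so we use another formula:
\[ SK = \frac{3(\text{Mean} - \text{Median})}{\text{S.D.}} \]
(ii) Bowley's coefficient of skewness
\[ SK = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1} \]
where \(Q_2\) is the median.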
(iii) Kelly's coefficient of skewness
\[ SK = \frac{P_{90} + P_{10} - 2P_{50}}{P_{90} - P_{10}} \quad \text{(based on percentiles)} \]
\[ SK = \frac{D_9 + D_1 - 2D_5}{D_9 - D_1} \quad \text{(based on deciles)} \]
Important Notes:
1. Karl Pearson's method is more suitable for closed-ended frequency distributions.
2. The methods of Bowley and Kelly are more suitable if the class intervals are open-ended.
Moments: Moments are used to describe the characteristics of a distribution. They tell us whether a distribution is symmetrical or not. They also tell us about the shape of the curve, i.e. whether it is normal, flatter than the normal curve, or more peaked than the normal curve. So we can also study the skewness of a distribution with the help of moments.
MOMENTS ABOUT THE MEAN:
In case of individual observations:
First moment about the mean = \(\mu_1 = \dfrac{\sum (X - \bar{X})}{N}\)
Second moment about the mean = \(\mu_2 = \dfrac{\sum (X - \bar{X})^2}{N}\)
Third moment about the mean = \(\mu_3 = \dfrac{\sum (X - \bar{X})^3}{N}\)
Fourth moment about the mean = \(\mu_4 = \dfrac{\sum (X - \bar{X})^4}{N}\)
In case of discrete and continuous series (with \(N = \sum f\)):
First moment about the mean = \(\mu_1 = \dfrac{\sum f (X - \bar{X})}{N}\)
Second moment about the mean = \(\mu_2 = \dfrac{\sum f (X - \bar{X})^2}{N}\)
Third moment about the mean = \(\mu_3 = \dfrac{\sum f (X - \bar{X})^3}{N}\)
Fourth moment about the mean = \(\mu_4 = \dfrac{\sum f (X - \bar{X})^4}{N}\)
Here \(\mu_1, \mu_2, \mu_3, \mu_4\) represent the first, second, third and fourth moments about the mean. Although moments can be extended to higher powers such as 5 or 6, in practice the first four moments are sufficient to describe the characteristics of a frequency distribution.
Moments about an Arbitrary Value (Other than the Mean): In practice, when the mean of a frequency distribution is not an integer (i.e. it is a fraction), calculation becomes quite difficult because we have to calculate the 2nd, 3rd and 4th powers of the deviations from the mean. In this situation the moments are first calculated about an arbitrary origin and then converted to moments about the actual mean. When the deviations are taken from an arbitrary point A, the formulas are:
\(\mu'_1 = \dfrac{\sum (X - A)}{N}\)
\(\mu'_2 = \dfrac{\sum (X - A)^2}{N}\)
\(\mu'_3 = \dfrac{\sum (X - A)^3}{N}\)
\(\mu'_4 = \dfrac{\sum (X - A)^4}{N}\)
Here \(\mu'_1, \mu'_2, \mu'_3, \mu'_4\) represent the first, second, third and fourth moments about the arbitrary point A. For discrete and continuous series the formulas are:
\(\mu'_1 = \dfrac{\sum f (X - A)}{N}\)
\(\mu'_2 = \dfrac{\sum f (X - A)^2}{N}\)
\(\mu'_3 = \dfrac{\sum f (X - A)^3}{N}\)
\(\mu'_4 = \dfrac{\sum f (X - A)^4}{N}\)
Finding Central Moments from Moments about an Arbitrary Point: By using the following relationships, we can convert the moments about an arbitrary value into moments about the mean.
\(\mu_1 = 0\)
\(\mu_2 = \mu'_2 - (\mu'_1)^2\)
\(\mu_3 = \mu'_3 - 3\mu'_1 \mu'_2 + 2(\mu'_1)^3\)
\(\mu_4 = \mu'_4 - 4\mu'_1 \mu'_3 + 6\mu'_2 (\mu'_1)^2 - 3(\mu'_1)^4\)
To study the nature and characteristics of a frequency distribution, the following two constants are calculated from the moments.
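The conversion from moments about an arbitrary point to moments about the mean can be illustrated with a short Python sketch. It reuses the marks distribution from the variance examples (mid-points 2, 4, 6, 8 with frequencies 40, 30, 20, 10) and takes A = 6 as the arbitrary origin. Both the choice of data and the function names are assumptions made for illustration.

```python
def moments_about(values, freqs, a):
    """First four moments of a frequency distribution about the point a."""
    n = sum(freqs)
    return [
        sum(f * (x - a) ** r for x, f in zip(values, freqs)) / n
        for r in (1, 2, 3, 4)
    ]


def central_from_raw(m1, m2, m3, m4):
    """Convert moments about an arbitrary point into moments about the mean."""
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    mu4 = m4 - 4 * m1 * m3 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
    return 0.0, mu2, mu3, mu4


values = [2, 4, 6, 8]
freqs = [40, 30, 20, 10]

raw = moments_about(values, freqs, a=6)   # moments about A = 6
central = central_from_raw(*raw)          # converted to moments about the mean

# Cross-check: compute the moments directly about the mean
mean = sum(f * x for x, f in zip(values, freqs)) / sum(freqs)
direct = moments_about(values, freqs, a=mean)

print(raw)
print(central)
print(direct)
```

The converted values and the directly computed values should agree up to floating-point rounding, whichever origin A is chosen.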
(i) \(\beta_1 = \dfrac{\mu_3^2}{\mu_2^3}\)
(ii) \(\beta_2 = \dfrac{\mu_4}{\mu_2^2}\)
Here \(\beta_1\) is the measure of skewness and tells us whether a distribution is skewed or not. In a symmetrical distribution \(\beta_1\) is zero. The reasoning is as follows. The sum of the deviations from the mean is always zero, so the first moment \(\mu_1\) is zero. In the second moment about the mean (\(\mu_2\)) the deviations are squared, so the plus (+) and minus (-) signs disappear and the total is a positive figure. In the third moment (\(\mu_3\)) the signs come back, and in a symmetrical distribution the positive and negative cubes cancel, so the sum is again zero and hence \(\beta_1 = 0\). Thus all the odd moments are zero if the distribution is symmetrical, but if the distribution is not symmetrical the odd moments beyond the first will generally have a positive or negative value. Still, \(\beta_1\) as a measure of skewness has a serious limitation: it never tells us the direction of skewness, because \(\beta_1\) can never be negative. This drawback can be removed by using Karl Pearson's gamma one (\(\gamma_1\)) coefficient.
\[ \gamma_1 = \pm\sqrt{\beta_1} = \frac{\mu_3}{\mu_2^{3/2}} \]
From the formula given above, if the value of \(\mu_3\) is negative the skewness is negative, and if it is positive the skewness is positive. \(\beta_2\) is the measure of kurtosis. For a normal distribution \(\beta_2 = 3\), and the distribution is called mesokurtic. If \(\beta_2\) is greater than 3, the curve is more peaked than the normal curve, i.e. leptokurtic. If \(\beta_2\) is less than 3, the curve is less peaked than the normal curve, i.e. platykurtic.
Kurtosis: Kurtosis refers to the degree of flatness or peakedness of a frequency distribution. If the curve is more peaked than the normal curve, it is called a leptokurtic curve. If it is flatter than the normal curve, it is called a platykurtic curve. The normal curve is called the mesokurtic curve.
Figure: Leptokurtic Curve, Mesokurtic Curve and Platykurtic Curve.
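The skewness and kurtosis constants can be computed from the central moments with a short Python sketch. It is applied here to the marks distribution used in the variance examples (mid-points 2, 4, 6, 8 with frequencies 40, 30, 20, 10). The function names and the small tolerance used to decide whether beta-two equals 3 are assumptions made for illustration.

```python
def shape_constants(values, freqs):
    """Return (beta1, gamma1, beta2) for a frequency distribution."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    mu2 = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n
    mu3 = sum(f * (x - mean) ** 3 for x, f in zip(values, freqs)) / n
    mu4 = sum(f * (x - mean) ** 4 for x, f in zip(values, freqs)) / n
    beta1 = mu3 ** 2 / mu2 ** 3     # size of skewness, never negative
    gamma1 = mu3 / mu2 ** 1.5       # carries the sign of mu3
    beta2 = mu4 / mu2 ** 2          # kurtosis, compared with 3
    return beta1, gamma1, beta2


def kurtosis_label(beta2, tol=1e-9):
    """Classify a curve by comparing beta2 with 3."""
    if abs(beta2 - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if beta2 > 3 else "platykurtic"


beta1, gamma1, beta2 = shape_constants([2, 4, 6, 8], [40, 30, 20, 10])
label = kurtosis_label(beta2)
print(round(beta1, 2), round(gamma1, 2), round(beta2, 2), label)
```

A positive gamma-one indicates a longer tail to the right, which matches the falling frequencies 40, 30, 20, 10 of this distribution.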
