BIOSTATISTICS
Wednesday, 15th Feb. 2012
Dr. Atallah Z. Rabi

Slide (19): The data types

As we have discussed before, the data are numbers about what we are interested in, like the weight, the blood pressure, the number of students or the grades of the students; so the data we collect are numbers. If I want to know your weight, either I will ask you or I will weigh you, and the numbers collected this way are called weight data. Data is divided into two types: one is continuous data and the other is categorical, or discrete, data.

Categories means that we usually have groups. For example, male or female is a category, so a person belongs either to the male category or to the female category, but not to both. Another example is the human blood group in the ABO system, and the state of health is also an example: healthy or sick.

As you notice, discrete data is usually mutually exclusive, which means that if a person belongs to one category, he belongs to that category only. A person can't be male and female at the same time, and we can't have blood group A and blood group B at the same time. So if a person is in one category, he doesn't belong to any other category, and the category counts add up to the total of the population.

Let's say that I have 100 people: 30 of them belong to blood group A, 18 to blood group B and 32 to blood group AB. A smart person like you will automatically know how many people have blood group O: there are 20. Because the categories are mutually exclusive, if 80 people out of 100 have A, B or AB, the remaining 20 must be O. So this is what we mean by mutually exclusive; and the categories usually cover everybody, so all of the population is included and we have no exclusions.
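
The blood-group arithmetic above can be checked with a minimal Python sketch. The counts are the ones from the example; the variable names are only illustrative and are not part of the lecture.

```python
# Blood-group counts from the lecture's example of 100 people.
total_people = 100
known_counts = {"A": 30, "B": 18, "AB": 32}

# The categories are mutually exclusive and cover everybody,
# so whoever is not A, B or AB must be O.
count_o = total_people - sum(known_counts.values())
counts = {**known_counts, "O": count_o}

print(count_o)  # 20

# The category counts add up to the whole population.
assert sum(counts.values()) == total_people
```

Because each person falls into exactly one category, knowing any three counts and the total fixes the fourth.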

So this is what we call categorical or discrete data: there is no continuity in the numbers, so we have 18 individuals with blood group B, not 18.3. Continuous data, however, is what we usually measure or weigh, so the number is not discrete; it is continuous. If we ask someone "what is your weight?", the answer might be 67 kg, but on a measuring scale it is possible to get 67.3 kg, or even 67.35 kg. So the values run continuously, and this is what characterizes continuous data.

Continuous data is used to record measurements of individuals that can take any value within an acceptable range. What do we mean by within an acceptable range? It means I couldn't say that the weight of a student like you is 1000 kg; it is known to be less than 100 kg, and we couldn't say that his weight is 15 kg either. The acceptable range belongs to the specific measurement we are concerned with. Another example is the range of your secondary certificate grades: it is between 85 and 100, so no one would have 105 or 65. The acceptable range varies from one source of data to another. Take the pulse rate: usually the acceptable range for individuals like you is between 60 and 80 beats per minute, and for the doctor it will be between 80 and 100. And they say in medicine that the number of your heartbeats is fixed, so the faster your heart beats, the closer you are to the end of your life.

Most of the time the data we collect describes the people we are concerned with: the weights, the grades, the pulse rate, the blood pressure, the blood group, the income or whatever we want. So it gives a good description of the population we are concerned with.
And this is why the relevant statistics are called descriptive statistics: they describe the population we are concerned with, the JUST students
for example, if we are concerned with their distribution across faculties: we have the medicine faculty, engineering faculty, agriculture faculty, nursing faculty, etc. The number of students in each faculty describes the distribution of the students according to faculty. We may also have other descriptions, for example the distribution according to gender (males or females) or according to year: 1st year, 2nd year, 3rd year and 4th year students, and in the medical and dental schools there are also 5th and 6th year students. So this is what we call descriptive statistics.

Slide (20): First look at the data

Our goal is to show you how to get a first look at the data and get ready for more elaborate procedures. It is a numerical summary of the data, and you should know that descriptive statistics should be clear and easily interpreted. What do we mean by clear? When I talk about descriptive statistics, I should state what I want: do I want to describe the students in JUST according to their year level, according to their nationality (they say we have 57 nationalities in JUST), according to their blood groups, and so on? So we have to be clear, and this should be stated. The categorization of students should follow this description, and it should cover all cases. When I say according to year level, we have 1st, 2nd, 3rd, 4th, 5th and 6th year, and no more than that. According to degree level, it is either the bachelor's degree or the graduate degree. We shouldn't need any other categories, and nobody should need to ask a question like "what about those in the 7th year or the 9th year?" Do we have people like them? If we do, then we should include them in our descriptive statistics.

Slide (21): Measures of central tendency

Now in the data we have what we call measures of central tendency,
so what do we mean by central tendency? It means that most of the numbers lie around a certain value. For example, when the doctor gives the M391 test (which is our course, by the way), one of the students will ask: what is the students' average? The average is one of the measures of central tendency. Descriptive data, or any data about a group except for categories, tends towards the center, and this is what we call the central tendency. But we have to be careful with measures of central tendency, the mean for example.

The Mean is the arithmetic average. For example, if 3 people were in hospital for 8, 10 and 30 days respectively, what is the mean time? I add all the values and divide by how many there are, so the mean is 16 days (48/3).

We have to be careful with the mean, because sometimes it is misleading. For example, a 6-year-old girl wants to have a birthday party, so she invites friends of her own age, and one of her friends brings her grandmother with her, and the grandmother's age is 90. Then we have 6 friends and our girl (6 + 1 = 7 children), each 6 years old, so 6 × 7 = 42, and with the grandmother's 90 we get 42 + 90 = 132. We divide 132 by the number of people at the party, which is 8, and this makes the average age of those at the birthday party 16.5 years. It is misleading to say that the average age at the birthday party is about 16 years, so we have to be careful about this.

This is why we have another measure of central tendency, called the median. The Median is the value for which 50% of the measurements are higher and 50% are lower, so it divides the data into two groups, higher and lower. When we calculate the median, what is the first step? We arrange the data in ascending or descending order. Then there are two cases: an odd number of data values and an even number.
An odd number of values would be something like 17, 23, 29 or 31, and with an odd number the median is the value in the middle. Let's say we have 31 data values; what is the median after arranging them in
ascending or descending order? It will be data value number 16, because with the 16th we have 15 values above it and another 15 below it. That is the odd case. However, if we have an even number of data values, like 32, then after arranging the data we won't have one number in the middle; we will have two, so we add them and divide by 2, and the result is the median. So the first step is to arrange the data in ascending or descending order, and then we look at the middle value. As a result, if the number of data values is odd the median is a single number, and if it is even we have two middle numbers, which we add together and divide by 2.

There are equations that we usually use. If the number is odd, like 31, we use $\text{rank} = (n + 1)/2$, where $n$ is the number of data values, here 31, so $(31 + 1)/2 = 16$; and this is why we say $(n + 1)$. In the case of an even number of data values, the median lies between the middle two values.

So the mean is the arithmetic average, and the median is the value that divides the data into two equal parts, 50% above it and 50% below it. The Mode is the most common value; we don't usually use it. For example, at our birthday party we have 6, 6, 6, 6, 6, 6, 6 and 90, so the mode is 6, while the average is 16.5 and the median is 6.

Slide (22): Mean calculation

Now I think you don't have any problem with calculating the mean.

Slide (23): Measures of dispersion

Now we have another way of describing the data. To deal with differences like those in the birthday example, we have what we call the measures of dispersion. Dispersion is the variation around the mean, so the more the values vary around the mean, the larger the measures of dispersion.
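
As a recap of the mean, median and mode, here is a small Python sketch using the hospital-stay and birthday-party numbers from the lecture. The helper `median_by_rank` is a hypothetical name introduced only for illustration; it follows the sort-then-take-the-middle procedure described in these notes, and the standard-library `statistics` functions serve as a cross-check.

```python
from statistics import mean, median, mode

hospital_days = [8, 10, 30]      # days in hospital, from the lecture
party_ages = [6] * 7 + [90]      # seven 6-year-olds and one grandmother


def median_by_rank(values):
    """Median found by sorting and picking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the value at rank (n + 1) / 2 (ranks start at 1).
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the two middle values.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(mean(hospital_days))         # 48 / 3 = 16
print(mean(party_ages))            # 132 / 8 = 16.5
print(median_by_rank(party_ages))  # middle two values are 6 and 6, so 6.0
print(mode(party_ages))            # most common value: 6

# With 31 sorted values, the median is the 16th one.
assert median_by_rank(list(range(1, 32))) == 16
# The hand-written helper agrees with the library function.
assert median_by_rank(party_ages) == median(party_ages)
```

The single age of 90 pulls the mean up to 16.5, while the median and the mode both stay at 6, which is why the mean alone can mislead.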

We have 3 measures of dispersion, and sometimes we also use the quartiles, especially the difference between the 3rd quartile and the 1st quartile. But these are the most commonly used measures of dispersion, meaning the variation around the mean:
1) Range.
2) Variance.
3) Standard deviation.

The Range is the difference between the highest value and the smallest value in our data, so range = highest value - smallest value. How do we calculate the range? We identify the lowest value and the highest value and subtract the lowest from the highest. The Variance is the sum ($\sum$) of the squares of the differences between each value and the mean, divided by the sample size minus one, as we will see. The Standard deviation is the square root of the variance.

Slide (24): Formulae for measures of variation

Now we will see how to calculate them. The range is the difference between $X_{\max}$ and $X_{\min}$, so $\text{range} = X_{\max} - X_{\min}$. For the sample we need the sum of squares: the sum ($\sum$) of the squared differences between each value and the mean. The value is $X_i$ and the mean is $\bar{X}$, so the sum of squares is $\sum (X_i - \bar{X})^2$.

The sample variance is usually abbreviated $s^2$, and it is the sum of the squared differences between each value $X_i$ and the mean $\bar{X}$, divided by $(n - 1)$: $s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}$. Let's say that the mean is 8 and the value is 6; then $(6 - 8) = -2$, and its square is 4. We square the differences in order to overcome the minus values; otherwise, if we add up all the differences between the measured values and the mean, we get 0. So this is the sample variance.

Now why do we divide the sum by $(n - 1)$? This is important, because it will be with us for the rest of the
course: $(n - 1)$ is what we call the degrees of freedom. Remember when we said we have 100 people divided into categories according to their blood groups: we know three values, and so the fourth one is known automatically. This is what we call the degrees of freedom, which is always $(n - 1)$. I could give these three values any numbers I want, for example 18, 32 and 30. Someone else might not agree with me and say that according to her sample, or the group she studied, the values are 20, 30 and 25. So she gives different values, but the fourth one is fixed anyway, and we can leave it without mentioning it, because it is determined by the three values she gave. The fourth value according to the doctor is $100 - (18 + 32 + 30) = 100 - 80 = 20$, and the fourth value according to her group is $100 - (20 + 30 + 25) = 100 - 75 = 25$. So, as we said before, the fourth value is fixed because it depends on the other 3 values. This is why the degrees of freedom equal $(n - 1)$: we can assign any values to all but one, and the last one is determined by the numbers we gave.

So again, the sample variance is the sum of the squares of each value $X_i$ minus the mean (the arithmetic average) of the data $\bar{X}$, divided by $(n - 1)$: $s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}$. The variance of the population is $\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$. Notice that $\mu$ is the mean of the population while $\bar{X}$ (X bar) is the mean of the sample; also notice that $N$ is the population size, while $n$ is the sample size. With the population we divide by $N$ and not by $(N - 1)$; do you know why? Because the population usually is very large, so it wouldn't make much difference whether we divide by the whole population number or by the population number minus 1.
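
A minimal Python sketch of the range, the sum of squares and the two variances follows. It reuses the three hospital stays (8, 10 and 30 days) from the mean example purely as illustrative data; this is an assumption, since the lecture does not work a variance example, and the variable names are also illustrative.

```python
from statistics import pvariance, variance

# Illustrative data (assumption): the hospital stays from the mean example.
values = [8, 10, 30]
n = len(values)
mean_value = sum(values) / n                      # 16.0

# Range: highest value minus lowest value.
data_range = max(values) - min(values)            # 22

# Differences from the mean add up to 0, which is why we square them.
deviations = [x - mean_value for x in values]     # [-8.0, -6.0, 14.0]
sum_of_squares = sum(d ** 2 for d in deviations)  # 64 + 36 + 196 = 296.0

# Sample variance divides by (n - 1); population variance divides by N.
sample_variance = sum_of_squares / (n - 1)        # 148.0
population_variance = sum_of_squares / n          # about 98.67

# Cross-check against the standard library.
assert sum(deviations) == 0
assert sample_variance == variance(values)
assert abs(population_variance - pvariance(values)) < 1e-9
```

With only three values, dividing by 2 instead of 3 changes the result a lot; with thousands of values the two divisors give almost the same answer.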

Let's say that the population of students in JUST is 20000; it wouldn't make much difference whether I divide by 20000 or by 19999, and that is why we divide by $N$.

* Remember this information very well, because the doctor was going to give a bonus of 0.25 extra marks to the student who answered; sadly, no one answered.

Now the Standard deviation is the positive square root of the variance; a square root can be + or -, so we take the positive value. So the standard deviation of the sample is the square root of $s^2$, which equals $s$, and the standard deviation of the population is the square root of $\sigma^2$, which equals $\sigma$.

Now the Coefficient of variation is calculated by dividing the standard deviation by the mean and then multiplying by 100. Sometimes two groups have the same standard deviation, so the same variation around the mean, but different means; in order to show the difference between this group of data and that group, we have what we call the coefficient of variation. Again, coefficient of variation = (standard deviation / mean) × 100.

There is something in statistics called the standard error, and it always relates to samples. Whenever we take a sample, which is a small group of people, we collect data from them, generalize the results to the population and make inferences about the population. The Standard error of the mean = the standard deviation / the square root of the sample size: $SE = s / \sqrt{n}$.
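
The three formulas above can be sketched in a few lines of Python. The data are again the three hospital stays from the mean example, reused here only as an illustration (an assumption, not a lecture example), and `standard_error` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, stdev

# Illustrative data (assumption): the hospital stays from the mean example.
values = [8, 10, 30]

# Sample standard deviation: positive square root of the sample variance (148).
s = stdev(values)                              # about 12.17

# Coefficient of variation: standard deviation relative to the mean, in percent.
coefficient_of_variation = s / mean(values) * 100   # about 76.03


def standard_error(sd, n):
    """Standard error of the mean: standard deviation / sqrt(sample size)."""
    return sd / sqrt(n)


# For the same standard deviation, a larger sample gives a smaller standard error.
se_n10 = standard_error(s, 10)                 # about 3.85
se_n20 = standard_error(s, 20)                 # about 2.72
assert se_n20 < se_n10
```

The sample sizes 10 and 20 are hypothetical; they only show how the divisor $\sqrt{n}$ shrinks the standard error as the sample grows.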

Now which is higher, the standard error of a small sample or of a large sample? The standard error of the small sample is higher, because the larger the sample size is, the closer we get to the
population size, so the better our sample represents the population. If I have this class of 100 students and I take a sample of 10 students, the standard error would be the standard deviation / the square root of 10, and the square root of 10 is approximately 3.2. Now if I take a sample of 20, then the standard error will be smaller, because I divide the standard deviation, which is the same as before when the sample was 10, by the square root of $n$, which is here the square root of 20, approximately 4.5.

Slide (25): Line histogram showing distribution of HR in women

This is what we call data presentation, and the histogram distribution. Here is an example of how we present our data; after collecting the data, we have to make a data presentation. This is the pulse rate between 175 and 105, and this is what we call reasonably correct or appropriate. This heart rate is usually for very old women, like grandmothers; it also works for our 90-year-old grandmother. And this is what we call a line histogram, and we also have what is called a bar histogram.

On Monday we will do exercises on how to calculate the mean, the median, the mode, the standard deviation, the variance and the coefficient of variation; we will take examples and tables and calculate them. And now you are free to go ^_^

Thank you

The End

Done by: Raja Amin El-haddad

Life in lines
Life makes everyone wonder and say wow !!!
Every day is a big surprise
And every day is a big gift
So let's fly above the clouds
Carrying happiness and leaving sadness
Because that's life, and we are all travelling on its journey
So leave the anger
And make sure you are a good passenger
