Learning Goals
Why we use graphs
What are the various types of graphs for presenting numerical data
Which graph to use in which scenario
Graphical distortion
Inferences
What can you infer from the raw data? Practically nothing! How long would it take you to come up with answers?
Probably the first thing you would do is count the observations for each age and note each count next to the corresponding age. That gives you a frequency table!
Frequency table: just as for categorical data, a frequency table for discrete numerical data lists each possible value (either individually or grouped into intervals), the associated frequency and sometimes the corresponding relative frequency.
Note: Age is, in theory, a continuous variable, as it can assume any value in an interval. Here, however, the variable is age in whole years, which is discrete.
But there are 44 distinct values in the data! An ungrouped frequency table would therefore have 44 rows and one frequency column.
In general, for discrete data, decide whether or not to group depending on the need and the size of the data. For continuous data, grouping is necessary.
How to make the groups: find the maximum and the minimum, then choose a suitable class width = (max - min)/(desired number of classes). If the result is a decimal, round it up to the next integer; if it is already a whole number, also take the next integer.
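The grouping steps above can be sketched in R as follows. This is only an illustration: the ages are simulated rather than taken from the lecture's data set, and the choice of 10 classes and the variable names are assumptions.

```r
# Hypothetical data: 507 simulated ages in whole years (not the lecture's data)
set.seed(1)
age <- sample(18:66, size = 507, replace = TRUE)

desired_classes <- 10   # assumed number of classes

# Class width = (max - min) / desired classes, moved up to the next integer.
# floor(x) + 1 gives the next integer for decimals and for whole numbers alike.
class_width <- floor((max(age) - min(age)) / desired_classes) + 1

# Class boundaries start at the minimum and extend past the maximum
breaks <- seq(min(age), by = class_width, length.out = desired_classes + 1)

# Grouped frequency table; include.lowest keeps the minimum in the first class
freq_table <- table(cut(age, breaks = breaks, include.lowest = TRUE))
print(freq_table)
```

Because the width is always strictly larger than (max - min)/(number of classes), the last boundary lies above the maximum, so every observation falls into exactly one class.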
Applied Statistics and Computing Lab
Types of Graph
Graphs for presenting Numerical data:
Bar chart (for discrete variable)
Histogram
Frequency Polygon
Ogive
Line Diagram
Histogram: graph of the frequency distribution of continuous data.
Suppose we are given the ages of 507 people in continuous form (age is no longer reported in whole years and can take any value in an interval of the real line). We draw a histogram instead of a bar chart.
A histogram is similar to a bar chart for numerical data, except that there are no gaps between the bars.
If the class intervals are equal, the height of each rectangle represents the class frequency, so the area of the histogram is proportional to the total frequency.
If the class intervals are not equal, the height represents the frequency density (= class frequency / class width), so that the area of each rectangle equals the class frequency. If relative frequencies are used instead (height = relative frequency / class width), the total area enclosed by the histogram is 1.
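The area property for unequal class intervals can be checked numerically. The sketch below uses simulated ages and an arbitrarily chosen set of unequal class boundaries, both of which are assumptions made for illustration; it relies on the freq = FALSE option of hist(), which plots relative frequency divided by class width.

```r
# Hypothetical data: 507 simulated right-skewed ages (not the lecture's data)
set.seed(2)
age <- 17 + rexp(507, rate = 1 / 8)
age <- pmin(age, 67)   # cap at 67 so that every value falls inside a class

# Unequal class intervals, chosen arbitrarily for this illustration
breaks <- c(17, 20, 25, 35, 50, 67)

# freq = FALSE makes the bar heights densities:
# relative frequency of the class divided by its width
h <- hist(age, breaks = breaks, freq = FALSE, right = TRUE,
          main = "Histogram with unequal class intervals", xlab = "age")

# Area of each rectangle = height * width = relative frequency of the class
areas <- h$density * diff(h$breaks)
total_area <- sum(areas)
print(total_area)
```

Up to floating-point rounding, the total area should come out as 1, and each individual area should match the class's share of the 507 observations.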
Inferences:
Maximum concentration is in the age group 20-25.
The histogram gives an idea of the shape of the distribution: for example, we can say that the distribution of ages is not symmetric but highly right skewed (see the module on Skewness and Kurtosis).
It also shows the extent of spread or variation (see the module on Dispersion).
What is bin width? Bin width refers to the length of each class interval.
How to choose the bin width? Well, R chose one for you! By default, R picks the number of classes using Sturges' rule and then rounds the breakpoints to convenient values.
Some other rules of thumb: Doane's formula, the Rice rule, Scott's rule and the Freedman-Diaconis rule. For Scott's rule and the Freedman-Diaconis rule, all you need to do is name the rule in the breaks= option of hist() in R (see the histogram in the R code slide); for the other rules, compute the number of classes yourself and pass that number to breaks=.
For more details on these rules: http://en.wikipedia.org/wiki/Histogram
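A short sketch of how these rules can be tried out in R. The data are simulated (an assumption), and the Rice rule is written out by hand here because hist() has no named option for it.

```r
# Hypothetical data: 507 simulated right-skewed ages (not the lecture's data)
set.seed(3)
age <- 17 + rexp(507, rate = 1 / 8)

# Number of classes suggested by the three rules that ship with R
k_sturges <- nclass.Sturges(age)   # ceiling(log2(n) + 1)
k_scott   <- nclass.scott(age)
k_fd      <- nclass.FD(age)        # Freedman-Diaconis
print(c(Sturges = k_sturges, Scott = k_scott, FD = k_fd))

# The same rules can be named directly in breaks=
hist(age, breaks = "Sturges")
hist(age, breaks = "Scott")
hist(age, breaks = "FD")

# Rice rule computed by hand: about 2 * n^(1/3) classes
k_rice <- ceiling(2 * length(age)^(1 / 3))
# A single number in breaks= is only a suggestion; R still rounds the
# breakpoints to convenient values, so the final count may differ slightly
hist(age, breaks = k_rice)
```

Comparing the resulting plots side by side is a quick way to see how strongly the choice of bin width changes the apparent shape of the distribution.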
A frequency polygon is formed by plotting the frequency of each class against its midpoint and joining the points by straight lines.
To get a closed polygon, we take two additional classes, one at each end, that have zero frequency, so the polygon meets the horizontal axis at the midpoints of these classes.
In effect, if superimposed on a histogram, it joins the midpoints of the tops of the rectangular bars by straight line segments.
We draw the frequency polygon for the ages data over the histogram itself.
Inferences
But is there any additional information you can derive from a frequency polygon, over and above what the histogram gives? Not really! In fact, the histogram gives more information: it shows each class interval in full, whereas a frequency polygon only shows the midpoint. To appreciate this fully, look at a frequency polygon without the corresponding histogram.
In constructing the polygon we have made a simplification by plotting each class frequency at the midpoint of its class interval, thereby losing more information.
What can you infer? Can you infer better? Which region should have the higher insurance demand?
(Similar questions arise for region B.)
The insurance company wants to design separate products for the age groups 20 or less, 20-50, and 50 and above, plus a few additional schemes for people aged 60 and over.
Clearly, it needs to know the cumulative frequency for each age group!
A cumulative frequency curve, or ogive, is obtained by plotting cumulative, rather than individual, class frequencies. There may be two types of ogives:
A curve showing the number of observations equal to or greater than the lower class limit of each class, referred to as the more than type ogive.
A curve showing the number of observations equal to or less than the upper class limit of each class, referred to as the less than type ogive.
The black plot gives the less than type ogive. The purple plot gives the more than type ogive.
From the diagram, the insurance company readily has its answers:
From the less than type ogive, observe that there are 114 people aged less than 22, so around 100 people are aged less than 20. There are 491 people aged less than 52, so slightly fewer than 491 people are aged less than 50.
From the more than type ogive, we infer that there are very few people above 60, something around 5.
Q. Draw the ogives for Region B and try to answer the above questions. Compare with the age distribution for Region A.
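The two kinds of cumulative frequency can be seen on a small worked example. The class boundaries and frequencies below are invented for illustration and are not the lecture's data.

```r
# Hypothetical class boundaries and frequencies (not the lecture's data)
class_ends <- c(17, 27, 37, 47, 57, 67)
frequency  <- c(40, 30, 15, 10, 5)   # one count per class

# Less than type: observations at or below each class boundary
less_than <- c(0, cumsum(frequency))

# More than type: observations at or above each class boundary
more_than <- c(rev(cumsum(rev(frequency))), 0)

# Both columns are paired with the same class boundaries
print(cbind(class_ends, less_than, more_than))

# Draw the two ogives on one graph
plot(class_ends, less_than, type = "b", ylim = c(0, sum(frequency)),
     main = "Ogives", xlab = "Class boundaries",
     ylab = "Cumulative frequency")
lines(class_ends, more_than, type = "b", col = "purple")
```

At every class boundary the two cumulative counts add up to the total number of observations, so on the graph the two ogives cross at half the total frequency.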
Line Diagram
The ages data is an example of cross-section data. Use any of the above diagrams, depending on the nature of the cross-section data.
But what if we are given time series data, that is, a series of observations, one for each time point? For example, consider data on the percentage of households with computers for the following years:
Year: 1985, 1990, 1994, 1995, 1998, 1999, 2000
How do we represent this graphically? We need to represent the value corresponding to each given year.
Source: Falling Through the Net: Toward Digital Inclusion (U.S. Department of Commerce, October 2000)
Line Chart
Plot the years on the horizontal axis and mark the value corresponding to each year on the vertical axis. Join the points by line segments. We have our line graph ready!
Think: Can we construct a histogram, ogive or bar chart with this data? Why or why not?
A line diagram is meant for representing chronological data. It exhibits the relationship of the variable with time.
The chart shows an increasing trend over the years: from 1985 to 2000, the percentage of households with computers rises consistently.
From under 10% in 1985 it has climbed to over 50% in 2000, an increase of more than 400% over that period.
A line diagram is useful for analysing the time trend, that is, the long-term movement of time series data.
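The 400% figure can be checked with a one-line calculation. The sketch below uses the two round bounds quoted above (10% and 50%) rather than the actual data values, so it gives a lower bound for the true increase.

```r
# Round bounds quoted in the text, not the actual data values
share_1985 <- 10   # percent of households (actual value is below this)
share_2000 <- 50   # percent of households (actual value is above this)

# Relative (percentage) increase
pct_increase <- (share_2000 - share_1985) / share_1985 * 100

# Absolute change, measured in percentage points
point_increase <- share_2000 - share_1985

print(c(percent = pct_increase, points = point_increase))
```

Note the difference between the two measures: the share rose by about 40 percentage points, which corresponds to a relative increase of about 400%.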
Seeing this graph, we might conclude that poverty has been falling in this country, since the number of people above the poverty line is rising.
Now draw a line chart showing the relative share of people above the poverty line.
We see that the relative share of people above the poverty line is actually declining, and thus the relative share of people below the poverty line is actually rising.
Our earlier conclusion was based on an inadequate representation of the data and was therefore fallacious.
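The distortion can be reproduced with a few made-up numbers. All figures below are hypothetical and chosen only to show how a rising count can go together with a falling share.

```r
# Hypothetical figures in millions, invented purely for illustration
year       <- c(1990, 1995, 2000, 2005)
above_line <- c(50, 55, 60, 65)      # people above the poverty line
population <- c(80, 95, 110, 125)    # total population

# Relative share of people above the poverty line, in percent
share_above <- above_line / population * 100
print(round(share_above, 1))

# Two line charts: the absolute numbers rise, the relative share falls
par(mfrow = c(1, 2))
plot(year, above_line, type = "b",
     main = "Number above poverty line", xlab = "Year", ylab = "Millions")
plot(year, share_above, type = "b",
     main = "Share above poverty line", xlab = "Year", ylab = "Percent")
```

The population grows faster than the number of people above the poverty line, so the first chart slopes upward while the second slopes downward.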
R Codes
Histogram

# Read the continuous ages data
data=read.csv('ages.continuous.csv',header=TRUE,sep=',')
View(data)
age=data$age
max(age)
colors=c("red", "bisque", "darkslategray", "violet", "orange", "blue", "pink", "cyan","brown","cornsilk")

# hist draws a histogram; right=TRUE means right-closed, left-open intervals
hist(age,right=TRUE,col=colors)

# To specify bin widths on your own
bins=seq(17,67,by=5)
hist(age,right=TRUE,breaks=bins,col=colors)

# Example of a histogram with too small a bin width
bins=seq(17,67,by=2)
hist(age,right=TRUE,breaks=bins,col=colors)

# Example of a histogram with too large a bin width
bins=seq(17,67,by=25)
hist(age,right=TRUE,breaks=bins,col=colors)

# Drawing a frequency polygon over a histogram
bins=seq(17,67,by=10)
hist(age,right=TRUE,breaks=bins,col=colors,xlim=c(10,75)) # draw the histogram
# Class midpoints 22,...,62 plus one zero-frequency class at each end (midpoints 12 and 72);
# include.lowest=TRUE counts a value equal to 17 in the first class, as hist does
lines(c(12,seq(22,62,by=10),72),c(0,as.vector(table(cut(age,bins,include.lowest=TRUE))),0),lwd=2) # draw the frequency polygon
R Codes
Frequency Polygon

# data is assumed to hold one age column for each region
RegionA.age=data$RegionA.age
RegionB.age=data$RegionB.age
max(RegionA.age)
min(RegionA.age)
max(RegionB.age)
min(RegionB.age)
bins.A=seq(17,67,by=10)
bins.B=seq(15,75,by=10)

# Class frequencies for each region
freq.A=as.vector(table(cut(RegionA.age,bins.A)))
freq.B=as.vector(table(cut(RegionB.age,bins.B)))

# To draw two frequency polygons on the same graph
plot(c(12,seq(22,62,by=10),72),c(0,freq.A,0),type="b",main="Frequency distribution of age",xlab="age",ylab="frequency",xlim=c(10,80),ylim=c(0,270))
lines(c(10,seq(20,70,by=10),80),c(0,freq.B,0),lwd=2,col="violet")

Line Chart

data=read.csv('Households with computer.csv',header=TRUE,sep=',')
household.comp=data$Households.with.computer.percentage
Year=data$Year
# Plot the percentages against the years and join the points by line segments
plot(Year,household.comp,type="b",col="blue",xlab="Year",ylab="Percentage of Households with Computer",xlim=c(1985,2000),ylim=c(5,65),main="Line Chart")
R Codes
Ogives

# data is assumed to hold the ages in its first column
min(data)
max(data)
NumberOfClasses = 10
ClassInterval = (67 - 17)/NumberOfClasses
ClassInterval
ClassEnds = seq(17,67,ClassInterval)
classes=cut(data[,1], breaks=ClassEnds)
FrequencyDistribution = table(classes)
CumulativeFrequencies = c(cumsum(FrequencyDistribution))
cbind(CumulativeFrequencies)

# Less than type ogive: cumulative frequency plotted against the class ends
plot(ClassEnds,c(0,as.vector(CumulativeFrequencies)),type="b",xlim=c(10,70),ylim=c(0,700),main="Ogives",xlab="Class Intervals",ylab="Cumulative Frequency of Age")

# More than type ogive: frequencies accumulated from the highest class downwards
cbind(FrequencyDistribution)
Frequency=as.vector(FrequencyDistribution)
More.than.cum.freq=cumsum(rev(Frequency))
Upper.limit=rev(ClassEnds) # class ends in descending order, starting from 67
lines(Upper.limit,c(0,More.than.cum.freq),type="b",col="violet")
Thank you