Graphical Presentation 2

GRAPHICAL PRESENTATIONS 2 (REPRESENTATION OF NUMERICAL DATA)
Applied Statistics and Computing Lab Indian School of Business
Learning Goals
Why we use graphs
What are the various types of graphs for presenting numerical data
Which graph to use in which scenario
Graphical distortion
Why use Graphs: An example

A private insurance firm is interested in marketing its insurance products in region A. To target customers precisely, it needs to know the age distribution of the region.
Questions:
In which age group does the highest number of people lie?
The firm also needs to divide the population into 4 different age groups, in order to sell 4 different products.
It has the following data.
Data on ages of 507 people

23,21,23,26,22,27,29,37,55,53,21,19,20,18,32,20,28,19,23,33,40,28,24,36,23,29,34,31,34,42,45,23,46,26,30,25,20,37,24,36,28,29,23,23,25,24,37,42,30,28,29,39,26,
20,21,20,19,20,40,25,45,28,21,22,19,24,24,20,29,27,27,40,43,22,22,21,24,23,23,45,20,25,25,33,21,23,20,34,20,41,25,32,24,65,28,25,38,23,22,20,35,34,67,38,33,26,
25,52,21,32,43,24,28,62,45,40,21,23,30,20,28,41,32,26,37,38,27,23,50,25,23,43,33,22,26,37,32,23,37,23,27,23,27,24,21,25,23,23,46,34,25,29,45,44,35,55,25,31,19,
45,34,19,20,29,33,37,21,23,51,31,27,27,37,25,37,33,25,29,25,20,25,28,24,31,25,27,23,20,28,40,21,62,44,49,34,25,29,19,20,20,26,19,36,34,24,27,23,20,28,40,21,62,
44,49,34,25,29,19,20,20,26,19,36,34,24,27,22,22,48,21,27,33,34,54,25,35,22,21,41,23,19,29,27,36,21,20,20,24,35,33,25,45,55,49,30,28,25,23,26,21,26,32,32,32,35,
19,26,22,23,25,38,30,43,60,32,26,23,24,21,28,25,20,64,39,27,32,23,24,23,29,44,20,24,42,27,43,37,20,47,45,20,28,21,37,27,26,22,21,62,27,27,22,22,52,42,30,19,19,
19,24,21,36,32,52,26,56,30,23,21,44,37,51,38,23,44,26,23,20,44,25,18,22,35,24,25,23,22,24,26,26,28,34,24,33,46,51,25,19,35,19,19,20,41,33,44,19,29,35,33,22,33,
44,29,46,19,30,26,20,32,20,27,22,40,42,29,31,22,29,36,37,25,46,25,43,43,24,24,19,46,29,26,32,29,34,26,34,22,25,41,38,21,34,37,56,28,35,29,22,22,24,36,40,40,37,
23,34,20,23,40,20,30,32,30,21,39,37,22,39,49,24,20,40,24,39,32,24,22,20,27,21,26,28,26,18,30,22,30,18,52,25,28,42,23,41,32,22,24,25,27,24,27,31,35,21,36,20,23,
19,25,31,32,40,41,36,43,34,26,29,23,45,33,29,29,45,48,19,38,26,48,22,32,44,44,19,32,30
Inferences
What can you infer from the raw data? Practically nothing! How long would it take you to come up with answers?
Probably the first thing you would do is count the observations for each age and note down each age along with its count. That makes a frequency table for you!
Frequency table: just as for categorical data, a frequency table for discrete numerical data lists each possible value (either individually or grouped into intervals), the associated frequency and sometimes the corresponding relative frequency.
Note: Age is, in theory, a continuous variable, as it can assume any value. Here, however, the variable is age in whole years, which is discrete.
But there are 44 distinct values in the data! Hence the frequency table has 44 rows and one frequency column.
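As a minimal sketch of this counting step in R, the snippet below tabulates only the first 16 ages from the data above (a subset chosen here for brevity, so the counts are not those of the full sample). The variable names are illustrative.

```r
# First 16 ages from the data above (a subset, for illustration only)
ages <- c(23, 21, 23, 26, 22, 27, 29, 37, 55, 53, 21, 19, 20, 18, 32, 20)

# Frequency of each distinct age
freq <- table(ages)

# Relative frequency: each count divided by the number of observations
rel_freq <- prop.table(freq)

# One row per distinct age
freq_table <- data.frame(age = as.integer(names(freq)),
                         frequency = as.vector(freq),
                         relative_frequency = as.vector(rel_freq))
print(freq_table)
```

Even this small subset already needs one row per distinct age, which is why grouping becomes attractive for the full data.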
So, list individually or group?

List individually or group into intervals?
In the ages data, there are 44 distinct values. If we list them individually, we get a table with 44 rows, which is cumbersome to interpret.
The insurance company wants to sell 4 different products catering to the needs of 4 different age groups, so it is interested in 4 age categories.
In general, for discrete data, decide whether to group or not depending on the need and the size of the data. For continuous data it is necessary to group.
How to make groups? Find the maximum and the minimum. Choose a suitable class width = (max - min)/(desired number of classes), rounded up to the next integer if it is a decimal. If it is already an integer, take the next integer.
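The class-width rule can be sketched in R as follows. The function name class_width is hypothetical, the minimum 18 and maximum 67 are assumed from the listed ages, and the starting point 17 is an assumption taken from the class limits used in these slides.

```r
# Class width: (max - min) / number of classes, moved up to the next integer
# (also when the quotient is already a whole number)
class_width <- function(min_value, max_value, n_classes) {
  floor((max_value - min_value) / n_classes) + 1
}

# Assumed minimum 18 and maximum 67, with 4 desired classes
width <- class_width(18, 67, 4)   # (67 - 18) / 4 = 12.25, so the width is 13

# Integer class limits, starting from 17 (assumed starting point)
lower <- 17 + width * (0:3)
upper <- lower + width - 1
print(data.frame(lower, upper))
```

Using floor(...) + 1 covers both cases of the rule in one expression: a decimal quotient is rounded up, and a whole-number quotient moves to the next integer.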
Construction of Frequency table

Class Interval    Frequency
17-29             298
30-42             142
43-55             56
56-68             10
Do we have the answers in a minute from this table?

The age group 17-29 has the maximum number of people.
We also have the exact number of people in each age group.
The same data can be represented pictorially in a number of ways!
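A grouped frequency table of this kind can be sketched in R with cut() and table(). The snippet uses only the first 16 ages from the data slide (an illustrative subset, so the counts differ from the table above); the break points are chosen here so that whole-year ages fall into the classes 17-29, 30-42, 43-55 and 56-68.

```r
# First 16 ages from the data slide (a subset, for illustration only)
ages <- c(23, 21, 23, 26, 22, 27, 29, 37, 55, 53, 21, 19, 20, 18, 32, 20)

# Right-closed intervals (16,29], (29,42], (42,55], (55,68]
breaks <- c(16, 29, 42, 55, 68)
labels <- c("17-29", "30-42", "43-55", "56-68")

groups <- cut(ages, breaks = breaks, labels = labels)
freq <- table(groups)
print(freq)

# Class with the highest frequency
modal_class <- names(freq)[which.max(freq)]
print(modal_class)
```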
Types of Graph
Graphs for presenting Numerical data:
Bar chart (for a discrete variable)
Histogram
Frequency polygon
Ogive
Line diagram
Bar Chart (Numerical Data)

A bar chart is a graph of the frequency distribution, similar to the bar chart for categorical data.
Each frequency or relative frequency is represented by a rectangle centered over the corresponding value (or range of values for grouped data).
The area of each rectangle is proportional to the corresponding frequency or relative frequency.
We could name the groups group 1, group 2, group 3 and group 4 and plot the corresponding frequencies, exactly as in the case of categorical data (Exercise).
Conceptually, therefore, there is no difference between the two.
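One possible way to attempt the exercise in R is sketched below, using the grouped frequencies from the frequency table slide. The axis labels and title are illustrative choices.

```r
# Grouped frequencies from the frequency table slide
freq <- c("17-29" = 298, "30-42" = 142, "43-55" = 56, "56-68" = 10)

# barplot() draws one bar per group and returns the bar midpoints
mids <- barplot(freq,
                xlab = "Age group",
                ylab = "Frequency",
                main = "Bar chart of grouped ages")
```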
Histogram (for continuous numerical data)
A histogram is a graph of the frequency distribution of continuous data.
Suppose we are given the ages of 507 people in continuous form (age is no longer reported in whole years and can take any value on the real line). We then draw a histogram instead of a bar chart.
It is similar to the bar chart for numerical data, except that there are no gaps between the bars.
When the class intervals are equal, the height of each rectangle represents the class frequency, so that the area of the histogram is proportional to the total frequency.
When the class intervals are not equal, the height represents the frequency density (= class frequency / class width), so that the area of each rectangle equals the class frequency. If the relative frequency is used instead (height = relative frequency / class width), the total area enclosed by the histogram is 1.
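The unequal-interval case can be checked numerically. The sketch below uses a small made-up vector of continuous ages and made-up unequal break points (neither comes from the slides) and relies on hist() with plot = FALSE, which returns the class counts and densities without drawing.

```r
# Made-up continuous ages and unequal class limits (illustration only)
ages <- c(18.5, 19.2, 21.7, 22.4, 23.1, 24.8, 26.3, 28.9, 31.5, 36.2, 44.0, 58.6)
breaks <- c(17, 25, 30, 40, 70)

h <- hist(ages, breaks = breaks, plot = FALSE)
widths <- diff(h$breaks)

# Frequency density: class frequency / class width
freq_density <- h$counts / widths

# Area under the frequency-density histogram equals the total frequency
area_freq <- sum(freq_density * widths)

# h$density is relative frequency / class width, so its area is 1
area_rel <- sum(h$density * widths)

print(data.frame(lower = breaks[-length(breaks)], upper = breaks[-1],
                 count = h$counts, freq_density, density = h$density))
```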
Inferences:
Maximum concentration is in the age group 20-25.
The histogram gives an idea about the shape of the distribution. For example, we can say that the distribution of ages is not symmetric; it is highly right skewed (see module on Skewness and Kurtosis).
It also shows the extent of spread or variation (see module on Dispersion).
What is bin width? Bin width refers to the length of each class interval.
How to choose bin width? Well, R chose a bin width for you! By default, hist() in R picks the number of bins using Sturges' rule.
Some other rules of thumb: Doane's formula, the Rice rule, Scott's rule and the Freedman-Diaconis rule. In R, Sturges, Scott and Freedman-Diaconis can be requested by name through the breaks= option of hist(); for the other rules, compute the number of bins yourself and pass it to breaks= (see histogram in the R-code slide).
For more details on these rules: http://en.wikipedia.org/wiki/Histogram
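As a rough illustration of two of these rules for a sample of 507 observations, the sketch below computes the suggested number of bins by hand and compares Sturges' rule with R's own helper nclass.Sturges() from the grDevices package. The Rice rule is written out from its usual formula, since base R has no helper for it.

```r
n <- 507

# Sturges' rule: ceiling(log2(n) + 1)
sturges <- ceiling(log2(n) + 1)

# Rice rule: ceiling(2 * n^(1/3))
rice <- ceiling(2 * n^(1/3))

# R's built-in helper only looks at the number of observations
sturges_builtin <- grDevices::nclass.Sturges(seq_len(n))

print(c(sturges = sturges, rice = rice, sturges_builtin = sturges_builtin))
```

Note that hist() treats a suggested number of bins as a hint and rounds the break points to convenient values, so the drawn histogram may show a slightly different number of bars.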
How to Choose optimal bin width?

There is a lot of ongoing research on the optimal bin width.
For all practical purposes, you can just rely on the default bin width option in R!
You can also specify your own bin width in R, for example a bin width of 5.
Just make sure that the number of class intervals is neither too small nor too large (generally between 5 and 20).
We show how to specify various bin widths in R, so that you can choose the one that best suits your purpose.
Histogram with too small a bin width. Problem: it shows too much individual detail and does not allow the underlying pattern, i.e., the frequency distribution of the data, to be seen easily.
Histogram with too large a bin width. Problem: the bins are too wide and do not convey the properties of the distribution.
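A quick way to check a candidate bin width against the "between 5 and 20 classes" guideline is to count the intervals it produces. The sketch below assumes the range 17 to 67 used in the R code slides; the helper name n_bins is hypothetical.

```r
# Number of class intervals produced by a given bin width on [lower, upper]
n_bins <- function(width, lower = 17, upper = 67) {
  length(seq(lower, upper, by = width)) - 1
}

widths <- c(2, 5, 10, 25)
bins <- sapply(widths, n_bins)

# Guideline from the slides: roughly 5 to 20 class intervals
within_guideline <- bins >= 5 & bins <= 20
print(data.frame(width = widths, bins = bins, within_guideline))
```

Under these assumptions a width of 2 gives 25 bins (too many) and a width of 25 gives only 2 (too few), while widths of 5 and 10 stay inside the guideline.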
Frequency Polygon (for representing continuous data)
A frequency polygon is formed by plotting the frequency of each class against its midpoint and joining the points by straight lines.
To get a closed polygon, we take two additional classes, one at each end, that have zero frequency. (The midpoints corresponding to these classes are thus plotted at zero.)
Basically, if superimposed on a histogram, it joins the midpoints of the tops of the rectangular bars by straight line segments.
We draw the frequency polygon for the ages data over the histogram itself.
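The coordinates of a frequency polygon can be sketched in R as below. The class limits 17, 27, ..., 67 follow the R code slides; the ages are only the first 16 values from the data slide, used for illustration.

```r
# First 16 ages from the data slide (a subset, for illustration only)
ages <- c(23, 21, 23, 26, 22, 27, 29, 37, 55, 53, 21, 19, 20, 18, 32, 20)

breaks <- seq(17, 67, by = 10)
width <- 10

# Class frequencies and class midpoints
freq <- as.vector(table(cut(ages, breaks = breaks)))
mids <- breaks[-length(breaks)] + width / 2

# Add one empty class at each end so that the polygon closes on the axis
x <- c(min(mids) - width, mids, max(mids) + width)
y <- c(0, freq, 0)

plot(x, y, type = "b", xlab = "Age", ylab = "Frequency",
     main = "Frequency polygon")
```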
Inferences
But is there any additional information you can derive from a frequency polygon, over and above what the histogram gives? Not really! In fact, the histogram gives more information: it shows the entire class intervals, whereas a frequency polygon only shows the midpoints. To appreciate this fully, look at a frequency polygon without the corresponding histogram.
In the construction we have made a simplification by plotting each class frequency at the midpoint of its class interval, thereby losing more information.
Why use Frequency Polygon?

For comparing two sets of data, the corresponding frequency polygons can be drawn on the same graph. Drawing two histograms on the same diagram for comparison purposes is confusing.
The insurance company is looking at the profitability of investing in two regions, region A and region B. The region with a higher proportion of 50-plus population demands more insurance.
The ages.both.regions.csv data gives the ages of a random sample of 507 people in each of region A and region B.
Draw two histograms on the same diagram and try to compare.
Why use Frequency polygon

Q. What can you infer? Practically nothing, right!
Why Frequency Polygon (Contd)

Draw two frequency polygons on the same diagram and compare.
What can you infer? Can you infer better? Which region should have the higher insurance demand?
Ogives- Cumulative Frequency Curves

Now suppose the insurance company wants answers to more particular questions:
In region A, how many people are 50 years or more?
In region A, how many people are 20 years or less?
In region A, how many people are 60 years or more?
(Similar questions for region B)
It wants to design separate products for the age groups 20 or less, 20-50, and 50 and above, and a few additional schemes for 60-plus people.
Clearly, it needs to know the cumulative frequency for each age group!
Ogives for Region A
A cumulative frequency curve, or ogive, is obtained by plotting cumulative, rather than individual, class frequencies. There may be two types of ogives:
A curve showing the number of observations equal to or greater than the lower class limit of each class, referred to as the more than type ogive.
A curve showing the number of observations equal to or less than the upper class limit of each class, referred to as the less than type ogive.
Successive points are joined by line segments to give the ogive.
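Both kinds of cumulative frequency can be sketched in R with cumsum(). The example below reuses the four-class frequency table from the earlier slide (17-29, 30-42, 43-55, 56-68) rather than the finer classes used for the ogive plot.

```r
# Class limits and frequencies from the frequency table slide
lower <- c(17, 30, 43, 56)
upper <- c(29, 42, 55, 68)
freq  <- c(298, 142, 56, 10)

# Less than type: observations less than or equal to each upper class limit
less_than <- cumsum(freq)

# More than type: observations greater than or equal to each lower class limit
more_than <- rev(cumsum(rev(freq)))

print(data.frame(lower, upper, freq, less_than, more_than))
```

Plotting less_than against upper and more_than against lower, and joining the points, gives the two ogives.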
Ogives- For Region A
The black plot gives the less than type ogive.
The purple plot gives the more than type ogive.
From the diagram, the insurance company readily has the answers.
From the less than type ogive, observe that there are 114 people aged less than 22, which puts the number aged less than 20 at around 100, and 491 people aged less than 52, which puts the number aged less than 50 a little below that.
From the more than type ogive, we infer that there are very few people above 60, something around 5.
Q. Draw the ogives for region B and try to answer the above questions. Compare with the age distribution for region A.
Line Diagram
The ages data is an example of cross-section data. Use any of the above diagrams, depending on the nature of the cross-section data.
But what if we are given time series data, that is, a series of observations, one for each time point? For example, consider the following data:

Year    Households with computer (%)
1985    8.2
1990    15
1994    22.8
1995    24.1
1998    36.6
1999    42.1
2000    51

How do we represent this graphically? We need to represent the value corresponding to each given year.
Source: Falling Through The Net: Toward Digital Inclusion (U.S. Department of Commerce, October 2000)
Line Chart
Plot the years on the horizontal axis and mark the value corresponding to each year on the vertical axis. Join the points by line segments. We have our line graph ready!
Think: Can we construct a histogram, ogive or bar chart with this data? Why or why not?
A line diagram is meant for representing chronological data. It exhibits the relationship of the variable with time.
Line Chart: Inferences
The chart shows an increasing trend over the years: from 1985 to 2000, the percentage of households with computers rises consistently.
From under 10% in 1985 it has risen to over 50% in 2000, a more than sixfold rise, i.e., an increase of over 500% from 1985 to 2000.
A line chart is useful for analysing the time trend, that is, the long-term movement of time series data.
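The size of the rise can be checked directly from the table values. The sketch below computes the percentage increase between the first and last year and confirms that the series rises at every step; the variable names are illustrative.

```r
# Values from the households-with-computer table
year <- c(1985, 1990, 1994, 1995, 1998, 1999, 2000)
pct_households <- c(8.2, 15, 22.8, 24.1, 36.6, 42.1, 51)

# Percentage increase from the first to the last observation
first <- pct_households[1]
last <- pct_households[length(pct_households)]
pct_increase <- (last - first) / first * 100
print(round(pct_increase))

# TRUE if every value is higher than the one before it
always_rising <- all(diff(pct_households) > 0)
print(always_rising)
```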
Graphical Distortion of Data

As much as graphs can be used to summarize and represent various aspects of data succinctly, they can also be used to distort data.
The first kind of distortion is inadequate representation of data. Consider the following line graph showing the population above the poverty line of a hypothetical country A.
[Line chart: People above poverty line, 1990 to 2010; vertical axis from 0 to 1200]
Seeing this graph, we conclude that poverty has been falling in this country, as the number of people above the poverty line is rising.
Graphical Distortion: Continued

But this graph used inadequate information. Look at the table from which the graph has been produced, and draw a line chart showing the relative share of people above the poverty line.
[Line chart: Relative share of people above poverty line, 1990 to 2010; vertical axis from 0 to 0.8]
We see that the relative share of people above the poverty line is actually declining, and thus the relative share of people below the poverty line is actually rising.
Our earlier conclusion, based on an inadequate representation of the data, was fallacious.
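The effect can be reproduced with made-up numbers. The figures below are purely hypothetical and are not the values behind the slides' charts: the count of people above the poverty line rises in every period, yet their share of the total population falls because the population grows faster.

```r
# Purely hypothetical figures (not the data behind the slides' charts)
year       <- c(1990, 1995, 2000, 2005, 2010)
above_line <- c(400, 550, 700, 850, 1000)     # people above the poverty line
population <- c(550, 850, 1250, 1750, 2400)   # total population

# Relative share of people above the poverty line
share_above <- above_line / population
print(data.frame(year, above_line, population, share = round(share_above, 2)))

# Two line charts side by side: absolute numbers versus relative share
par(mfrow = c(1, 2))
plot(year, above_line, type = "b", ylim = c(0, 1200),
     xlab = "Year", ylab = "People above poverty line")
plot(year, share_above, type = "b", ylim = c(0, 0.8),
     xlab = "Year", ylab = "Relative share above poverty line")
```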
Graphical Distortion of data Contd..

The above is just one example. There are numerous ways in which data can be misrepresented.
For example, one common misuse is distortion of scale.
With the explosion of data visualization techniques and sophisticated displays such as 3-D charts, data distortion has become easier to achieve.
For more information, read http://lilt.ilstu.edu/gmklass/pos138/datadisplay/badchart.htm
R Codes
Histogram
# Read the continuous ages data
data=read.csv('ages.continuous.csv',header=TRUE,sep=',')
View(data)
age=data$age
max(age)
colors=c("red", "bisque", "darkslategray", "violet", "orange", "blue", "pink", "cyan", "brown", "cornsilk")
# hist draws a histogram; right=TRUE means right-closed, left-open intervals
hist(age,right=TRUE,col=colors)
# To specify bin widths on your own
bins=seq(17,67,by=5)
hist(age,right=TRUE,breaks=bins,col=colors)
# Example of histogram with too small a bin width
bins=seq(17,67,by=2)
hist(age,right=TRUE,breaks=bins,col=colors)
# Example of histogram with too large a bin width
bins=seq(17,67,by=25)
hist(age,right=TRUE,breaks=bins,col=colors)
# Drawing a frequency polygon over a histogram
bins=seq(17,67,by=10)
hist(age,right=TRUE,breaks=bins,col=colors,xlim=c(10,75)) # draw the histogram
# Class midpoints padded with one empty class at each end; include.lowest=TRUE matches the hist() default
lines(c(12,seq(22,62,by=10),72),c(0,as.vector(table(cut(age,bins,include.lowest=TRUE))),0),lwd=2) # draw the frequency polygon
R Codes
Frequency Polygon
# Read the ages for both regions (file named on the "Why use Frequency Polygon?" slide)
data=read.csv('ages.both.regions.csv',header=TRUE,sep=',')
RegionA.age=data$RegionA.age
RegionB.age=data$RegionB.age
max(RegionA.age)
min(RegionA.age)
max(RegionB.age)
min(RegionB.age)
bins.A=seq(17,67,by=10)
bins.B=seq(15,75,by=10)
# To draw two frequency polygons on the same graph
plot(c(12,seq(22,62,by=10),72),c(0,as.vector(table(cut(RegionA.age,bins.A))),0),type="b",main="Frequency distribution of age",xlab="age",ylab="frequency",xlim=c(10,80),ylim=c(0,270))
lines(c(10,seq(20,70,by=10),80),c(0,as.vector(table(cut(RegionB.age,bins.B))),0),lwd=2,col="violet")
Line Chart
data=read.csv('Households with computer.csv',header=TRUE,sep=',')
household.comp=data$Households.with.computer.percentage
Year=data$Year
# Plot the points and join them by line segments
plot(Year,household.comp,type="b",col="blue",xlab="Year",ylab="Percentage of Households with Computer",xlim=c(1985,2000),ylim=c(5,65))
title("Line Chart")
R Codes
Ogives
# data: data frame whose first column holds the ages
min(data)
max(data)
NumberOfClasses = 10
ClassInterval = (67 - 17)/NumberOfClasses
ClassInterval
ClassEnds = seq(17,67,ClassInterval)
classes=cut(data[,1], breaks=ClassEnds)
FrequencyDistribution = table(classes)
CumulativeFrequencies = c(cumsum(FrequencyDistribution))
cbind(CumulativeFrequencies)
# Less than type ogive: cumulative frequency plotted against the class ends
plot(ClassEnds,c(0,as.vector(CumulativeFrequencies)),type="b",xlim=c(10,70),ylim=c(0,700),main="Ogives",xlab="Class Intervals",ylab="Cumulative Frequency of Age")
# More than type ogive: frequencies accumulated from the highest class downwards
cbind(FrequencyDistribution)
Frequency=as.vector(FrequencyDistribution)
More.than.cum.freq=cumsum(rev(Frequency))
Upper.limit=rev(ClassEnds)
lines(Upper.limit,c(0,More.than.cum.freq),type="b",col="violet")
Thank you