Learning goals
Understanding a possible approach to data analysis
Studying three data representation techniques:
    Stem and leaf plot
    Frequency table
    Dot plot
Data Analysis
    Exploratory
        Cleaning
        Summarization
        Exploration of salient features
            Location
            Variability (spread)
            Concentration
            Shape
                Skewness
                Tail information
    Inferential
Applied Statistics and Computing Lab
Dataset
The percentage of employees involved in a certain worker-involvement-in-decision-making program, in each of 30 companies:
(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57, 50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)
Arranged in ascending order:
(5, 32, 33, 34, 35, 37, 38, 42, 42, 43, 44, 45, 46, 47, 47, 48, 48, 49, 49, 50, 51, 52, 52, 53, 55, 56, 57, 58, 63, 78)
0 | 5
1 |
2 |
3 | 234578
4 | 223456778899
5 | 012235678
6 | 3
7 | 8
Data taken from Aczel A., Sounderpandian J., Complete Business Statistics
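The construction of the plot above can be sketched in R by splitting each value into a stem (tens digit) and a leaf (units digit). This is a minimal illustration, assuming two-digit data like this dataset; the variable names are illustrative and this is not the algorithm used by R's built-in stem function.

```r
# Employee-involvement percentages for the 30 companies (from the dataset above)
x <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
       50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

x <- sort(x)
stems  <- x %/% 10   # tens digit
leaves <- x %% 10    # units digit

# List every stem from the smallest to the largest, so that stems
# without any observation (here 1 and 2) still get an empty row
stem_levels <- seq(min(stems), max(stems))
leaf_groups <- split(leaves, factor(stems, levels = stem_levels))

# One string per row, e.g. "3 | 234578"
stem_rows <- vapply(
  names(leaf_groups),
  function(s) paste0(s, " | ", paste(leaf_groups[[s]], collapse = "")),
  character(1)
)
cat(stem_rows, sep = "\n")
```

Keeping the empty stems 1 and 2 in the display is what makes the value 5 stand out as unusual.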
Example
GPA of 50 students in the first-semester exam for their second course in Quantitative Methods
The GPA range is 0-10
The recorded values have 7 digits after the decimal point
They are converted to a format with 1 digit after the decimal point
For negative values, a minus sign is put in front of the stem
A stem and leaf plot is a powerful tool for studying a dataset
Gives an idea about the distribution of values: their spread and density
Useful in detecting unusual values and the value occurring with the highest frequency
Easy to read and understand
Not very informative if there are too few or too many values
Frequency table
A table listing the frequency count for each value of a variable
A useful tool for getting a basic idea of the data at a quick glance
Very easy to construct and mostly self-explanatory
Can accommodate many types of data, whether categorical or numerical
Both types of numerical data, discrete and continuous, can be represented in a frequency table
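As a small illustration of the points above, the 30-company percentages can be tabulated with R's table function. This is a sketch only; the names freq and rel_freq are illustrative.

```r
# Employee-involvement percentages for the 30 companies
x <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
       50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

freq <- table(x)              # frequency count for each distinct value
print(freq)

print(freq[freq > 1])         # only the values that occur more than once

rel_freq <- prop.table(freq)  # the same table as proportions of the total
print(round(rel_freq, 3))
```

With so many distinct values, most counts are 1, which is the situation where grouping the values into class intervals becomes useful.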
Cars dataset
Consists of data on 804 used cars in the USA
Data is collected on 12 features, such as the price, make and model of the car, the number of cylinders, the number of doors, etc.
Collected from the Kelley Blue Book
The class limits, i.e. the highest and lowest values of a class interval, must be chosen carefully
Classes must be determined such that no value of the dataset can belong to more than one class interval
This is done using two types of brackets: closed [ ] or open ( )
A class interval can have one open and one closed bracket
Closed bracket => the number on that side of the interval is included
Open bracket => all numbers up to or starting from, but excluding, the number on that side of the interval
Interval    Meaning
[1,3]       Includes every number from 1 to 3, including the limits, e.g. 1, 1.3, 1.8, 2.24, 2.6, 2.98, 2.999999, 3
[1,3)       Includes every number starting from 1 and reaching up to but not including 3, e.g. 1, 1.01, 1.3, 1.78, 2.4, 2.9, 2.99, 2.999, 2.9999, 2.99999 (there can be any number of 9s after the decimal point)
(1,3]       Includes every number greater than 1 (but not 1 itself) and reaching up to and including 3, e.g. 1.000000000001, 1.0000001, 1.1, 1.24, 1.7, 2.3, 2.69, 2.99, 3 (a value can be as close to 1 as desired, as long as it exceeds 1)
(1,3)       Includes every number between 1 and 3, excluding 1 and 3, e.g. 1.0000000000000000001, 1.15, 1.6, 1.92, 2.3, 2.89, 2.99999999999999999999999
For discrete data, the limits of class intervals can easily be chosen in a non-overlapping manner
For continuous data, the same limit value appears in adjacent classes, e.g. [146, 152) and [152, 158), and the bracket type decides which class it belongs to
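The effect of the bracket type can be checked in R with the cut function, whose right argument chooses which side of each class is closed. This is a minimal sketch; the values and break points are chosen only to mirror the 1-to-3 example above.

```r
values <- c(1, 1.3, 2.999, 3)
breaks <- c(1, 3, 5)

# right = FALSE: closed on the left, open on the right, i.e. [1,3) and [3,5)
left_closed <- cut(values, breaks = breaks, right = FALSE)

# right = TRUE (R's default): open on the left, closed on the right, i.e. (1,3] and (3,5]
right_closed <- cut(values, breaks = breaks, right = TRUE)

# Compare where each value lands under the two conventions
print(data.frame(values, left_closed, right_closed))
```

Under the left-closed convention the value 3 falls in the second class, while under the right-closed convention it falls in the first class and the value 1 is left unclassified (NA), since no class includes it. Either way, each value lands in at most one class.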
Dot plot
A simple tool to depict the frequencies of values in a dataset
The X-axis denotes the value and the corresponding frequency is denoted on the Y-axis
Gives an idea about the distribution of values
Indicates the intervals within which the variable does not take any values
The value with the highest frequency is easily determined
To create a dot plot in R, the variable has to be numeric
In case of a categorical variable or a variable with class intervals, an equivalent variable assigning a numeric value to each category or class must be created
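The last two points can be sketched as follows. This example uses base R's stripchart function with stacked points as a stand-in dot plot, which is an assumption of this sketch rather than the function used in these slides; the 10-unit class width is also chosen only for illustration.

```r
# Employee-involvement percentages for the 30 companies
x <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
       50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

# Numeric variable: one dot per observation, equal values stacked vertically
stripchart(x, method = "stack", pch = 16, offset = 0.5,
           xlab = "Employees involved (%)")

# Variable with class intervals: cut() returns a factor, which is mapped
# to the numeric codes 1, 2, ... (one code per class) before plotting
classes <- cut(x, breaks = seq(0, 80, by = 10), right = FALSE)
codes <- as.integer(classes)

stripchart(codes, method = "stack", pch = 16, offset = 0.5, axes = FALSE,
           xlim = c(0.5, nlevels(classes) + 0.5), xlab = "Class interval")
# Label each numeric code with the class interval it stands for
axis(1, at = seq_len(nlevels(classes)), labels = levels(classes))
```

The empty columns for codes 2 and 3 correspond to the classes [10,20) and [20,30), in which the variable takes no values.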
Comparison
Stem and leaf plot, frequency table and dot plot, compared for discrete data and continuous data:
    Frequency table: constructing class intervals can be useful
    Dot plot: need to create class intervals; best depiction if there are many values but only a few of them have a high frequency
    Depicts actual values
    Most informative with large data
    Can detect unusual observations
Disadvantages
Height is in cm.

Height (in cm)    Frequency
147.2             1
149.5             1
149.9             1
151.1             1
...               ...
198.1             1
Out of the 507 total data points, there are 147 unique height values
Clearly, for this continuous data we need to make class intervals!
In this case a stem and leaf plot is not a good idea
Most importantly, for this variable, we do not need to know the exact values; knowing the range within which they lie might be sufficient
Height (in cm)    Frequency
[146, 152)        5
[152, 158)        31
[158, 164)        92
[164, 170)        101
[170, 176)        118
[176, 182)        92
[182, 188)        40
[188, 194)        26
[194, 200)        2
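The class table above can be rebuilt in R from its break points and counts, which also allows a quick consistency check against the 507 data points. This is a sketch under the assumption that only the published counts are available, not the raw heights; the object names are illustrative.

```r
# Class limits 146, 152, ..., 200 and the frequencies from the table above
breaks <- seq(146, 200, by = 6)
freq   <- c(5, 31, 92, 101, 118, 92, 40, 26, 2)

# Left-closed, right-open labels such as "[146, 152)"
labels <- paste0("[", head(breaks, -1), ", ", tail(breaks, -1), ")")

height_table <- data.frame(class = labels, frequency = freq,
                           relative = round(freq / sum(freq), 3),
                           stringsAsFactors = FALSE)
print(height_table)

total       <- sum(freq)               # should equal the number of data points
modal_class <- labels[which.max(freq)] # class with the highest frequency
```

With the raw heights at hand, the same counts would typically come from combining cut (with right = FALSE) and table.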
Conclusion
Easy to construct
Important tools for getting a feel of the data!
Must use the appropriate representation based on the characteristics of the data
Helpful in determining the further course of data analysis
R-codes
Functions and their R-code
    Stem and leaf plot: stem(variable name)
        Note: scale is an important parameter to explore in R's stem function
    Frequency table: table(variable name)
    Dot plot: install.packages("TeachingDemos")
              library(TeachingDemos)
              dots(variable name)
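The functions listed above can be tried on the 30-company dataset as follows. This is a minimal sketch: the scale values are arbitrary choices for exploration, and the dots call is skipped when the TeachingDemos package is not installed.

```r
# Employee-involvement percentages for the 30 companies
x <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
       50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

stem(x)              # stem and leaf plot with the default scale = 1
stem(x, scale = 2)   # scale controls the length of the display
table(x)             # frequency table

# dots() comes from the TeachingDemos package; run it only if available
if (requireNamespace("TeachingDemos", quietly = TRUE)) {
  TeachingDemos::dots(x)
}

# Number of printed lines for each scale, to compare the two displays
rows_default <- length(capture.output(stem(x)))
rows_scaled  <- length(capture.output(stem(x, scale = 2)))
```

Comparing rows_default and rows_scaled is a quick way to see how the scale argument changes the number of rows in the display.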
Thank you