Elementary Statistics

1. Descriptive Statistics

Descriptive statistics consists of a flexible and continuously growing body of methods for
identifying and describing key features of sets of data. Along with probability theory, it
serves as the foundation for inferential statistics. The goal in inferential statistics is to use
information from a sample to draw inferences and to quantify the error probabilities and
confidence levels associated with those inferences. Descriptive statistics is more general
because it may be applied to any set of data, not just data gathered for purposes of
inference.

In certain applications, data files are enormous. For example, data in the field of geographical
information systems (GIS) can occupy many gigabytes on a computer disk. It would be virtually
impossible for a person to get any useful information at all by looking at the raw data. This is
why descriptive statistics is sometimes defined as the art of converting data into
information.

1.1 Univariate Descriptives

Our aim is to describe sets of data. Consider the very small data file in the following table:

ID   Weight   Gender
01   121      F
02   136      F
03   153      M
04   125      M
05   134      M
06   120      M
07   125      F
08   147      M
09   140      F

This table gives weights and genders of 9 experimental subjects. Weight and gender are
variables. Weight is a quantitative variable and gender is a qualitative variable. A person or
object on which measurements are made is called an observational unit (or experimental
unit), and the collection of data values obtained from a given observational unit constitutes
an observation. Each observational unit has its data recorded in one row or record of the file.

The observational units in a study may be mice or people or stars or trees or cities or mineral
specimens. The field of study may be the social sciences, life sciences, or physical sciences.
In all of these disparate areas, the same statistical methods apply and provide a unifying
methodology. Though a data file may have more than one variable, we often begin by
considering certain variables, one at a time.

Definitions

The data corresponding to a single variable is called univariate data.

A number, table, or visual display that gives information about a single variable is called a
univariate descriptive.

1.1.1 Frequency Tables and Visual Displays

One of the most basic univariate descriptives is the frequency table:

The frequency of a particular value is the number of times that value occurs in
the data column.

The relative frequency is the frequency expressed as a percentage or proportion of the total
number of observations under consideration.

The cumulative frequency is the sum of the frequencies of the values at or before the
current one in the listing.

Value   Frequency   Relative Frequency   Cumulative Frequency
F       4           0.44                 4
M       5           0.56                 9

Value   Frequency   Relative Frequency   Cumulative Rel. Frequency
120     1           11.1%                11.1%
121     1           11.1%                22.2%
125     2           22.2%                44.4%
134     1           11.1%                55.6%
136     1           11.1%                66.7%
140     1           11.1%                77.8%
147     1           11.1%                88.9%
153     1           11.1%                100.0%
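
The three definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table` and the row layout are assumptions made for this example, not part of the notes. The data are the nine weights and genders from the data file.

```python
from collections import Counter

# Weights and genders of the nine subjects in the example data file.
weights = [121, 136, 153, 125, 134, 120, 125, 147, 140]
genders = ["F", "F", "M", "M", "M", "M", "F", "M", "F"]

def frequency_table(values):
    """Return rows of (value, frequency, relative frequency, cumulative frequency).

    Hypothetical helper written for this example; values are listed in sorted order.
    """
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]  # frequencies at or before the current value
        rows.append((value, counts[value], counts[value] / n, cumulative))
    return rows

weight_table = frequency_table(weights)
gender_table = frequency_table(genders)

for value, freq, rel, cum in weight_table:
    # The last column is the cumulative relative frequency.
    print(f"{value:>5} {freq:>3} {rel:>7.1%} {cum / len(weights):>7.1%}")
```

Dividing the cumulative frequency by the number of observations gives the cumulative relative frequency shown in the weight table.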

A more informative table can be made by grouping the data into intervals. The result is a
more interesting display which conveys new insight about the data. An important point here
is that the intervals all have the same length. Therefore the variation in frequency among the
intervals says something about the variation in density of the observations from one location
to another. In particular, the table with grouped data shows that the weights cluster more
closely together (more densely) at the lower end of the weight range than at the higher end.

Interval     Frequency   Relative Frequency   Cumulative Rel. Frequency
[120, 130)   4           44.4%                44.4%
[130, 140)   2           22.2%                66.7%
[140, 150)   2           22.2%                88.9%
[150, 160)   1           11.1%                100.0%

Remark

"Frequency" here means the number of observations contained in the given interval.

Every data value is in one and only one interval, because each interval includes its lower
limit and excludes its upper limit.
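
Grouping into equal-length intervals can be sketched as follows. The function name `grouped_frequencies` and its parameters are assumptions made for this illustration; each interval is taken to include its lower limit and exclude its upper limit, so that every value lands in exactly one interval.

```python
# Weights of the nine subjects in the example data file.
weights = [121, 136, 153, 125, 134, 120, 125, 147, 140]

def grouped_frequencies(values, start, width, bins):
    """Count the values falling in each of `bins` equal-length intervals.

    Interval k covers start + k*width (included) up to start + (k+1)*width (excluded).
    Hypothetical helper written for this example.
    """
    counts = [0] * bins
    for v in values:
        k = int((v - start) // width)  # index of the interval containing v
        if not 0 <= k < bins:
            raise ValueError(f"{v} falls outside the intervals")
        counts[k] += 1
    return counts

start, width, bins = 120, 10, 4
counts = grouped_frequencies(weights, start, width, bins)

cumulative = 0
for k, count in enumerate(counts):
    cumulative += count
    lower = start + k * width
    upper = lower + width
    print(f"{lower}-{upper}: {count} {count / len(weights):.1%} "
          f"{cumulative / len(weights):.1%}")
```

Because the intervals have the same length, the counts can be compared directly as a measure of how densely the observations are packed in each part of the range.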

The information in frequency tables is often presented in graphical form. For a non-numerical
variable we have the bar chart. A bar chart corresponding to a frequency table for numerical
data is called a histogram.

The given histogram is a frequency histogram, because it relates bar heights to frequencies.

1.2 Measures of Location and Dispersion

A number that is computed from a population or that in some way characterizes a
population is called a parameter. A number computed from a sample is called a statistic.

There exist two different kinds of populations.

A concrete population is a definite set of observational units. The number of units is denoted
N. For example, the cats living in a given city on a given date form a concrete population.
Nobody knows exactly what the population size, N, is equal to for this population, but we know
that it has some specific value.

An abstract population is a population with no definite size, whose observational units only
come into existence as time progresses or as they are created by an experimenter. For
example, the measurements obtained from an experiment carried out by a scientist form an
abstract population.

Notes: We will use the following notation:

n denotes the number of observational units in a sample, and N the number of observational
units in a concrete population.

We often use x as the name of a variable. Then the data values are called x-values.

Definition 1: The sample mean of a list of x-values is denoted $\bar{x}$ and is defined by:

$$\bar{x} = \frac{\sum x}{n}$$

where the sum is over all observations in the sample. The population mean is denoted $\mu$ and
is defined similarly:

$$\mu = \frac{\sum x}{N}$$

This time, the sum is over all observational units in the population.

Definition 2: The median of a list of n numbers is found by first listing the numbers in
increasing order, then:

If n is odd the median is the number in the middle position in the listing.

If n is even the median is the average of the two middle numbers in the listing.
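
As a quick check of Definitions 1 and 2, here is a minimal Python sketch applied to the nine weights from the data file. The function names `mean` and `median` are chosen for this example only.

```python
# Weights of the nine subjects in the example data file.
weights = [121, 136, 153, 125, 134, 120, 125, 147, 140]

def mean(xs):
    """Sum of the values divided by the number of values."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value of the sorted list, or the average of the two middle values."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # n odd: the single middle position
    return (ordered[mid - 1] + ordered[mid]) / 2  # n even: average the two middle values

weight_mean = mean(weights)
weight_median = median(weights)
print(f"mean = {weight_mean:.2f}, median = {weight_median}")
```

For these data the sum of the weights is 1201, so the mean is 1201/9, and the median is the fifth value in the sorted list, 134.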

Both the median and the mean are measures of location. A measure of dispersion indicates how
spread out a set of numerical data values is.

Definition 3: The sample variance is denoted $s^2$ and is defined by the formula

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

The sample standard deviation is denoted $s$ and is defined to be the square root of the sample
variance. The range is defined to be the largest data value minus the smallest data value.
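
Definition 3 can be traced step by step in Python. The sketch below is illustrative only; the function names are assumptions made for this example, and the data are the nine weights from the data file.

```python
import math

# Weights of the nine subjects in the example data file.
weights = [121, 136, 153, 125, 134, 120, 125, 147, 140]

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(xs))

def data_range(xs):
    """Largest data value minus smallest data value."""
    return max(xs) - min(xs)

s2 = sample_variance(weights)
s = sample_std(weights)
r = data_range(weights)
print(f"s^2 = {s2:.2f}, s = {s:.2f}, range = {r}")
```

Note that the whole sum of squared deviations is divided by n - 1; here the range is 153 - 120 = 33.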

Definition 4: The population variance is denoted $\sigma^2$ and is defined by:

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$

The population standard deviation is denoted $\sigma$ and is defined to be the square root of the
population variance.

1.3 Interpreting the Standard Deviation

The standard deviation gives us information about how data values are distributed. A set of data
values is said to have a mound-shaped distribution if a dot plot or histogram for the data
would have approximately the shape of a symmetrical mound. The standard deviation may be
computed as either the population standard deviation (denominator N) or the sample standard
deviation (denominator n-1). The results are valid either way.
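
To see how much the choice of denominator matters in practice, the following sketch computes both versions for the nine weights using Python's standard `statistics` module (`pstdev` divides by N, `stdev` divides by n-1). Treating the nine weights as a whole population in the first call is purely an assumption for the sake of comparison.

```python
import math
import statistics

# Weights of the nine subjects in the example data file.
weights = [121, 136, 153, 125, 134, 120, 125, 147, 140]

# Denominator N: the data are treated as a complete population (assumption).
population_sd = statistics.pstdev(weights)

# Denominator n - 1: the data are treated as a sample.
sample_sd = statistics.stdev(weights)

# The two values differ only by the factor sqrt((n - 1) / n).
n = len(weights)
factor = math.sqrt((n - 1) / n)

print(f"population sd = {population_sd:.3f}")
print(f"sample sd     = {sample_sd:.3f}")
print(f"ratio         = {population_sd / sample_sd:.4f} (expected {factor:.4f})")
```

The sample version is always slightly larger, and the gap shrinks as the number of observations grows.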

1.4 Percentiles and Z-scores

Definition 5: Given a list of numbers, a p-th percentile is a number c such that p percent (or
fewer) of the numbers in the list are less than c, and 100-p percent (or fewer) are greater than c.
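
Definition 5 describes a condition that a candidate number c must satisfy, so it can be checked directly. The sketch below is a hypothetical helper (the name `is_percentile` is an assumption for this example); it tests the condition rather than computing a percentile by any particular interpolation rule.

```python
# Weights of the nine subjects in the example data file.
weights = [121, 136, 153, 125, 134, 120, 125, 147, 140]

def is_percentile(values, c, p):
    """Return True if c satisfies the p-th percentile condition of Definition 5.

    At most p percent of the values may be less than c, and at most
    100 - p percent may be greater than c. Integer arithmetic avoids
    rounding problems when comparing percentages.
    """
    n = len(values)
    below = sum(1 for v in values if v < c)
    above = sum(1 for v in values if v > c)
    return 100 * below <= p * n and 100 * above <= (100 - p) * n

median_ok = is_percentile(weights, 134, 50)
quartile_ok = is_percentile(weights, 125, 25)
minimum_ok = is_percentile(weights, 120, 50)
print(median_ok, quartile_ok, minimum_ok)
```

With these data, 134 satisfies the condition for p = 50 (four values lie below it and four above), while 120 does not, since eight of the nine values are greater than 120.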

The relative standard deviation is defined as

$$100 \times \frac{\text{standard deviation}}{\text{mean}}$$

Some analysts prefer to call this the percent relative standard deviation and to call the
relative standard deviation the value that is obtained without multiplying by 100.
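
A minimal sketch of both conventions, applied to the nine weights, is given below. The notes do not say which standard deviation to use; the sample standard deviation (denominator n-1) is assumed here, and the function name and its `percent` parameter are hypothetical.

```python
import statistics

# Weights of the nine subjects in the example data file.
weights = [121, 136, 153, 125, 134, 120, 125, 147, 140]

def relative_standard_deviation(xs, percent=True):
    """Standard deviation divided by the mean, optionally multiplied by 100.

    Assumption: the sample standard deviation (denominator n - 1) is used.
    """
    ratio = statistics.stdev(xs) / statistics.mean(xs)
    return 100 * ratio if percent else ratio

rsd_percent = relative_standard_deviation(weights)
rsd_ratio = relative_standard_deviation(weights, percent=False)
print(f"percent RSD = {rsd_percent:.2f}%, ratio = {rsd_ratio:.4f}")
```

The `percent` flag switches between the two naming conventions described above; the underlying ratio is the same.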

Remark

When the data are grouped into intervals, we can only obtain an approximate
mean and variance. For this we use the class mark.

Definition: The class mark is the midpoint of a class, defined as follows:

$$\text{class mark} = \frac{\text{lower limit of class} + \text{upper limit of class}}{2}$$
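
One common way to use the class mark, sketched below, is to treat every observation in a class as if it were equal to the class mark and then apply the usual formulas with each mark weighted by its frequency. The notes do not spell out this procedure, so it is an assumption here, as are the function names; the variance uses the sample denominator n-1. The classes are those of the grouped weight table.

```python
# Grouped weight data: ((lower limit, upper limit), frequency).
classes = [((120, 130), 4), ((130, 140), 2), ((140, 150), 2), ((150, 160), 1)]

def class_mark(lower, upper):
    """Midpoint of a class."""
    return (lower + upper) / 2

def grouped_mean(groups):
    """Approximate mean: class marks weighted by their frequencies."""
    n = sum(freq for _, freq in groups)
    return sum(freq * class_mark(*limits) for limits, freq in groups) / n

def grouped_sample_variance(groups):
    """Approximate sample variance (denominator n - 1) from class marks."""
    n = sum(freq for _, freq in groups)
    m = grouped_mean(groups)
    squared = sum(freq * (class_mark(*limits) - m) ** 2 for limits, freq in groups)
    return squared / (n - 1)

approx_mean = grouped_mean(classes)
approx_variance = grouped_sample_variance(classes)
print(f"approximate mean = {approx_mean}, approximate variance = {approx_variance}")
```

For these classes the approximate mean is 135 and the approximate variance is 125. Both differ somewhat from the values computed from the raw weights, which is the price of working with grouped data.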
