## A. Errors in Data and Calculations

Measurements are prone to error, so every technique for data analysis must take measurement
error into account. Experimental errors are sometimes unavoidable, and their size depends on
the accuracy of the instrument. For example, consider measuring the length of a table and
recording it as 5 meters. Here we are actually comparing the length of the table with a
standard that is 1 meter long, and there is always some uncertainty in that comparison. It
depends on the accuracy of the scale used. If the length lies between 5 and 6 meters and the
scale has no subdivisions of the meter marked on it, the measurement is not accurate. To get
a more accurate measurement, use a scale on which the meter is subdivided into centimeters,
so that the length of the table can be measured to the nearest centimeter, say 5 meters and
3 centimeters. Experimentally determined quantities always contain errors to varying degrees,
and conclusions drawn from such data are reliable only if those errors are taken into
consideration in the calculations. Minimizing errors by adopting accurate measurement scales,
estimating the errors, and applying the principles of error propagation in calculations are
very important in all sciences to prevent deceptive and confusing interpretation of facts.

## B. Absolute and Relative Uncertainty

Experimental and measurement errors always create uncertainty in the final data. This
problem can be addressed by applying the rules of significant figures and by specifying the
range of error within which each value can vary. Each reading is uncertain within this
range. The size of the range, expressed in the units of the measurement, is known as the
absolute error. The same error expressed as a percentage of the measured value is called the
relative error.

For example, the temperature of a solution may be reported as 37 ± 3 °C. Here, ± 3 °C is the
range by which the reading is uncertain, and this is the absolute error. Expressed as a
percentage of the reading, the same error is 3/37 × 100 ≈ 8.1%, so 37 ± 3 °C can also be
written as 37 °C ± 8.1%. Here, 8.1% is the relative error.
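
The conversion from absolute to relative error is a single division. The sketch below is only an illustration: the function name `relative_error_percent` is hypothetical, and it assumes a reading of 37 °C with an absolute error of 3 °C, for which the ratio 3/37 comes to roughly 8%.

```python
def relative_error_percent(value: float, absolute_error: float) -> float:
    """Return the relative error as a percentage of the measured value."""
    if value == 0:
        raise ValueError("relative error is undefined for a zero reading")
    return abs(absolute_error / value) * 100.0


# Assumed temperature reading: 37 +/- 3 degrees C.
reading, abs_err = 37.0, 3.0
rel_err = relative_error_percent(reading, abs_err)
print(f"{reading:g} +/- {abs_err:g} C is equivalent to {reading:g} C +/- {rel_err:.1f}%")
```

Because the relative error is a ratio, it has no units, which makes it convenient for comparing the quality of measurements taken on different scales.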

## C. Types of Errors

Systematic Errors
Random Errors

When an error affects all measurements in the same way, it is called a systematic error. In
most cases the cause of this error is known, and introducing a correction factor can
minimize it. For example, consider a watch that is five minutes fast (an error of +5
minutes). In this case we can subtract five minutes from the time shown to get the correct
time. Similarly, a balance that shows an error of 0.5 g can be corrected for effectively
once the fact is known.

If an error occurs for unknown reasons, it is called a random error or an accidental error.
This type of error can be detected by repeating the experiment under the same conditions. If
repeating the experiment without changing the experimental conditions gives different values
or results, then random errors are present. These errors can be quantified and minimized by
applying methods of statistical analysis.

The results or data of an experiment should be reliable and reproducible. The term
precision refers to the reliability and reproducibility of results; it indicates the extent
to which the data is free from random errors. The term accuracy is also used to describe the
quality of the data. When both systematic and random errors are minimal or almost zero and
the results are reproducible, we refer to the data as accurate.
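
The two kinds of error are handled differently in practice, as the following minimal sketch shows. The readings are invented for illustration, and the sketch assumes that the balance reads 0.5 g too high (the sign of the offset is an assumption). The known offset is subtracted, and the remaining scatter between repeated readings is summarized with the sample standard deviation.

```python
import statistics

# Hypothetical repeated weighings (grams) of the same sample on a balance
# that is assumed to read 0.5 g too high; the values are made up.
readings_g = [10.52, 10.47, 10.55, 10.49, 10.51, 10.46]
KNOWN_OFFSET_G = 0.5  # error with a known cause: a fixed correction factor

# Removing the known offset is a simple subtraction.
corrected_g = [r - KNOWN_OFFSET_G for r in readings_g]

# Random error shows up as scatter between repeated readings.
mean_g = statistics.mean(corrected_g)
spread_g = statistics.stdev(corrected_g)       # sample standard deviation
sem_g = spread_g / len(corrected_g) ** 0.5     # standard error of the mean

print(f"mean = {mean_g:.3f} g, stdev = {spread_g:.3f} g, sem = {sem_g:.3f} g")
```

Note that the correction shifts the mean but leaves the spread unchanged: a fixed offset affects accuracy, while the scatter between repeats reflects precision.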

## D. Statistical Analysis

## Data, Information, and Knowledge

Data is the set of results obtained from an experiment. Data is a crude form of
information. Information is the communication of knowledge. Knowledge consists of
established or proven facts supported by evidence or data. Data by itself is not knowledge,
but it can be converted into knowledge systematically through the following sequence.

Data becomes information when it becomes relevant to solving your specific problem.
Information becomes facts when the data can support the information. Facts become
knowledge when they are useful in the successful explanation of the problem, phenomenon,
or process. Statistics plays an important role in the systematic conversion of data into
knowledge. It is the science that helps you make decisions under uncertainty on a
numerical and measurable scale. This decision making should be based on the data, not
on personal views and beliefs. Statistical analysis of data involves the study of the laws
of probability; the collection, organization, and presentation of data; data properties;
relationships among data; and so on.

## Types of Data and Levels of Measurement

Data can be of two types: qualitative data and quantitative data.

Data such as color, size, or any other attribute of a population that is not computable by
arithmetic relations is considered qualitative data. Such attributes are markers by which we
can identify an individual or a process, or the group or class to which it belongs. They are
called categorical variables.

Quantitative data consists of measurements in the form of numerical values. Statistical
analysis is applicable only to this type of data. Quantitative data can be either discrete
or continuous. Discrete data are countable, for example the number of unripe fruits in a
basket or box. When the parameters are measurable and are expressed on a continuous scale,
the data are called continuous, for example the weight of tissues used in an experiment.

Statistical analysis of data includes a number of steps. The first step is to measure or to
count. This measuring or counting is the connection between reality and the data. A set of
data is a representation of reality on a numerical or measurable scale. If the analyst is
involved in collecting the data, it is called primary data; otherwise, it is called
secondary data.

Data, whether discrete or continuous, can be in any one of the following forms:
nominal, ordinal, interval, and ratio (NOIR).

Under the conditions of uncertainty, decision making is largely dependent on the application
of statistical analysis of data for probabilistic risk assessment of the decision.

Figure 1 is a graphical representation of the construction of statistical models for
decision making under conditions of uncertainty.

Figure 1. The statistical thinking process in decision making under uncertainties.

## The Process of Statistical Analysis

Statistics is a set of mathematical methods used to collect, analyze, present, and interpret
data in order to reach a conclusion about a problem. These methods are now used in a wide
variety of professions to solve many complex experimental problems. They are very helpful
to decision makers, managers, and administrators in politics, business, and economics,
enabling them to arrive at correct and better decisions about uncertain states of affairs.
Advances in computer technology and software engineering have greatly simplified
statistical analysis, and a great deal of statistical information is available in today's
economic and socio-political environments. There are very efficient software packages with
extensive data-handling capabilities. They are ideal for routinely handling various types
of data, from very small to very elaborate. Even though computers assist in statistical
analysis, the analysis mainly focuses on the outcome, that is, on its ability to support
correct predictions and decisions.

The statistical analysis of data involves four basic steps:

Definition of (understanding) the problem;
Data collection or its compilation;
Analyzing the data; and
Final assessment and reporting of results.

Defining the problem: A clear vision of the problem is a prerequisite. The correct
definition of the problem will help in collecting the exact type of data for analysis.

Collecting data: The data has to be collected from a specific group or population.
Therefore, the population about which we are trying to make an inference also has to be
clearly defined. Sampling and experimental design are required for carrying out precise
collection of data. Designing the ways to collect data is an important part of statistical
methods of data analysis, even though improvements in computational statistics have
simplified the process of data collection.

Defining the population and the sample are two important aspects of statistical analysis.
(a) Population: the set of all the elements of interest in an experiment or study.
(b) Sample: a subset of a population.
In statistics, we select a small, well-defined sample and then extend the inference to the
whole population. This is known as inductive reasoning. Its main purpose is to test a
hypothesis regarding a population. Inference about a population is obtained from the
information contained in a sample.

Analyzing the data: Data is grouped or classified and analyzed by suitable methods to
convert it into results.

Reporting the results: Finally, the results are expressed in a suitable form such as tables,
graphs, or a set of percentages. Since only a small collection or sample has been examined
and not the entire population, the results should reflect the uncertainty condition through
probability statements, intervals of values, and errors.
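
As a rough illustration of reporting a result together with its uncertainty, the sketch below draws a sample from a simulated population and reports the sample mean with an approximate 95% interval. The population, the sample size of 40, and the use of the normal critical value 1.96 (rather than a t value) are all assumptions made for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated tissue weights in grams.
population = [random.gauss(50.0, 5.0) for _ in range(10_000)]

# Examine only a small random sample instead of the whole population.
sample = random.sample(population, 40)

n = len(sample)
sample_mean = statistics.mean(sample)
std_error = statistics.stdev(sample) / n ** 0.5

# Approximate 95% interval using the normal critical value 1.96.
low = sample_mean - 1.96 * std_error
high = sample_mean + 1.96 * std_error
print(f"n = {n}: mean = {sample_mean:.2f} g, 95% interval ~ ({low:.2f}, {high:.2f}) g")
```

The interval narrows as the sample grows, because the standard error shrinks with the square root of the sample size.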

## E. Presentation of Experimental Data

Data has to be analyzed and converted into a result that conveys the proper information or
knowledge. The data that we obtain may come from small groups or samples that represent
the entire population. Samples are the only realistic way to obtain data because of time
and cost constraints. For the convenience of statistical analysis, data can be classified
into two categories, cross-sectional data and time series data:
Cross-sectional data - Data collected at the same time or approximately the same point in
time.
Time series data - Data collected at different time intervals over a specific time period.
The data may be collected from existing sources, from new observations, or from experiments
designed to produce new data. In experimental studies there will be a number of factors
influencing the process. First, the variable of interest is identified, and then the other
variables or factors are controlled so that data can be collected on the influence of the
variable of interest. A survey is the most common type of observational study.

## F. Data Analysis

In statistics, there are mainly two categories of data analysis: exploratory methods and
confirmatory methods. In exploratory methods, simple arithmetic calculations are used to
analyze the data and easy-to-draw pictures are used to summarize it.

Probability theory is used in the confirmatory method of data analysis. Probability is
important in decision making because it provides a means for measuring, expressing, and
analyzing the uncertainties linked with future events.

## Data Processing: Coding, Typing, and Editing

The data that is recorded on a data sheet goes through three stages:
Coding: The data is transferred, if necessary, onto coded sheets.
Typing: The data is typed and stored by at least two independent data-entry persons.
Editing: The data is compared with the independently entered data to check for errors.
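
The editing stage amounts to comparing two independently typed copies of the same records. A minimal sketch follows, using a hypothetical helper `find_entry_mismatches` and invented values; any row where the two copies disagree is flagged for checking against the original data sheet.

```python
def find_entry_mismatches(first_pass, second_pass):
    """Return (row_index, first_value, second_value) for every disagreement."""
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must contain the same number of rows")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(first_pass, second_pass))
        if a != b
    ]


# Hypothetical values typed by two independent data-entry persons.
entered_by_a = ["123.45", "98.6", "0.072", "15"]
entered_by_b = ["123.54", "98.6", "0.072", "15"]

mismatches = find_entry_mismatches(entered_by_a, entered_by_b)
for row, a, b in mismatches:
    print(f"row {row}: {a!r} vs {b!r} -> check against the original data sheet")
```

The comparison only shows that the copies differ; deciding which copy is correct still requires going back to the source record.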

When the data is recorded or entered into the data sheet or computer, the following types of
errors are possible:
Recording errors
Typing errors
Transcription errors (incorrect copying)
Inversion errors (for example, 123.45 typed as 123.54)
Repetition errors
Deliberate errors

## G. Trends

Experimental data is displayed in a suitable graphical form to analyze the trends of
variation among the variables. In certain cases the values are highly variable and fluctuate
around a mean value. This phenomenon is called scatter, and the distribution of the values
around the mean is often described by a Gaussian distribution. For example, if we plot
blood glucose levels as a function of time, we may get a scattered distribution. A line
drawn through all the values will fluctuate strongly.
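
One common way to see a trend through scattered values is to average neighbouring points. The text does not prescribe any smoothing method, so the simple moving average below is an assumption chosen for illustration, and the glucose readings are invented.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values (simple moving average)."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical blood glucose readings (mg/dL) taken at regular intervals.
glucose = [92, 110, 87, 105, 95, 118, 90, 101, 97, 108]

overall_mean = sum(glucose) / len(glucose)
smoothed = moving_average(glucose, window=3)

print(f"mean level: {overall_mean:.1f} mg/dL")
print("smoothed:", [round(v, 1) for v in smoothed])
```

The smoothed series varies over a narrower range than the raw readings, which makes any underlying drift easier to see than in a line drawn through every point.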

## H. Testing Mathematical Models

The following are the main mathematical models used for testing the distribution
of variables.

Normal
Application: It is the basic distribution of statistics and an appropriate model for many
physical phenomena. Many applications arise from the central limit theorem: the average of
n observations approaches a normal distribution, irrespective of the form of the original
distribution, under quite general conditions.
Example: Distribution of physical measurements, intelligence test scores, product
dimensions, average temperatures, etc. Many methods of statistical analysis presume that
the data are normally distributed. The generalized Gaussian distribution has the following
probability density function (pdf):

$f(x) = A \exp\left(-B\,|x|^{n}\right)$, where A, B, and n are constants.

If n = 1 it is the Laplacian distribution, and if n = 2 it is the Gaussian distribution.
This distribution approximates the data reasonably well in some image coding applications.
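
For the generalized Gaussian function to be a probability density, its total area must equal 1, which ties the constant A to B and n. The sketch below assumes that normalization, A = n B^(1/n) / (2 Γ(1/n)), which follows from integrating exp(-B|x|^n) over the real line, and checks it numerically for n = 1 and n = 2. The function names and the parameter values are illustrative only.

```python
import math


def generalized_gaussian_pdf(x: float, b: float, n: float) -> float:
    """Evaluate A * exp(-B * |x|**n), with A chosen so the total area is 1."""
    a = n * b ** (1.0 / n) / (2.0 * math.gamma(1.0 / n))
    return a * math.exp(-b * abs(x) ** n)


def integrate(f, lo, hi, steps=20_000):
    """Plain trapezoidal rule, adequate for a sanity check."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, steps))
    return total * h


# n = 1 gives the Laplacian shape, n = 2 the Gaussian shape.
area_laplace = integrate(lambda x: generalized_gaussian_pdf(x, b=1.0, n=1), -40, 40)
area_gauss = integrate(lambda x: generalized_gaussian_pdf(x, b=0.5, n=2), -10, 10)
print(f"area (n=1): {area_laplace:.4f}, area (n=2): {area_gauss:.4f}")
```

With n = 2 and B = 0.5 the function reduces to the standard normal density, whose peak value is 1/sqrt(2π).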

Slash distribution: The distribution of the ratio of a normal random variable to an
independent uniform random variable.

Log-normal
Application: The representation of a random variable whose logarithm follows a normal
distribution. This is a model for processes arising from many small multiplicative errors
and is appropriate when the value of an observed variable is a random proportion of the
previously observed value.
Example: Distribution of various biological phenomena, distribution of sizes from a breakage
process, distribution of income size, life distribution of some transistor types, etc. In
cases where the data are log-normally distributed, the geometric mean is a better
descriptor of the data than the arithmetic mean. The more closely the data follow a
log-normal distribution, the closer the geometric mean is to the median, and log
re-expression then produces a symmetrical distribution. The ratio of two independent
log-normally distributed variables is also log-normal.
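
The relationship between the geometric mean and the median can be checked on simulated data. The sketch below assumes log-normal data whose logarithm is normal with mean 2 and standard deviation 1 (parameters chosen arbitrarily) and uses `statistics.geometric_mean`, available in Python 3.8 and later.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Simulated log-normal data: log(x) is normal with mu = 2 and sigma = 1.
data = [random.lognormvariate(2.0, 1.0) for _ in range(5_000)]

arithmetic_mean = statistics.mean(data)
geometric_mean = statistics.geometric_mean(data)   # exp(mean(log(x)))
median = statistics.median(data)

print(f"arithmetic mean = {arithmetic_mean:.2f}")
print(f"geometric mean  = {geometric_mean:.2f}")
print(f"median          = {median:.2f}  (theoretical median: {math.exp(2.0):.2f})")
```

The long right tail pulls the arithmetic mean well above the median, while the geometric mean stays close to it.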

Poisson
Application: It is commonly used in quality control, reliability, queuing theory, etc. If
events take place independently at a constant rate, it gives the probability of exactly x
occurrences during a given period of time. It may also represent the number of occurrences
over constant areas or volumes. It is frequently used as an approximation to the binomial
distribution.
Example: Used to represent the distribution of the number of defects in a piece of material,
customer arrivals, insurance claims, incoming telephone calls, radiation emitted, etc.
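
The Poisson probability of exactly x occurrences at mean rate λ is e^(-λ) λ^x / x!. The sketch below evaluates it for an invented defect rate and then compares it with the binomial distribution for a large number of trials and a small success probability, where the approximation is expected to be close. The specific numbers are assumptions.

```python
import math


def poisson_pmf(x: int, rate: float) -> float:
    """P(exactly x events) when events occur independently at a constant mean rate."""
    return math.exp(-rate) * rate ** x / math.factorial(x)


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(exactly x successes in n independent trials with success probability p)."""
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)


# Hypothetical example: on average 2 defects per piece of material.
for x in range(5):
    print(f"P({x} defects) = {poisson_pmf(x, 2.0):.4f}")

# Poisson as an approximation to the binomial: large n, small p, rate = n * p.
n, p = 1000, 0.002
max_gap = max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, n * p)) for x in range(11))
print(f"largest difference for x = 0..10: {max_gap:.5f}")
```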

Geometric
Application: It gives the probability of the number of binomial trials required before the
first success is achieved.
Example: It can be used in quality control, reliability, and other industrial situations.
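
Two conventions are in use for the geometric distribution: counting the trial on which the first success occurs, or counting the failures before it. The sketch below assumes the first convention, so P(first success on trial k) = (1 - p)^(k-1) p. The defect probability is invented for illustration.

```python
def geometric_pmf(k: int, p: float) -> float:
    """P(the first success occurs on trial k), for k = 1, 2, 3, ..."""
    if k < 1 or not 0.0 < p <= 1.0:
        raise ValueError("need k >= 1 and 0 < p <= 1")
    return (1.0 - p) ** (k - 1) * p


# Hypothetical inspection line where each item is defective with probability 0.1.
p_defect = 0.1
for k in (1, 2, 5, 10):
    print(f"P(first defective item is number {k}) = {geometric_pmf(k, p_defect):.4f}")

expected_trials = 1.0 / p_defect  # mean of the geometric distribution
print(f"expected number of items inspected: {expected_trials:.0f}")
```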

Binomial
Application: It gives the probability of exactly x successes in n independent trials, when
the probability of success p on a single trial is constant. Used frequently in quality
control, reliability, survey sampling, and other industrial problems.
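
In quality control the question is often not "exactly x" but "at most x" defective items, which is a sum of binomial probabilities. A minimal sketch follows, with an invented sample size and defect probability.

```python
import math


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(exactly x successes in n independent trials, each with probability p)."""
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)


def binomial_cdf(x: int, n: int, p: float) -> float:
    """P(at most x successes in n trials)."""
    return sum(binomial_pmf(i, n, p) for i in range(x + 1))


# Hypothetical quality-control check: 20 items sampled, 5% defect probability each.
n_items, p_defect = 20, 0.05
print(f"P(exactly 1 defective) = {binomial_pmf(1, n_items, p_defect):.4f}")
print(f"P(at most 1 defective) = {binomial_cdf(1, n_items, p_defect):.4f}")
```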

## I. Goodness of Fit (Chi-Square Distribution)

In the chi-square distribution, the probability density curve extends only over the positive
side of the axis and has a long right tail. The form of the curve depends on the number of
degrees of freedom. The chi-square distribution is mainly used in chi-square tests for
association. Chi-square tests are tests of statistical significance and are widely used in
bivariate tabular association analysis. The hypothesis is whether or not two populations
differ enough in some characteristic or aspect of their behavior, based on two random
samples. This procedure is also known as the Pearson chi-square test. The chi-square test
is also used to see whether an observed distribution agrees with a particular theoretical
distribution. The test statistic is calculated by comparing the observed data with the data
expected under that distribution.
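
The comparison of observed and expected data is usually done with the Pearson statistic, the sum of (observed - expected)² / expected over all categories. The sketch below applies it to invented die-roll counts tested against a uniform distribution. The critical value 11.070 is the tabulated chi-square value for 5 degrees of freedom at the 5% significance level.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 120 rolls of a die, tested against a uniform distribution.
observed = [18, 22, 16, 25, 19, 20]
expected = [sum(observed) / len(observed)] * len(observed)   # 20 per face

statistic = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1

# Tabulated critical value of chi-square for 5 degrees of freedom at the 5% level.
CRITICAL_5DF_05 = 11.070

print(f"chi-square = {statistic:.2f} with {degrees_of_freedom} degrees of freedom")
print("reject fit" if statistic > CRITICAL_5DF_05 else "no evidence against the fit")
```

A statistic below the critical value means the observed counts are compatible with the assumed distribution at that significance level. It does not prove that the distribution is correct.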