
Data Mining vs. Data Assimilation
S. Lakshmivarahan
School of Computer Science
University of Oklahoma
Norman, Oklahoma
varahan@ou.edu


Data Mining (DM) - early beginnings

Much of what we know in the physical sciences had its origins
in Astronomy - with observations of celestial objects
Thanks to the Herculean efforts of:
Copernicus (1473-1543)
Galileo (1564-1642)
Kepler (1571-1630)
Newton (1643-1727)

This is only a small sampling from a long list of pioneers


Discovery of simple laws from observations

Observations collected over decades were meticulously
analyzed by hand to formulate new laws of nature
Examples:
Heliocentric system
Kepler's three laws of planetary motion
Newton's law of gravitation
Newton's three laws of motion

Within the context of physical sciences these are some of the
earliest examples of data mining
Note: In Chemical, Biological and other Sciences there are
similar instances, replete with historical
facts, that can illustrate the use of data mining in each of
these disciplines


What is Data Mining?

DM is the process of extracting the structure or patterns that are
inherent in the data/observations
These patterns provide clues about the data generating
process
The ultimate goal of DM is to understand and quantify the data
generating process
Since the motion of celestial objects inherently follows
certain laws, the early pioneers, with their hard work and ingenuity,
could discover the laws that laid the foundation of the
physical sciences and engineering as we know them today


Abundance of data - revival of DM

The volume of data collected doubles every three years - thanks
to technology:
Computers
Large scale storage device technology
Communication and sensor technologies

Today, interest in DM spans:
Physical sciences, Biological sciences, Medical Sciences
Space exploration, All branches of Engineering,
Environmental Sciences, Ecology
Economics, Social Sciences, Finance, Banking and Commerce,
Sports and recreation
Governments, private companies

More about DM a bit later. Back to early Astronomy


Development of Calculus and discovery of dynamic models

Mathematical models were introduced by combining concurrent
developments in
Physical laws - Newton's laws
Calculus by Newton (1643-1727) and Leibniz (1646-1716)
among others

This naturally led to the development of dynamic models to
describe the motion of planets around the sun
With the availability of models, the potential for forecast or
prediction became very clear


Discovery of Least squares - Beginnings of Data Assimilation (DA)

Gauss (1777-1855), when he was only 24 years old and using the
known models of his time, undertook the challenging problem
of predicting when the celestial object called Ceres would
reappear in the telescope
The model had unknown parameters that needed to be
estimated
By combining the model with observations in the least squares
sense, Gauss estimated the unknown parameters - creating the
first assimilated model
He then used this assimilated model to accurately predict
the time and location of reappearance of the lost astronomical
object


Gauss laid the foundation for DA

This work led to the development of the method of least
squares as we know it today
The method of least squares still continues to dominate the theory
and practice of estimation of unknown parameters
By this time Gauss had also introduced the statistical
analysis of the distribution of observational errors,
which follow the bell shaped curve that we now call the
normal or Gaussian distribution


What is Data Assimilation?

Fusion of model with data
Models are general descriptions of the underlying physical
processes in question
A model represents a class - suitably parametrized
Examples:
Static regression models have unknown coefficients
Dynamic models have unknown initial/boundary conditions +
physical parameters such as Reynolds number, coefficient of
thermal expansion of water, specific heat of water, etc.

Data/observations reveal all the secrets of, or the truth about,
the process that the model tries to capture


DA - fusion of models with data

By combining models and data - estimating the unknown
parameters of the models using the data - we can get a
specialized instantiation of the model called the assimilated
model
This assimilated model is a good tool for creating forecasts or
predictions
One of the standard tools for the fusion of model and data is
based on the method of least squares
The discipline of DA primarily deals with the development of
methods for fusing models with data
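
The fusion step described on this slide can be sketched on a toy problem. The sketch assumes a straight-line model y = a + b*t with two unknown parameters and five made-up observations; the names `fit_line` and `forecast` and the data are hypothetical, chosen only for illustration. It estimates the parameters by least squares and then uses the resulting assimilated model for a forecast.

```python
def fit_line(ts, ys):
    """Least squares estimate of (a, b) in the assumed model y = a + b * t."""
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    s_tt = sum((t - t_mean) ** 2 for t in ts)
    s_ty = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    b = s_ty / s_tt              # slope that minimizes the sum of squared residuals
    a = y_mean - b * t_mean      # intercept follows from the means
    return a, b


def forecast(params, t):
    """Direct problem: run the assimilated model forward to time t."""
    a, b = params
    return a + b * t


# Hypothetical observations (illustrative values, not from the slides).
ts = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

params = fit_line(ts, ys)            # data + model class -> assimilated model
prediction = forecast(params, 5.0)   # assimilated model -> forecast
print(params, prediction)
```

The same two-step pattern (estimate, then predict) carries over to richer model classes; only the estimation step becomes harder.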


Goal of DA - generate good forecast/prediction

Predict the path of a hurricane or tornado - using one of several
models + data collected using satellites, radars, and special
planes that fly into hurricanes twice a day
From the crime scene data, reconstruct the case - CSI: Miami
NTSB estimates the causes of failure using the data from the
debris
Predict the potential tax revenues so that a Government can
develop its budget for the next year
Medical diagnosis - from symptoms to the cure


Direct vs. Inverse problems - A classification


To further explore the relation between DM and DA, we introduce a useful classification
Scientific and Engineering problems can be classified into one
of two types
Direct problems - Examples
Given a polynomial p(x), evaluate it at x = 1.0
Given a differential equation and the initial condition, find the
solution
Given a matrix A and a vector x, compute the vector b = Ax

Inverse problems - Examples
Given a polynomial p(x), solve for the roots of p(x) = 0
Given a differential equation and a particular solution, find the
initial condition that corresponds to the solution
Given a matrix A and a vector b, find the solution x such that
Ax = b

It turns out that DM and DA naturally correspond to two
types of inverse problems and prediction is a direct problem
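
The polynomial and matrix examples above can be made concrete with a short sketch. The polynomial p(x) = x^2 - 5x + 6, the 2 x 2 matrix, and all function names are assumptions made for illustration. Bisection and Cramer's rule are used here only because they are short, not because the slides prescribe them.

```python
def poly_eval(coeffs, x):
    """Direct problem: evaluate p(x) by Horner's rule (highest degree first)."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result


def bisect_root(coeffs, lo, hi, tol=1e-12):
    """Inverse problem: find x in [lo, hi] with p(x) = 0.
    Assumes p(lo) and p(hi) have opposite signs."""
    f_lo = poly_eval(coeffs, lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = poly_eval(coeffs, mid)
        if f_lo * f_mid <= 0:
            hi = mid                 # the root lies in the left half
        else:
            lo, f_lo = mid, f_mid    # the root lies in the right half
    return 0.5 * (lo + hi)


def matvec(A, x):
    """Direct problem: compute b = A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def solve_2x2(A, b):
    """Inverse problem: recover x from A x = b (Cramer's rule, 2 x 2 only)."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("matrix is singular")
    return [(b[0] * a22 - a12 * b[1]) / det, (a11 * b[1] - a21 * b[0]) / det]


p = [1.0, -5.0, 6.0]                  # p(x) = x^2 - 5x + 6
value = poly_eval(p, 1.0)             # direct
root = bisect_root(p, 2.5, 4.0)       # inverse

A = [[2.0, 1.0], [1.0, 3.0]]
b = matvec(A, [1.0, 2.0])             # direct
x = solve_2x2(A, b)                   # inverse
print(value, root, b, x)
```

Note how the direct problems need only a fixed number of arithmetic operations, while the inverse problems need either an iteration or a solvability check.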

First level of inverse problems - The Core of Data Mining


At the highest level, Data Mining relates to solving the
important class of inverse problems leading to the discovery
of basic laws/models that are implied by the data
Examples of discovery of laws/models from data include:
Basic laws in early Astronomy - Kepler, Newton
Atom models in the early 1900s
Higgs boson, the so-called God particle, in 2012
Theory of evolution by C. Darwin
Building models to identify credit card fraud
Based on the observed structure of the autocorrelation of a
time series, decide on the class and the type of model that
might capture the observed autocorrelation

Data Mining has been and still continues to be the basis for
the advancement of knowledge in all of Sciences and
Engineering

Second level of inverse problems - The Core of Data Assimilation

Assume now that the newly discovered mathematical laws are
expressed in the form of a class of models
The problem then becomes one of data assimilation, which
relates to solving a second level of inverse problem that deals
with the estimation of the unknown parameters of the model
using the same or similar data
Examples:
Determination of the weights for links connecting the neurons
in an Artificial Neural Network - minimize classification error
Estimate the sea surface temperature using satellite
observations - based on the Planck/Stefan law of radiation
Estimate the amount of rain in a cloud system using radar
observations - based on an empirical law
Estimate the structure of the earth - based on the anomaly of
the local gravitational field - basis for geophysical exploration


Third level involves Prediction - a direct problem

Once an assimilated model is made available, interest then
shifts to the direct problem of generating short term
predictions
Predict lunar/solar eclipses
Prediction of total revenue by a state treasury
Prediction of how much snow will fall in Boston due to a coastal low
pressure system
Prediction of the amount of greenhouse gases in the atmosphere by
2025


Is it DM or DA?

First phase: At its core DM relates to the discovery of basic
knowledge - Remember Kepler and Newton
This knowledge is often expressed as a law which is
encapsulated in a (mathematical) model
Emphasis then shifts to testing the goodness of a model
Second phase: At its core DA deals with the problem of
estimating the unknowns by fitting the model to data - Remember Gauss
Third phase: Using the assimilated model, generate forecast
products for public consumption
DM and DA are the two parts of a continuum


A classification of models

Models: Based on causality (Motion of a Hurricane) vs.
correlation (ARMA model in time series)
Models: Explicit (ARMA model) vs. implicit (Neural Networks)
Models: Static (Regression) vs. Dynamic (ODE/PDE)
Models: Deterministic (motion of a planet) vs. stochastic
(evolution of stock prices)
Models: Linear vs. nonlinear
Model Time: Discrete (unemployment) vs. continuous
(temperature)
Model Space: Discrete (Markov chain) vs. continuous (rainfall)


Forms of Data
Data arise in various forms:
Time series data - annual rainfall, total monthly sales
Data matrix (m × n) - n objects (columns) and m attributes
(rows)
Cross-sectional data - Tabular forms
Practical problems: Missing data, outliers, Data quality
control
Note: In Science and Engineering, data are often of the
quantitative type (permitting full-blown arithmetic
operations). In Economics, Social Sciences, etc., data could be
a mixture of both quantitative and qualitative types.
Algorithms for mining/assimilating qualitative data differ from
those for quantitative data sets


Estimation - Over vs. under determined problems

Two scenarios arise depending on the cost of collecting
observations
Over determined (OD) case - the number of observations is much
larger than the number of unknowns to be estimated - Once
deployed, satellites and radars deliver large amounts of data
for quite a long time
Under determined (UD) case - fewer observations
than unknowns - Exploration for
minerals, natural gas, oil, etc.

In the OD case there is, in general, no exact solution and in the UD case there
are infinitely many solutions
These cases are the motivation for the definition of a solution in
the least squares sense
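
The two cases can be illustrated with the smallest possible examples. In the sketch below the OD case has three inconsistent equations in one unknown and the UD case has one equation in two unknowns; the function names and numbers are hypothetical. For the UD case the sketch picks the minimum-norm solution, which is one common convention and is assumed here rather than taken from the slides.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def od_least_squares(a, b):
    """Over determined: m equations a[i] * x = b[i] in one unknown x.
    Returns the x that minimizes sum_i (a[i] * x - b[i]) ** 2."""
    return dot(a, b) / dot(a, a)


def ud_minimum_norm(a, b):
    """Under determined: one equation dot(a, x) = b in n unknowns.
    Among the infinitely many solutions, returns the one of smallest norm."""
    scale = b / dot(a, a)
    return [scale * ai for ai in a]


# OD: x = 1.0, x = 1.2 and x = 0.8 cannot all hold; least squares gives the mean.
x_od = od_least_squares([1.0, 1.0, 1.0], [1.0, 1.2, 0.8])

# UD: 3*x1 + 4*x2 = 25 has infinitely many solutions; pick the shortest one.
x_ud = ud_minimum_norm([3.0, 4.0], 25.0)
print(x_od, x_ud)
```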


Framework for DA

The estimation problem is recast as a constrained minimization
problem
Constraints arise naturally:
Positivity of certain physical parameters - inequality constraint
Model itself acts as a constraint - equality constraint

Strict enforcement of constraints - Strong constraint
formulation - Lagrangian multiplier technique
Weak enforcement of constraints - Weak constraint
formulation - Penalty function technique
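
The difference between the two formulations can be seen on a toy problem, assumed here purely for illustration: minimize x^2 + y^2 subject to the equality constraint x + y = 1. The closed-form solutions in the sketch follow from setting the gradients of the Lagrangian and of the penalized cost to zero for this specific problem; the function names and the penalty weight `rho` are hypothetical.

```python
def strong_constraint_solution():
    """Lagrange multiplier solution of: minimize x**2 + y**2 subject to x + y = 1.
    Stationarity of L = x**2 + y**2 + lam * (x + y - 1) gives
    2x + lam = 0, 2y + lam = 0 and x + y = 1."""
    x = y = 0.5
    lam = -1.0
    return x, y, lam


def weak_constraint_solution(rho):
    """Penalty solution of: minimize x**2 + y**2 + rho * (x + y - 1)**2.
    Setting the gradient to zero gives x = y = rho / (1 + 2 * rho)."""
    x = y = rho / (1.0 + 2.0 * rho)
    return x, y


xs, ys_, lam = strong_constraint_solution()
violations = {}
for rho in (1.0, 10.0, 1000.0):
    xw, yw = weak_constraint_solution(rho)
    violations[rho] = 1.0 - (xw + yw)   # how far the constraint is from holding
print((xs, ys_, lam), violations)
```

In this example the strong formulation satisfies the constraint exactly, while the weak formulation violates it by 1 / (1 + 2 * rho), which shrinks as the penalty weight grows.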


Well-posed vs. ill-posed problems

In a well-posed problem a solution exists and is unique
In an ill-posed problem a solution may not exist, or there may be
infinitely many solutions
Many inverse problems are ill-posed
These are solved by using some form of regularization
technique - e.g. Tikhonov regularization
Using regularization we solve the nearest well-posed version of
a given ill-posed problem
Example: Solving (A + αI) x(α) = b instead of Ax = b for
some small positive α for which (A + αI) is positive definite
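
A minimal sketch of this example follows, assuming a singular symmetric 2 x 2 matrix and writing the small positive regularization parameter as `alpha` (a notation chosen here). The matrix, right-hand side, and function names are hypothetical. For a general, non-symmetric A, the usual Tikhonov form works with A^T A + alpha I instead; the simpler shifted system is used here because it matches the example on the slide.

```python
def solve_2x2(A, b):
    """Solve a 2 x 2 linear system by Cramer's rule."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("matrix is singular")
    return [(b[0] * a22 - a12 * b[1]) / det, (a11 * b[1] - a21 * b[0]) / det]


def regularized_solve_2x2(A, b, alpha):
    """Solve (A + alpha * I) x = b instead of the ill-posed A x = b."""
    A_reg = [[A[0][0] + alpha, A[0][1]],
             [A[1][0], A[1][1] + alpha]]
    return solve_2x2(A_reg, b)


# A is singular: A x = b reduces to x1 + x2 = 2, with infinitely many solutions.
A = [[1.0, 1.0], [1.0, 1.0]]
b = [2.0, 2.0]

try:
    solve_2x2(A, b)
    unregularized_failed = False
except ValueError:
    unregularized_failed = True

x_reg = regularized_solve_2x2(A, b, alpha=0.01)   # unique solution near [1, 1]
print(unregularized_failed, x_reg)
```

Here A + alpha I has eigenvalues alpha and 2 + alpha, so it is positive definite for any alpha > 0, and the regularized solution 2 / (2 + alpha) in each component approaches the minimum-norm solution [1, 1] as alpha shrinks.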


Methods for estimation

Parametric vs. non-parametric methods
Least squares - two versions
Unweighted least squares - orthogonal projection
Weighted least squares - oblique projections

Generalized method of moments
Maximum likelihood methods
Bayesian methods where we combine a known prior with
conditional distributions
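
The two versions of least squares can be contrasted on the simplest estimation problem: estimating one constant from repeated observations of unequal quality. The sketch assumes inverse-variance weights, a common choice that coincides with maximum likelihood when the errors are independent and Gaussian with known variances. The observations, variances, and function names are hypothetical.

```python
def unweighted_estimate(ys):
    """Minimizes sum_i (y_i - x) ** 2: the plain average."""
    return sum(ys) / len(ys)


def weighted_estimate(ys, variances):
    """Minimizes sum_i (y_i - x) ** 2 / variance_i: an inverse-variance average."""
    weights = [1.0 / v for v in variances]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


# Two hypothetical observations of the same quantity; the second is noisier.
ys = [10.0, 12.0]
variances = [1.0, 4.0]

x_unweighted = unweighted_estimate(ys)
x_weighted = weighted_estimate(ys, variances)
print(x_unweighted, x_weighted)
```

The weighted estimate is pulled toward the more accurate observation, which is the practical effect of replacing the orthogonal projection by an oblique one.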


Optimization problems

Unimodal vs. multimodal problems
Continuous vs. discrete optimization problems
Continuous, unimodal problems solved using:
Gradient method
Conjugate gradient method
Quasi-Newton method

Continuous multimodal and discrete optimization problems
solved using randomized techniques:
Simulated annealing
Genetic algorithms
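
As an illustration of the first item for continuous, unimodal problems, here is a minimal gradient method with a fixed step size. The cost function f(x, y) = (x - 1)^2 + 2 (y + 2)^2, the step size, and the iteration count are assumptions made for this sketch; practical codes usually choose the step by a line search.

```python
def gradient(point):
    """Gradient of the assumed cost f(x, y) = (x - 1)**2 + 2 * (y + 2)**2."""
    x, y = point
    return (2.0 * (x - 1.0), 4.0 * (y + 2.0))


def gradient_descent(start, step=0.1, iterations=200):
    """Repeatedly move against the gradient with a fixed step size."""
    x, y = start
    for _ in range(iterations):
        gx, gy = gradient((x, y))
        x, y = x - step * gx, y - step * gy
    return x, y


minimum = gradient_descent((0.0, 0.0))
print(minimum)   # close to the unique minimizer (1, -2)
```

Because the cost has a single minimum, the starting point does not matter here; for multimodal problems the same iteration can stall in a local minimum, which is what motivates the randomized techniques listed above.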


Methods for DM/DA - I

Time series analysis
Signal processing in EE
Medicine
Econometrics, Finance

The goal is to build stochastic dynamic models in discrete
time by exploiting the underlying correlation and seasonality
properties of the data set
In Finance, model both level and volatility
Autoregressive integrated moving average (ARIMA) models
This is one of the well developed areas in empirical modeling
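
The simplest member of the ARIMA family, a first-order autoregressive model x_t = phi * x_{t-1} + e_t, shows how correlation in the data is turned into a model. The sketch simulates such a series with an assumed coefficient of 0.7 and then recovers the coefficient by least squares. The function names, seed, and sample size are hypothetical, and synthetic data are used only so that the true value is known.

```python
import random


def simulate_ar1(phi, n, seed=0):
    """Generate n samples of x_t = phi * x_{t-1} + e_t with unit-variance Gaussian noise."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x


def estimate_ar1(x):
    """Least squares estimate of phi: regress x_t on x_{t-1}."""
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    return num / den


series = simulate_ar1(phi=0.7, n=5000)
phi_hat = estimate_ar1(series)
print(phi_hat)   # should be near the true value 0.7
```

With the estimated coefficient in hand, a one-step forecast is simply `phi_hat * series[-1]`.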


Methods for DM/DA - II


Multivariate regression analysis (1800s) - Statistics
Data reduction using PCA (1940), ICA (1990) - Statistics
Classification using
Clustering (1950s)
Neural networks (1950s)
Pattern recognition (1950s)
Support Vector Machines (SVM) (1980s)

Association rules
Image processing, voice recognition
Decision trees (1960)
Probabilistic reasoning in networks (1990s) - J. Pearl, Turing
Award in 2012
Random fields - Spatial data analysis
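
Of the methods listed, data reduction by PCA is easy to sketch in two dimensions, where the eigen-decomposition of the sample covariance matrix has a closed form. The function name and the three sample points are hypothetical; real applications would use a linear algebra library on higher-dimensional data.

```python
import math


def principal_axis_2d(points):
    """Return the first principal direction and its variance for 2-D data."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the 2 x 2 sample covariance matrix.
    cxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    cyy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Closed-form leading eigenvector and eigenvalue of a symmetric 2 x 2 matrix.
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    direction = (math.cos(theta), math.sin(theta))
    largest_variance = 0.5 * (cxx + cyy) + math.hypot(0.5 * (cxx - cyy), cxy)
    return direction, largest_variance


# Points lying exactly on the line y = 2x: one direction carries all the variance.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
direction, variance = principal_axis_2d(points)
print(direction, variance)
```

Projecting each centered point onto `direction` reduces the two attributes to a single coordinate without losing information in this example.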


Commonality of approaches in DM, DA, AI, Machine Learning

Supervised learning
Learning with a teacher - Learning in Neural Networks
Learning with a probabilistic teacher - using imprecise
knowledge

Unsupervised learning/Learning without a teacher - Clustering, Adaptive Control
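
Clustering, named above as learning without a teacher, can be sketched with a one-dimensional version of the k-means (Lloyd) iteration. This particular algorithm is chosen here as an example and is not named in the slides; the data, initial centers, and function name are hypothetical.

```python
def kmeans_1d(values, centers, iterations=20):
    """Alternate between assigning values to the nearest center and
    moving each center to the mean of its assigned values."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            groups[nearest].append(v)
        # Keep the old center if a group happens to be empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


# No labels are supplied; the two groups are discovered from the data alone.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers = kmeans_1d(values, centers=[0.0, 6.0])
print(centers)
```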


Summary

At the first level, Data Mining seeks to uncover the basic laws
that are hidden in the data. These laws are represented by
models of some kind with unknown parameters
At the second level, Data Assimilation deals with the task of
fusing data with models to produce an assimilated model - by
estimating the unknown parameters
At the third level, the assimilated model is used to produce
various forecast products for public consumption
DM, DA and Forecasting are the three parts of a continuum
in knowledge discovery


