
Machine Learning Is The Dual Of Query

Erik Meijer, Applied Duality


Brian Beckman, Amazon
To make applications smart, they must be highly specialized for each individual user. One might assume the only way to get customized behavior is to write a customized app, but that would be neither feasible nor correct. Instead, apps can and must learn new code, and users will and must tolerate approximate, nondeterministic decisions from their apps. Over time, apps that learn deliver more precise and accurate answers as they accumulate data from the users' interactions. Machine learning is the technique to automatically turn data into code; in fact, it is the dual of query, which turns code into data. Where it took several decades for query technology to democratize and hence to be broadly usable for practitioners to build applications, machine learning is rapidly approaching the status of commodity technology. By composing query with its dual, machine learning, we can use the code generated from historical data to query real-time streaming data; this composition proves to be the future of big data.

We are all familiar with the concept of a database. For most developers, a database is a magic black box to which they can send a piece of code (a query) and that, as a response, returns a bunch of data that is the answer to the query, based on the information hidden inside the database. We can thus cleanly model a database that stores values of type $T$ as a higher-order function $database \in (T \to R) \to [(T, R)]$ that takes a query function of type $T \to R$ as argument and produces a result set of tuples of input values and results. An example query function is a projection function that produces an employee's salary from an entire employee record, and returns the employee plus its salary. Note that our notion of query is slightly unusual in the sense that we return both the original item and the result.

Typical queries are multi-staged compositions of elemental filters, projections, grouping, aggregation, and ordering operators.

A filter query that operates on a database with items of type $T$ uses a predicate $P \in T \to \mathbb{B}$ to reject values for which the predicate is false.

After filtering out uninteresting items, a query then usually groups the remaining elements according to some partition function $G \in T \to K$ into a key space $K$, which yields a nested collection of equivalence classes of values of type $T$ that map into the same representative key of type $K$, and then extracts the information we are interested in from each item in the group using a projection function $X \in T \to S$.

We optionally aggregate each group (whose items are now of type $S$ after the projection step) into a single value using an aggregation function $A \in [S] \to R$.

A relational database is not able to deal with nested collections and requires the groups to be aggregated into a single value; in the general case, however, we can aggregate collections into other collections.


Finally, we sort the resulting collection of values of type $R$ using an ordering, or distance, function $O \in (R, R) \to \mathbb{R}$ to obtain the final result.
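The staged query described above can be sketched in Scala. This is only an illustration under our own assumptions: the `Employee` record, the sample data, and the names `database` and `run` are hypothetical and not part of any particular query engine. The parameters `p`, `g`, `x`, `a`, and `o` play the roles of $P$, $G$, $X$, $A$, and $O$.

```scala
object QueryPipeline {
  // Hypothetical record type, used only for illustration.
  final case class Employee(name: String, dept: String, salary: Double, active: Boolean)

  // database : (T => R) => [(T, R)], modelled over an in-memory list:
  // each item is returned together with the result of the query function.
  def database[T, R](items: List[T])(query: T => R): List[(T, R)] =
    items.map(t => (t, query(t)))

  // Filter with p, group with g, project with x, aggregate with a,
  // then sort the aggregated groups with the ordering function o.
  def run[T, K, S, R](items: List[T])(
      p: T => Boolean,
      g: T => K,
      x: T => S,
      a: List[S] => R,
      o: (R, R) => Double
  ): List[(K, R)] =
    items
      .filter(p)
      .groupBy(g)
      .toList
      .map { case (k, group) => (k, a(group.map(x))) }
      .sortWith((l, r) => o(l._2, r._2) < 0)

  val employees: List[Employee] = List(
    Employee("Ann", "eng", 120.0, active = true),
    Employee("Bob", "eng", 100.0, active = true),
    Employee("Cid", "ops", 90.0, active = true),
    Employee("Dee", "ops", 70.0, active = false)
  )

  // The salary projection from the text: every employee paired with its salary.
  val salaries: List[(Employee, Double)] = database(employees)(_.salary)

  // Total salary of active employees per department, smallest total first.
  val totalByDept: List[(String, Double)] =
    run(employees)(
      _.active,
      _.dept,
      _.salary,
      (s: List[Double]) => s.sum,
      (r1: Double, r2: Double) => r1 - r2
    )
}
```

Every stage is an ordinary function argument, which is exactly what makes each of them a candidate for being learned from examples instead of written by hand.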


Unfortunately, in many cases, we do not know how to implement the various filter, project, grouping, and ordering functions that make up a query. For example, if we are querying our email, we might want to first filter out all spam messages. While we know (intuitively) what is and what isn't spam, it is impossible to define an exact algorithmic implementation of a predicate $IsSpam \in Email \to \mathbb{B}$. However, given a set of examples of spam and non-spam emails represented as a collection of examples of type $[(Email, \mathbb{B})]$, we can use Bayesian classification [http://en.wikipedia.org/wiki/Naive_Bayes_classifier], Decision Trees [http://en.wikipedia.org/wiki/Decision_tree] or other off-the-shelf machine learning algorithms to infer a function of type $(Email \to \mathbb{B})$.

Suppose we slightly squint at filtering a database of emails using a predicate $IsSpam \in Email \to \mathbb{B}$: instead of throwing away emails that are spam, we have it return a collection $[(Email, \mathbb{B})]$ of emails paired with the result of applying $IsSpam$ (we can always group the results later and throw away the spam messages then, just like our email client puts spam messages in a special mailbox). Then we see that query is the dual of machine learning. Given a function of type $(Email \to \mathbb{B})$, a query produces a collection of pairs of type $[(Email, \mathbb{B})]$.

Similarly, when we want to group a set of values, we often don't know exactly how to define the partition function $G \in T \to K$ either, but we are able to identify small groups of mean values that are representative of the groups we want to construct, represented as a k-tuple of collections $([T], \ldots, [T])$. Given such a k-tuple, we can use the k-means clustering [http://en.wikipedia.org/wiki/K-means_clustering] algorithm to create a grouping function $G \in T \to \mathbb{N}$. For example, a social network like Google+ or Facebook can cluster your existing friends (and recommend new friends) into circles or categories based on an initial classification into k groups.

Note that the type $([T], \ldots, [T])$ of k-tuples of collections of type $T$ is isomorphic to the type $[(T, \mathbb{N})]$ of collections of pairs of values of type $T$ and integers, and hence it precisely fits the pattern of synthesizing a function $T \to \mathbb{N}$ from a set of example pairs $[(T, \mathbb{N})]$. On the query side, grouping takes a function $(T \to \mathbb{N})$ and produces a nested collection of type $[([T], \mathbb{N})]$, which is isomorphic to $[(T, \mathbb{N})]$ assuming each group has a unique key, again the dual of the learning algorithm. Since automatic clustering is based on the "similarity" of values, a common trick is to first project $T$ to another type $K$ on which you then cluster.

To sort a collection of values of type $T$, we need a measure function $d \in (T, T) \to \mathbb{R}$. But often, when presented with two values of type $T$, we can tell the distance between them; that is, we can produce a collection of pairs $[((T, T), \mathbb{R})]$ that compares individual values of type $T$. When you sign up for Netflix, it will ask you a few questions about movies. Your answers effectively seed the training set for their ranking algorithm that will suggest which movies you will most enjoy. Unsurprisingly, we can turn that collection into a distance function $(T, T) \to \mathbb{R}$ using one of many ranking algorithms [http://en.wikipedia.org/wiki/Learning_to_rank]. As with grouping, it is common to first project $T$ into another type $S$ that is easier to train. Based on your actual picks of movies, Netflix will use this information to incrementally refine the inferred distance function.

All the above machine learning algorithms are examples of supervised learning methods [http://en.wikipedia.org/wiki/Supervised_learning]. Category theorists define a "(partial) function" of type $T \to R$ as "a subset of the cartesian product $(T, R)$ such that no element of $T$ appears more than once in the subset" (we'll say more about this restriction later). The supervised learning training set is a noised-up sample of the function. The Machine Learning algorithm uses heuristics to infer (a rule for) the function from the noisy sample. Often k-means clustering is not considered a supervised learning method, but that reflects the case where we start with a k-tuple with empty collections of seed values.

Selector functions do not fit into the supervised learning model; they are the prime example of unsupervised learning. Instead of giving a training set of example pairs $[(S, R)]$, we fix the result type $R$ and infer the "entity extraction" function from $S$ to $R$. In many cases, entity extraction uses domain knowledge and natural language processing to extract phone numbers, zip codes, names, etc. from unstructured text.

Finding aggregation functions is usually not a problem, and such functions are chosen from a standard set like summation, average, and so on. Where aggregation becomes interesting is when dealing with infinite streams, in which case we need to apply approximate streaming algorithms [http://en.wikipedia.org/wiki/Streaming_algorithm] that run in constant space. For example, given an infinite, or very large, stream, we cannot exactly determine the n most frequently occurring items in the stream, and we must use techniques like the "space saving" algorithm to compute a running approximation. Bloom filters, count-min sketch, and "sketching" in general are terms referring to algorithms for finding approximate answers in big data. But that is a topic outside the scope of the current paper.

The above examples show that we can view supervised machine learning as "co-query processing" that takes as input a training set $[(T, R)]$ of pairs of examples and turns this into a function of type $(T \to R)$. Machine learning thus is the dual of query. From a developer's perspective, we can also say that machine learning is automated Test-Driven Development (TDD), wherein a human developer generalizes from a small set of specific examples to a full working program. Machine Learning is much more exciting than query. Query specializes from a large set of data into a small fraction (by filtering, projection, and aggregation), whereas Machine Learning generalizes from a small set of examples to handle a large set of inputs.
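To make the duality concrete, the two directions can be written side by side in Scala. This is a sketch under stated assumptions, not a real learner: `learn` uses a 1-nearest-neighbour rule over numeric inputs purely as a stand-in for a proper algorithm, and the temperature data and all names are hypothetical.

```scala
object Duality {
  // Query: code in, data out. Type (T => R) => [(T, R)].
  def query[T, R](data: List[T])(f: T => R): List[(T, R)] =
    data.map(t => (t, f(t)))

  // Learn: data in, code out. Type [(T, R)] => (T => R), here with T = Double.
  // Assumption: a 1-nearest-neighbour rule stands in for a real learner.
  def learn[R](training: List[(Double, R)]): Double => R = {
    require(training.nonEmpty, "training set must not be empty")
    x => training.minBy { case (t, _) => math.abs(t - x) }._2
  }

  // Hypothetical examples: temperatures labelled by hand.
  val training: List[(Double, String)] =
    List((1.0, "cold"), (2.0, "cold"), (24.0, "warm"), (30.0, "warm"))

  // The synthesized function generalizes to inputs it has never seen.
  val label: Double => String = learn(training)
  val unseen: List[(Double, String)] = query(List(5.0, 20.0, 35.0))(label)

  // Querying the training inputs with the learned code gives back the examples.
  val roundTrip: List[(Double, String)] = query(training.map(_._1))(label)
}
```

With this noise-free training set, `roundTrip` reproduces the examples exactly; with noisy or contradictory examples a real learner would only approximate them.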

Caveat Emptor

Many people will object that we have painted a very naive picture of machine learning and that we ignore many of the essential elements of using machine learning, such as cross validation, ... That is a totally fair criticism. However, instead of pointing at those issues as a reason that black-box machine learning won't work, we should take them in the spirit of Hamming's advice and start working to eliminate these obstacles to adoption.

When (relational) databases were introduced in the 1970s, using them required tweaking a lot of parameters and setting up indexes, and papers about databases were full of Greek symbols and mathematics that was hard for practitioners to understand. Only specialized DBAs (the data scientists of the 1970s) could operate a database. Since then, however, the database has been fully democratized, and today any toddler can install and use a relational database such as SQLite without any knowledge of mathematical theory such as relational algebra or implementation details like B-trees and indexes. The role of DBA is now in the same class of obsolete occupations as blacksmith and newspaper delivery boy.

For Machine Learning we are not there yet. The literature on Machine Learning especially is rife with obscure terminology and hard mathematics. However, things are changing fast. Countless hopeful startups and established companies like Google and Microsoft are offering machine learning as a service that brings the power of turning data into code to the masses. Within a few years, we believe that using machine learning will be as easy for practitioners as using databases is today. Just like databases are self-tuning, machine learning services and APIs will take care of all the gory details that currently prevent mass adoption of the technology by practitioners. We should also not forget that, just like query optimization, every Machine Learning algorithm is essentially a hack that uses heuristics to infer the best possible function from noisy data.

Gluing the future and the past using Rx

We have explained machine learning as the dual of pull-based querying of persisted data, but machine learning really shines when we use it to query streaming data. Relational databases leverage the fact that data is fairly static (no need to rebuild indexes) and has highly predictable statistics (used by the query optimizer). For streaming data, patterns are constantly changing over time (weather or seasons change, trending Twitter topics come and go, ...), so new queries must be generated continuously to reflect changes in the characteristics of the input streams. Streaming data is often generated by sensors, so it already starts out as an approximation, and it is used in non-critical situations: we are not talking about airplanes, cars, certain medical equipment, and nuclear reactors, but about home thermostats, personal health monitoring devices, Twitter and Facebook feeds, and so on. Hence it is not sufficiently valuable to justify expensive cleansing, normalization, and curation like mission-critical enterprise data in SQL databases. Lastly, the results of processing streaming data must be specialized for each individual user, so a "one-size-fits-all" approach or a fixed library of standard queries (calculate net salary and taxes for all employees) will not work.

When we use machine learning in the context of streaming data processing, we are learning the transition function $f \in (S, T) \to (S, R)$ of a Mealy automaton [http://en.wikipedia.org/wiki/Mealy_machine], or more generally, a hidden Markov model [http://en.wikipedia.org/wiki/Hidden_Markov_model], that allows the automaton to react to current events, from its (desired) behavior $[((S, T), (S, R))]$ on past events.
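The shape of this idea can be sketched in Scala. The "learner" below is deliberately trivial and is our own assumption: it memorizes the observed behavior in a lookup table and falls back to a caller-supplied default for unseen (state, event) pairs, whereas a real system would fit something like a hidden Markov model. The thermostat states, events, and outputs are hypothetical.

```scala
object MealyDemo {
  // Turn observed behaviour [((S, T), (S, R))] into a transition function
  // (S, T) => (S, R). Assumption: a lookup table stands in for a real learner.
  def learnTransition[S, T, R](examples: List[((S, T), (S, R))])(
      default: (S, T) => (S, R)
  ): (S, T) => (S, R) = {
    val table = examples.toMap
    (s, t) => table.getOrElse((s, t), default(s, t))
  }

  // Drive the automaton over a sequence of events, threading the state
  // through and collecting one output per event.
  def run[S, T, R](f: (S, T) => (S, R), start: S)(events: List[T]): List[R] =
    events
      .foldLeft((start, List.empty[R])) { case ((s, out), t) =>
        val (next, r) = f(s, t)
        (next, r :: out)
      }
      ._2
      .reverse

  // Hypothetical past behaviour of a home thermostat.
  val observed: List[((String, String), (String, String))] = List(
    (("idle", "cold"), ("heating", "on")),
    (("heating", "warm"), ("idle", "off"))
  )

  // Unseen situations keep the current state and emit "hold".
  val step: (String, String) => (String, String) =
    learnTransition(observed)((s: String, t: String) => (s, "hold"))

  val outputs: List[String] =
    run(step, "idle")(List("cold", "cold", "warm", "warm"))
}
```

The past events supply the table, while `run` applies the resulting function to events as they arrive, which is the gluing of past and future that this section describes.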

In the next sections we will have a closer look at a couple of machine learning services, such as the Google Prediction API, BigML, and TextRazor, to process streaming data from weather predictions and Google Calendar updates that tell us if we should play golf or not. Another interesting approach is to use genetic algorithms [http://en.wikipedia.org/wiki/Genetic_algorithm], like Eureqa, to produce a (syntactical) mathematical expression that mimics input data, but that is outside the scope of this paper.

Classification: Google Prediction API

The Google Prediction API is a service that (incrementally) accepts training data $[(T, R)]$ in the form of CSV files. Each row in the input file represents an example pair consisting of the serialized representation for type $T$, plus an answer $R$ that must be either categorical (a finite set represented by strings) or numeric. Once the model is trained, the prediction API acts as a remote implementation of a function of type $T \to R$ that, given a serialized value of type $T$, will produce a prediction of type $R$. The Google Prediction API does not specify which machine learning algorithms are used under the covers. It only allows users to specify the type of predictor as categorical or regression, or infers this from the type of the answers in the training data (when the answer type is numeric, it will use regression).

To train our model, we first pull historical data from Google Calendar of type Iterable[Event] and the NOAA historical weather data to obtain a list of example pairs of weather information and a boolean indicating whether we cancelled our golf appointment for that day or not. In pseudo code (the actual code you need to write is cluttered by non-essential OAuth authentication to connect to the Google API) this looks something like this:

    val trainingData: List[{val weather: Weather; val cancelled: Boolean}] =
      myCalendar.events(start = …, end = …)
        .filter(e => isAboutGolf(e.description))
        .map(e => new {
          val weather = noaa.lookup(e.location, e.start)
          val cancelled = e.status == "cancelled"
        }).toList

The function isAboutGolf is defined below using entity extraction, and is an example of unsupervised machine learning. Given this training data in the form of a list of example input/output pairs, we can submit it to the Google Prediction API to get back a function from Weather to Boolean:

    val play: Weather => Boolean = createModel(trainingData)

The actual prediction API obviously requires more elaborate code than this, but, in essence, that is precisely what it provides: a way to turn data into code.

Given the model, we switch to reacting to real-time weather updates of type Observable[Forecast], from which we generate a stream of events that we must cancel on our calendar:

    val shouldCancel: Observable[Event] =
      weatherUpdates.flatMap(forecast =>
        myCalendar
          .events(date = forecast.date)
          .filter(e => isAboutGolf(e.description) && !play(forecast))
      )
    shouldCancel.subscribe(e => e.status = "cancelled")

The big question is whether we should automatically cancel our golf appointments based on the generated model. In other words, how sure can we be that the model gives us the correct answer? For categorical models, the Google Prediction API (and in fact the other APIs we illustrate below) returns, in addition to the most likely answer, a sorted list of alternative answers in decreasing probability. Slightly simplified, the model is a representation of a function of type $T \to [R]$ that, given a value of type $T$, returns a discrete probability distribution for the possible answers in $R$.

In some sense this is a more realistic reflection of what is happening: "Machine Learning is statistical function inversion". The training data can contain multiple sample output values for the same input value, i.e. the training data $[(T, R)]$ is effectively of the form $[(T, [R])]$, and hence we expect the synthesized model to be a "non-deterministic" function of type $T \to [R]$. For databases, it is very common to have a query of the form $T \to [R]$, which is then "flat mapped" over the data to yield a result of type $[R]$, or using tupling into a result set of type $[(T, R)]$.

Learning by counting: Bayesian Classification

The conditional probability $\mathbb{P}(R \mid T)$ of $R$ given $T$ is really a different notation for a monadic function of type $T \to \mathbb{P}(R)$ that maps inputs $T$ to probability mass functions $\mathbb{P}(R)$ for the output. Naively, we can represent probability mass functions as (sorted) collections $\llbracket (R, \mathbb{R}) \rrbracket$ that associate a real-valued probability with each value in $R$. For practical applications, we need a more computationally efficient representation, for instance using sampling functions [http://www.keithschwarz.com/darts-dice-coins/], but for now the concrete representation using lists of pairs will be much more illustrative. We will use "special list brackets" to indicate that the sum of the probabilities is normalized to 1.

Given a probability mass function $pt \in \mathbb{P}(T)$ and a conditional probability $f \in \mathbb{P}(R \mid T) = T \to \mathbb{P}(R)$, we can use monadic bind, or flatMap as developers call it, to calculate a posterior probability mass function of type $\mathbb{P}(R)$, or using tupling $\mathbb{P}(T \times R)$, as follows:

$$pt.\mathrm{flatMap}(f, id) = \llbracket\, ((t, r),\ p_t \cdot p_r) \mid (t, p_t) \leftarrow pt,\ (r, p_r) \leftarrow f(t) \,\rrbracket$$

Bayes' Law tells us that, given an $f \in \mathbb{P}(R \mid T)$ and a prior $pt \in \mathbb{P}(T)$, the inverse of $f$ is equivalent to

$$f^{-1} \in R \to \mathbb{P}(T) = r' \Rightarrow \llbracket\, (t, p) \mid ((t, r), p) \leftarrow pt.\mathrm{flatMap}(f, id),\ r = r' \,\rrbracket$$

From a training set $training \in [(T, R)]$ we can calculate $pt \in \mathbb{P}(T)$ and $f \in \mathbb{P}(R \mid T)$ using (simple) counting, namely

$$pt = \llbracket\, (t,\ \#[\, t' \mid (t', r) \leftarrow training,\ t' = t \,]) \mid t \in T \,\rrbracket$$

and

$$f(t) = \llbracket\, (r,\ \#[\, r' \mid (t', r') \leftarrow training,\ t' = t,\ r' = r \,]) \mid r \in R \,\rrbracket$$

It could happen that the training data does not contain samples for every $t \in T$, but we can always pretend the training data is initialized with $(t, \ast)$ for each value $t \in T$, where $\ast$ matches any value in $R$, and hence compute

$$pt = \llbracket\, (t,\ \#[\, t' \mid (t', r) \leftarrow training,\ t' = t \,] + 1) \mid t \in T \,\rrbracket$$

and

$$f(t) = \llbracket\, (r,\ \#[\, r' \mid (t', r') \leftarrow training,\ t' = t,\ r' = r \,] + 1) \mid r \in R \,\rrbracket$$

Smoothing the counts like this (artificially) makes everything total and prevents probabilities for unknown values from becoming zero.
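The counting construction can be tried out with the list-of-pairs representation. The Scala sketch below rests on our own assumptions: the names `Pmf`, `normalize`, `joint`, `inverse`, and `learn` are hypothetical, the golf data is made up, and normalization is done explicitly where the text relies on the special list brackets.

```scala
object CountingBayes {
  // A probability mass function as a list of (value, probability) pairs.
  type Pmf[A] = List[(A, Double)]

  // Rescale the weights so that they sum to 1.
  def normalize[A](weights: List[(A, Double)]): Pmf[A] = {
    val total = weights.map(_._2).sum
    weights.map { case (a, w) => (a, w / total) }
  }

  // pt.flatMap(f, id): the joint distribution over pairs (t, r).
  def joint[T, R](pt: Pmf[T], f: T => Pmf[R]): Pmf[(T, R)] =
    for { (t, p) <- pt; (r, q) <- f(t) } yield ((t, r), p * q)

  // Bayes' law: keep the joint entries whose result is r, then renormalize.
  def inverse[T, R](pt: Pmf[T], f: T => Pmf[R]): R => Pmf[T] =
    r => normalize(joint(pt, f).collect { case ((t, r2), p) if r2 == r => (t, p) })

  // Estimate pt and f by counting, adding 1 to every count (smoothing).
  // ts and rs enumerate the possible inputs and outputs.
  def learn[T, R](training: List[(T, R)], ts: List[T], rs: List[R]): (Pmf[T], T => Pmf[R]) = {
    val pt = normalize(ts.map(t => (t, training.count(_._1 == t) + 1.0)))
    val f = (t: T) => normalize(rs.map(r => (r, training.count(_ == (t, r)) + 1.0)))
    (pt, f)
  }

  // Hypothetical history: the weather and whether we played or cancelled.
  val training: List[(String, String)] = List(
    ("sunny", "play"), ("sunny", "play"), ("sunny", "play"),
    ("sunny", "cancel"), ("rainy", "cancel"), ("rainy", "cancel")
  )

  val (pt, f) = learn(training, List("sunny", "rainy"), List("play", "cancel"))

  // Given that we cancelled, how likely was each kind of weather?
  val posterior: Pmf[String] = inverse(pt, f)("cancel")
}
```

Here the smoothed prior is 5/8 for "sunny" and 3/8 for "rainy", and although "rainy" was never paired with "play" in the data, `f("rainy")` still assigns it probability 1/4 instead of zero.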

In the real world, people happily use approximate answers to watch movies, buy books, find partners, etc. without any problems. But programming languages are stuck in the 1950s and demand exact answers like true and false. We should embrace computing with uncertainty instead of expecting exact answers. In fact, floating point numbers are a prime example of computing with approximations that, most of the time, we don't even think about. The development of novel probabilistic programming languages should be an important focus for programming language researchers, instead of arguing about trivialities such as static versus dynamic types.

Entity Extraction: TextRazor

In the example code above we used an as yet undefined function isAboutGolf: String => Boolean. To implement this function we will use the "topic tagging" feature of the TextRazor API, which, given a text string, creates a list of topics that it infers the text is about.

    val isAboutGolf: String => Boolean = input =>
      textRazor.topics(input).contains("golf")

Again, the actual code is a little more involved, but, in essence, it is as simple as this. If we submit the text "Hey Erik, fancy a few rounds of golf next week?", TextRazor will identify the word "golf" as referring to http://www.freebase.com/m/037hz.

Clustering: BigML

From a developer perspective, the BigML machine learning API is very similar to the Google Prediction API: you submit training data as a CSV file, train the model, and then query the model by sending it inputs to receive outputs. Where BigML differs is that it also provides a downloadable representation of the trained model in various forms, including as source code in various languages. BigML also offers more choice of learning algorithms, including clustering.


[[TODO: Code sample for https://bigml.com/developers/clusters]]

Conclusion

Sooner than we expect, we will all be familiar with the concept of machine learners. For most developers, a machine learner will be a magic black box to which they can send training data and that, as a response, will return a function (model) synthesized from the training data, based on the algorithms hidden inside the machine learner. Just as the democratization of databases in past decades has automated and hidden all complexity behind easy-to-use interfaces, the same is happening for machine learners, except at a much more rapid pace. Just as no practitioner in a sane state of mind will implement their own database (rather leaving this to specialized companies), no one will implement their own machine learning algorithms either (but leave this to specialized companies). Easy access to machine learners will enable a new wave of smart applications that are highly specialized for each user, especially applications that work over streaming and (human) real-time data. The future of big data is Machine Learning.
