Today, omnipresent sensors continuously provide streaming data about the environments in which they operate. For instance, a typical monitoring and analysis system may use streaming data generated by sensors to monitor the status of a particular device and to make predictions about its future behaviour, or to diagnostically infer the most likely system configuration that has produced the observed data. Sources of streaming data with even a modest updating frequency can produce extremely large volumes of data, making efficient and accurate data analysis and prediction difficult. One of the main challenges is handling uncertainty in the data: principled methods and algorithms for dealing with uncertainty in massive data applications are required. Probabilistic graphical models (PGMs) provide a well-founded and principled approach for performing inference and belief updating in complex domains endowed with uncertainty. The ongoing EU-FP7 research project AMIDST (Analysis of MassIve Data STreams, http://www.amidst.eu) aims at producing scalable methods, based on Bayesian network technology, for handling massive data streams. All of the developed methods will be made available through a software suite composed of the HUGIN software (http://amidst.hugin.com) and the open source AMIDST Toolbox. Meanwhile, the R statistical package (http://www.cran.r-project.org) has become a widespread standard for data manipulation and statistical analysis.
The main goal of the tutorial is to show how R and the AMIDST toolbox can be linked to assist in the complete lifecycle of data stream processing, from exploratory analysis to probabilistic inference. To achieve this goal, several existing R packages will be used, and the Ramidst package will be introduced to the community.
More info: http://simd.albacete.org/caepia15/conferencia/tutoriales/analysis-of-massive-data-streams-using-r/
Analysis of Massive Data Streams Using R (CAEPIA 2015)
1. Analysis of Massive Data Streams Using R
Antonio Salmerón (1), Helge Langseth (2), Anders L. Madsen (3,4), Thomas D. Nielsen (4)
(1) Dept. Mathematics, University of Almería, Spain
(2) Dept. Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway
(3) Hugin Expert A/S, Aalborg, Denmark
(4) Dept. Computer Science, Aalborg University, Denmark
CAEPIA 2015, Albacete, November 9, 2015
2. Outline
1. Introduction
   o Data streams
   o Challenges when processing data streams
   o Why Bayesian networks?
   o The AMIDST project
2. Bayesian networks
   o Static and dynamic models
   o Inference and learning
3. Exploratory analysis
   o Exploratory time series analysis in R
   o Report generation: LaTeX + R
4. The Ramidst package
   o The AMIDST toolbox
   o Using the AMIDST toolbox from R
4. Data Streams everywhere
• Unbounded flows of data are generated daily:
  • Social networks
  • Network monitoring
  • Financial/banking industry
  • …
5. Data Stream Processing
• Processing data streams is challenging:
  – They do not fit in main memory
  – Continuous model updating
  – Continuous inference/prediction
  – Concept drift
6. Processing Massive Data Streams
• Scalability is a main issue:
  • Scalable computing infrastructure
  • Scalable models and algorithms
7. Why Bayesian networks?
§ Example:
  § Stream of sensor measurements about temperature and smoke presence in a given geographical area.
  § The stream is analysed to detect the presence of fire (an event detection problem).
8. Why Bayesian networks?
§ The problem can be approached as an anomaly detection task (outliers).
§ A commonly used method is streaming k-means (a minimal sketch of the idea is given below).
[Figure: sensor readings over time with one point labelled "Anomaly".]
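The following is a minimal sketch, in base R, of the idea behind streaming k-means for anomaly detection; it is not the MOA/AMIDST implementation, and the distance threshold, initialisation and step-size schedule are illustrative assumptions.

# Online (MacQueen-style) k-means update over a stream, flagging points
# that are far from every centroid as anomalies.
streaming_kmeans <- function(stream, k = 2, threshold = 3) {
  centroids <- stream[1:k, , drop = FALSE]   # initialise with the first k points
  counts <- rep(1, k)
  anomalies <- integer(0)
  for (i in (k + 1):nrow(stream)) {
    x <- stream[i, ]
    d <- sqrt(colSums((t(centroids) - x)^2)) # distance to each centroid
    j <- which.min(d)
    if (d[j] > threshold) anomalies <- c(anomalies, i)  # flag outlier
    counts[j] <- counts[j] + 1
    centroids[j, ] <- centroids[j, ] + (x - centroids[j, ]) / counts[j]  # online update
  }
  list(centroids = centroids, anomalies = anomalies)
}

set.seed(1)
sensor <- cbind(temp = rnorm(500, 20, 1), smoke = rnorm(500, 0.1, 0.05))
sensor[250, ] <- c(45, 0.9)                  # injected anomaly
streaming_kmeans(sensor, k = 2)$anomalies

Each point updates the nearest centroid with a decreasing step size, so the whole stream never needs to be kept in main memory.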
9. Why Bayesian networks?
§ Often, data streams are handled using black-box models:
  § Pros:
    § No need to understand the problem
  § Cons:
    § Hyper-parameters to be tuned
    § Black-box models can seldom explain away
[Diagram: Stream → Black-box model → Predictions]
10. Why Bayesian networks?
§ Bayesian networks:
  § Open-box models
  § Encode prior knowledge.
  § Continuous and discrete variables (CLG networks).
§ Example:
[Figure: network over Fire, Temp, Smoke and the sensor readings T1, T2, T3, S1, used to compute p(Fire = true | t1, t2, t3, s1).]
13. The AMIDST project
§ FP7-funded EU project
§ Large number of variables
§ Data arriving in streams
§ Based on hybrid Bayesian networks
§ Open source toolbox with learning and inference capabilities
§ Two use cases provided by industrial partners:
  § Prediction of maneuvers in highway traffic (Daimler)
  § Risk prediction in credit operations and customer profiling (BCC)
§ http://www.amidst.eu
[Poster excerpt (Daimler use case, in Spanish; text partially truncated): modelling with hybrid dynamic Bayesian networks; results obtained in the prediction of traffic maneuvers; two-time-slice dynamic Bayesian networks and Markov models; analysing the trend of the probability makes it possible to predict maneuvers earlier than with other methods; dynamic Bayesian networks with approximate inference are a suitable tool for this problem, and the AMIDST toolbox supports the required real-time analysis.]
15. Definition
§ Formally, a Bayesian network consists of
  § a directed acyclic graph (DAG) where each node is a random variable, and
  § a set of conditional probability distributions, one for each variable conditional on its parents in the DAG.
§ For a set of variables X = {X_1, ..., X_N}, the joint distribution factorizes as

  p(X) = ∏_{i=1}^{N} p(X_i | Pa(X_i)),

  where Pa(X_i) ⊂ X \ {X_i} represents the so-called parent variables of X_i, i.e., the nodes with an arc pointing to X_i in the DAG.
§ The factorization allows local computations (a toy example in R follows below).
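As a toy illustration of the factorization, the following base R sketch evaluates the joint of a hypothetical two-node network Fire → Smoke (the probability values are made up) and obtains a posterior by conditioning the factorized joint.

# Minimal sketch: p(Fire, Smoke) = p(Fire) * p(Smoke | Fire) for a toy network.
p_fire  <- c(yes = 0.01, no = 0.99)                    # p(Fire)
p_smoke <- rbind(yes = c(yes = 0.90, no = 0.10),       # p(Smoke | Fire = yes)
                 no  = c(yes = 0.05, no = 0.95))       # p(Smoke | Fire = no)

joint <- sweep(p_smoke, 1, p_fire, "*")                # joint[fire, smoke]

# Posterior p(Fire | Smoke = yes), computed locally from the factorized joint
joint[, "yes"] / sum(joint[, "yes"])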
16. Reading independencies
Independence relations can be read off from the structure.
There are three types of connections:
§ Serial: A → B → C
§ Diverging: A ← B → C
§ Converging: A → B ← C
17. Reading independencies. Example
[Figure: network over Fire, Temp, Smoke and the sensor readings T1, T2, T3, S1.]
• Knowing the temperature with certainty makes the temperature sensor readings and the event of fire independent.
• The smoke sensor reading is also irrelevant to the event of fire if Smoke is known for sure.
18. Reading independencies. Example
[Figure: same network, extended with a Sun variable.]
• Knowing the temperature with certainty makes the temperature sensor readings and the event of fire independent.
• The smoke sensor reading is also irrelevant to the event of fire if Smoke is known for sure.
• If there is no info about Temp or the sensor readings, Sun and Fire are independent.
19. Hybrid Bayesian networks
• In a hybrid Bayesian network, discrete and continuous variables coexist.
• Mixtures of truncated basis functions (MoTBFs) have been successfully used in this context (Langseth et al. 2012):
  • Mixtures of truncated exponentials (MTEs)
  • Mixtures of polynomials (MoPs)
• MoTBFs support efficient inference and learning in a static setting.
• Learning from streams is more problematic: the reason is that MoTBFs do not belong to the exponential family.
20. The exponential family
• A family of probability (density or mass) functions {f(x; θ) : θ ∈ Θ ⊆ R^k} belongs to the k-parametric exponential family if it can be expressed as

  f(x; θ) = exp{ Σ_{i=1}^{k} Q_i(θ) T_i(x) + D(θ) + S(x) },

  or, equivalently, f(x; θ) = H(x) C(θ) exp{Q(θ) T(x)}.
• The T_i functions are the sufficient statistics for the unknown parameters, i.e., they contain all the information in the sample that is relevant for estimating the parameters.
• They have dimension 1.
• We can therefore "compress" all the information in the stream seen so far into a single number for each parameter (see the sketch below).
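As a simple illustration of this "compression", the following base R sketch keeps only the count and the sufficient statistics sum(x) and sum(x^2) of a simulated Gaussian stream, and re-estimates the parameters from those three numbers at any time; the raw stream itself is never stored.

# Minimal sketch: streaming sufficient statistics for a Gaussian.
suff <- c(n = 0, s = 0, s2 = 0)

update_suff <- function(suff, x) {
  suff + c(n = length(x), s = sum(x), s2 = sum(x^2))
}

estimate <- function(suff) {
  m <- suff[["s"]] / suff[["n"]]
  v <- suff[["s2"]] / suff[["n"]] - m^2    # ML estimate of the variance
  c(mean = m, var = v)
}

set.seed(7)
for (batch in 1:100) {                     # simulate 100 incoming mini-batches
  suff <- update_suff(suff, rnorm(1000, mean = 20, sd = 2))
}
estimate(suff)                             # close to mean 20, variance 4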
21. Hybrid Bayesian networks. CLGs
Conditional Linear Gaussian networks
A Conditional Linear Gaussian (CLG) network is a hybrid Bayesian network where
• the conditional distribution of each discrete variable X_D given its parents is a multinomial, and
• the conditional distribution of each continuous variable Z with discrete parents X_D and continuous parents X_C is

  p(z | X_D = x_D, X_C = x_C) = N(z; α(x_D) + β(x_D)ᵀ x_C, σ(x_D))

  for all x_D and x_C, where α and β are the coefficients of a linear regression model of Z given X_C, potentially different for each configuration of X_D.
CLGs belong to the exponential family.
22. CLGs: Example
[Figure: network over the discrete variables Y, S and the continuous variables W, T, U.]
P(Y) = (0.5, 0.5)
P(S) = (0.1, 0.9)
f(w | Y = 0) = N(w; 1, 1)
f(w | Y = 1) = N(w; 2, 1)
f(t | w, S = 0) = N(t; w, 1)
f(t | w, S = 1) = N(t; w, 1)
f(u | w) = N(u; w, 1)
(A forward-sampling sketch in R follows below.)
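A minimal forward-sampling sketch of this example in base R, taking the densities exactly as written above (note that both S-configurations of f(t | w, S) share the same parameters in the extracted slide); reading N(x; m, 1) as a normal with mean m and unit standard deviation, and P(S) = (0.1, 0.9) as P(S = 0) = 0.1, P(S = 1) = 0.9, are assumptions.

simulate_clg <- function(n) {
  Y <- rbinom(n, 1, 0.5)                       # P(Y) = (0.5, 0.5)
  S <- rbinom(n, 1, 0.9)                       # assumption: P(S = 1) = 0.9
  W <- rnorm(n, mean = ifelse(Y == 0, 1, 2), sd = 1)
  T <- rnorm(n, mean = W, sd = 1)              # same regression for S = 0 and S = 1
  U <- rnorm(n, mean = W, sd = 1)
  data.frame(Y, S, W, T, U)
}

set.seed(42)
head(simulate_clg(5))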
23. CLGs: Example
[Same network and distributions as in the previous slide.]
§ Limitation: discrete nodes are not allowed to have continuous parents.
§ This is not a big problem for Bayesian classifiers.
24. Bayesian network classifiers
§ The structure is usually restricted.
§ There is a distinguished (discrete) variable called the class, while the rest are called features.
§ Examples:
[Figure 1: Structure of naive Bayes (a) and tree-augmented network (TAN) (b) classifiers.]
In general, there are several possible TAN structures for a given set of variables. The way to choose among them is to construct a maximum weight spanning tree containing the features, where the weight of each edge is the conditional mutual information between the two features given the class.
25. Bayesian network classifiers
§ The class value is determined as follows: an object with observed features x_1, ..., x_n is classified as belonging to the class

  c* = arg max_{c ∈ Ω_C} p(c | x_1, ..., x_n),

  where Ω_C denotes the set of all possible values of C.
§ Since p(c | x_1, ..., x_n) is proportional to p(c) × p(x_1, ..., x_n | c), an n-dimensional distribution would in principle have to be specified, whose number of parameters is, in the worst case, exponential in the number of variables. Networks with fixed or restricted structures are therefore used for classification tasks.
§ In the case of Naïve Bayes, all feature variables are conditionally independent given C, so that

  p(c | x_1, ..., x_n) ∝ p(c) ∏_{i=1}^{n} p(x_i | c),

  which means that, instead of one n-dimensional conditional density, only n one-dimensional conditional densities must be estimated. In TAN models, more dependencies are allowed. (A small R sketch follows below.)
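The following base R sketch is a hand-rolled Gaussian naive Bayes illustrating p(c | x_1, ..., x_n) ∝ p(c) ∏ p(x_i | c); it is not the AMIDST implementation, and it simply uses the built-in iris data set for illustration.

train_nb <- function(x, y) {
  list(prior = table(y) / length(y),
       mean  = sapply(split(x, y), colMeans),
       sd    = sapply(split(x, y), function(d) apply(d, 2, sd)))
}

predict_nb <- function(model, xnew) {
  scores <- sapply(colnames(model$mean), function(cl) {
    log(model$prior[[cl]]) +
      sum(dnorm(unlist(xnew), model$mean[, cl], model$sd[, cl], log = TRUE))
  })
  names(which.max(scores))    # arg max over classes of log p(c) + sum_i log p(x_i | c)
}

fit <- train_nb(iris[, 1:4], iris$Species)
predict_nb(fit, iris[15, 1:4])   # a setosa observation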
26. Reasoning over time: Dynamic Bayesian networks
§ Temporal reasoning can be accommodated within BNs.
§ Variables are indexed over time, giving rise to dynamic Bayesian networks.
§ We have to model the joint distribution over time.
§ Dynamic BNs reduce the factorization complexity by adopting the Markov assumption.

Similarly to static BNs, we model our problem/system using a set of stochastic random variables, denoted X_t, with the main difference that variables are indexed here by a discrete time index t; we always assume that the system is described at a fixed frequency and use X_{a:b} ≡ X_a, X_{a+1}, ..., X_b to denote the set of variables between two time points a and b. For reasoning over time, we need to model the joint probability p(X_{1:T}), which has the following natural cascade decomposition:

  p(X_{1:T}) = ∏_{t=1}^{T} p(X_t | X_{1:t−1}),

where p(X_t | X_{1:t−1}) is equal to p(X_1) for t = 1. As t increases, the conditional probability p(X_t | X_{1:t−1}) becomes intractable. Similarly to static BNs, dynamic BNs use a more compact factorization of the above joint probability. The first kind of conditional independence assumption encoded by DBNs to reduce the factorization complexity is the well-known Markov assumption: the current state is independent of the past given a finite number of previous steps, and the resulting models are referred to as Markov chains. A Markov chain can be defined over either discrete or continuous variables X_{1:T} and exploits the following equality:

  p(X_t | X_{1:t−1}) = p(X_t | X_{t−V:t−1}),

where V ≥ 1 is the order of the Markov chain.
27. Reasoning over time: Dynamic Bayesian networks
§ DBN assuming a third-order Markov assumption
§ DBN assuming a first-order Markov assumption
[Figure 3.3: An example of DBNs assuming a third-order (above) and a first-order (below) Markov property.]
A first-order Markov assumption can be unrealistic in some problems, leading to poor approximations of the distribution; the Markov order can be increased to improve the approximation.
28. Particular cases of Dynamic Bayesian networks
§ Hidden Markov models
§ The joint distribution of the hidden (X) and observed (Y) variables is

  P(X_{1:T}, Y_{1:T}) = ∏_{t=1}^{T} P(X_t | X_{t−1}) P(Y_t | X_t).

[Figure 3.4: An example of a BN structure corresponding to a HMM.]
Although most of our models will fit into this description of observed and hidden (state) variables, there will be cases in which the transition model takes place in the observed variables, which in general simplifies the learning and inference processes of the problem. (A simulation sketch in R follows below.)
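A minimal sketch of forward simulation from a two-state HMM with Gaussian emissions, following P(X_{1:T}, Y_{1:T}) = ∏_t P(X_t | X_{t−1}) P(Y_t | X_t); the transition matrix and emission parameters are illustrative assumptions.

simulate_hmm <- function(T_len, A, init, means, sds) {
  x <- numeric(T_len); y <- numeric(T_len)
  x[1] <- sample(1:2, 1, prob = init)
  y[1] <- rnorm(1, means[x[1]], sds[x[1]])
  for (t in 2:T_len) {
    x[t] <- sample(1:2, 1, prob = A[x[t - 1], ])   # P(X_t | X_{t-1})
    y[t] <- rnorm(1, means[x[t]], sds[x[t]])       # P(Y_t | X_t)
  }
  data.frame(state = x, obs = y)
}

A <- rbind(c(0.95, 0.05),
           c(0.10, 0.90))
set.seed(3)
head(simulate_hmm(200, A, init = c(0.5, 0.5), means = c(0, 3), sds = c(1, 1)))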
29. Particular cases of Dynamic Bayesian networks
§ Input-output hidden Markov models
§ Linear dynamic systems: switching Kalman filter

An extension of the HMM is the so-called input-output hidden Markov model (IOHMM), shown in Figure 3.5. The IOHMM incorporates an extra top layer of input variables Y'_{1:T}, which can be either continuous or discrete; the existing HMM layer of observed variables, Y_{1:T}, is referred to as the output set of variables.

[Figure 3.5: An example of a BN structure corresponding to an IO-HMM.]

The IOHMM is usually employed in supervised classification problems, where both input and output variables are known during training but only the former is known during testing; inference is then performed to predict the output variables at each time step. In AMIDST we use this model in a different way: both sets of input and output variables are always known, so inference is only performed to predict the latent variables, and the input variables Y'_{1:T} are introduced as a way to "relax" the stationarity assumption by explicitly introducing a dependency on some observed information at each time slice.

Similar to the extension of the static BN model to hybrid domains, DBNs have likewise been extended to continuous and hybrid domains. In purely continuous domains, where the continuous variables follow linear Gaussian distributions, the DBN corresponds to (a factorized version of) a Kalman filter (KF). The structure of a KF is exactly the same as the one displayed in Figure 3.4 for the HMM, with the restriction that all variables are continuous; the state variables can be a combination of continuous variables with different dependences, and the dynamics of the process are assumed to be linear. When modelling non-linear domains, the dynamics and observational distributions are often approximated through, e.g., the extended Kalman filter, which models the system as locally linear in the mean of the current state distribution. Another type of model ensuring non-linear predictions with a more expressive representation is the switching Kalman filter (SKF): the type of SKF considered here includes an extra discrete state variable that is able to use a weighted combination of the linear sub-models, i.e., the discrete state variable assigns a probability to each linear term in the mixture, hence representing the belief state as a mixture of Gaussians. In this way it can deal, to some extent, with violations of both the assumption of linearity and of Gaussian noise.

[Figure 3.6: An example of a switching Kalman filter. Z_t represents the discrete state.]
30. Two-time slice Dynamic Bayesian networks (2T-DBN)
§ They constitute the main dynamic model in AMIDST.
§ The transition distribution is

  p(X_{t+1} | X_t) = ∏_{X_{t+1} ∈ X_{t+1}} p(X_{t+1} | Pa(X_{t+1})),

  where Pa(X_{t+1}) refers to the set of parents of the variable X_{t+1} in the transition model.

In general, DBNs can model arbitrary distributions over time. However, in AMIDST we especially focus on the so-called two-time slice DBNs (2T-DBNs), characterised by an initial model representing the initial joint distribution of the process and a transition model representing a standard BN repeated over time. This kind of DBN model satisfies both the first-order Markov assumption and the stationarity assumption.

[Figure 3.7: An example of a BN structure corresponding to a 2T-DBN.]
31. Inference in CLG networks
§ There are three ways of querying a BN:
  § Belief updating (probability propagation)
  § Maximum a posteriori (MAP)
  § Most probable explanation (MPE)
32. Inference in CLG networks
Querying a Bayesian network (I)
• Probabilistic inference: computing the posterior distribution of a target variable:

  p(x_i | x_E) = ( Σ_{x_D} ∫_{x_C} p(x, x_E) dx_C ) / ( Σ_{x_{D_i}} ∫_{x_{C_i}} p(x, x_E) dx_{C_i} )
33. Inference in CLG networks
Querying a Bayesian network (II)
• Maximum a posteriori (MAP): for a set of target variables X_I, the goal is to compute

  x*_I = arg max_{x_I} p(x_I | X_E = x_E),

  where p(x_I | X_E = x_E) is obtained by first marginalizing out from p(x) the variables not in X_I and not in X_E.
• Most probable explanation (MPE): a particular case of MAP where X_I includes all the unobserved variables.
34. Probability propagation in CLG networks: Importance sampling
• Let θ denote the numerator of the posterior for the target variable, i.e. θ = ∫_a^b h(x_i) dx_i, with

  h(x_i) = Σ_{x_D ∈ Ω_{X_D}} ∫_{x_C ∈ Ω_{X_C}} p(x; x_E) dx_C.

• Then, we can write θ as

  θ = ∫_a^b h(x_i) dx_i = ∫_a^b ( h(x_i) / p*(x_i) ) p*(x_i) dx_i = E_{p*}[ h(X*_i) / p*(X*_i) ],

  where p* is a probability density function on (a, b) called the sampling distribution, and X*_i is a random variable with density p*.
• Therefore, we have transformed the problem of probability propagation into estimating the expected value of a random variable from a sample drawn from a distribution of our own choice.
35. Probability propagation in CLG networks: Importance sampling
• The expected value can be estimated using a sample mean estimator. Let X*_i^(1), ..., X*_i^(m) be a sample drawn from p*. Then a consistent, unbiased estimator of θ is given by

  θ̂_1 = (1/m) Σ_{j=1}^{m} h(X*_i^(j)) / p*(X*_i^(j)).

• Since θ̂_1 is unbiased, the error of the estimation is determined by its variance:

  Var(θ̂_1) = Var( (1/m) Σ_{j=1}^{m} h(X*_i^(j)) / p*(X*_i^(j)) ) = (1/m²) Σ_{j=1}^{m} Var( h(X*_i^(j)) / p*(X*_i^(j)) ).

• In AMIDST, the sampling distribution is formed by the conditional distributions in the network (evidence weighting). (A small numerical sketch follows below.)
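The following base R sketch illustrates the estimator θ̂_1 on a simple one-dimensional integral; the target function h and the uniform sampling distribution are arbitrary illustrative choices, not AMIDST network quantities.

# Minimal sketch: importance sampling for theta = integral of h over (a, b).
a <- 0; b <- 3
h <- function(x) exp(-x^2 / 2)          # unnormalised target
m <- 1e5

xs <- runif(m, a, b)                    # sample from p*(x) = 1 / (b - a)
w  <- h(xs) / dunif(xs, a, b)           # h(X*) / p*(X*)
mean(w)                                 # the estimator theta_hat_1
sqrt(var(w) / m)                        # Monte Carlo standard error
integrate(h, a, b)$value                # reference value for comparison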
36. Probability propagation in CLG networks: Importance sampling
[Diagram: the input stream and the sampling distribution are distributed (map phase) across computational units (C.U.) performing sample generation; the sufficient statistics are combined in the reduce phase.]
37. Probability propagation in CLG networks: Importance sampling
[Chart: response for an input stream with a network of 500 variables.]
38. Probability propagation in CLG networks: Importance sampling
[Chart: response for an input stream with a network of 10 variables.]
39. MAP in CLG networks
MAP is similar to probability propagation but:
• First marginalize out by sum/integral (sum phase)
• Then maximize (max phase)
Constrained order → higher complexity
40. MAP in CLG networks
MAP in the AMIDST Toolbox:
• Hill climbing (global and local change)
• Simulated annealing (global and local change)
• Sampling
41. MAP in CLG networks
[Diagram: the stream and sampling distribution are distributed (map phase) across computational units (C.U.) started from multiple starting points; the local solutions are combined in the reduce phase.]
42. Inference in dynamic networks
Inference in DBNs faces the problem of entanglement: all variables used to encode the belief state at time t = 2 become dependent after observing {e0, e1, e2}.
43. Inference in dynamic networks
• Inference in DBNs is approached following a Bayesian formulation + Variational Bayes.
• Variational message passing is based on the variational approximation to a posterior distribution p(x_I), which is defined as

  q*(x_I) = arg min_{q ∈ Q} D(q(x_I) || p(x_I)),

  where D(q || p) is the KL divergence from q to p. (A small numerical example of the KL objective follows below.)
• An alternative is to focus on D(p(x_I) || q(x_I)), which corresponds to expectation propagation.
• The optimal variational distribution is computed iteratively.
• Factored frontier, which assumes independence of the nodes connecting to the past given the observations.
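As a small numerical illustration of the KL objective D(q || p) (not of the AMIDST message-passing scheme itself), the sketch below evaluates the closed-form KL divergence between two univariate Gaussians and shows that it vanishes when q equals p.

# Minimal sketch: D(q || p) for q = N(m_q, s_q^2), p = N(m_p, s_p^2), using
# the closed form log(s_p/s_q) + (s_q^2 + (m_q - m_p)^2) / (2 s_p^2) - 1/2.
kl_gauss <- function(m_q, s_q, m_p, s_p) {
  log(s_p / s_q) + (s_q^2 + (m_q - m_p)^2) / (2 * s_p^2) - 0.5
}

kl_gauss(0, 1, 0, 1)       # identical distributions: 0
kl_gauss(0, 1, 2, 1.5)     # a positive divergence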
44. Learning CLG networks from data
§ Learning the structure
  § Methods based on conditional independence tests
  § Score based techniques
§ Estimating the parameters
  § Bayesian approach
  § Frequentist approach (maximum likelihood)
45. Learning CLG networks from data
§ Bayesian parameter learning:
  § Parameters are considered random variables rather than fixed quantities.
  § A prior distribution is assigned to the parameters, representing the state of knowledge before observing the data.
  § The prior is updated in the light of new data.
§ The Bayesian framework naturally deals with data streams (a small streaming example follows below):

  p(θ | d_1, ..., d_n, d_{n+1}) ∝ p(d_{n+1} | θ) p(θ | d_1, ..., d_n)
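A minimal sketch of the recursive update on a stream, using a conjugate Beta prior on the success probability of Bernoulli observations (the data are simulated for illustration): the posterior after each data point becomes the prior for the next one.

a <- 1; b <- 1                       # Beta(1, 1) prior
set.seed(11)
stream <- rbinom(10000, 1, 0.3)      # incoming 0/1 observations

for (d in stream) {                  # p(theta | d_1..d_{n+1}) ∝ p(d_{n+1} | theta) p(theta | d_1..d_n)
  a <- a + d
  b <- b + (1 - d)
}
c(posterior_mean = a / (a + b))      # close to the true value 0.3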
46. Learning CLG networks from data
Parameter learning by inference. Simple example:
• Random walk over Y1, Y2, ...
• f(y_t | y_{t−1}) ~ N(y_{t−1}, τ⁻¹)
• The precision τ is unknown.
[Figure: chain Y1 → Y2 → Y3 → Y4 → Y5.]
47. Learning CLG networks from data
Parameter learning by inference. Simple example:
• Random walk over Y1, Y2, ..., with f(y_t | y_{t−1}) ~ N(y_{t−1}, 1/τ) and precision τ unknown.
[Figure: chain Y1 → … → Y5 with the precision τ (and its hyperparameters) as an additional parent.]
The Bayesian solution:
• Model unknown parameters as random variables.
• Use Bayes' formula with "clever" distribution families:

  f(τ | y_{1:T}, a, b) = f(τ | a, b) ∏_{t=1}^{T} f(y_t | y_{t−1}, τ) / f(y_{1:T} | a, b).

Efficient inference leads to efficient learning!
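A minimal sketch of this example in base R, assuming (as a "clever" conjugate choice, not stated explicitly on the slide) a Gamma(a, b) prior on τ with shape a and rate b; the posterior is then Gamma(a + (T − 1)/2, b + Σ (y_t − y_{t−1})² / 2).

set.seed(5)
true_tau <- 4                                        # i.e. sd = 0.5
T_len <- 5000
y <- cumsum(rnorm(T_len, sd = sqrt(1 / true_tau)))   # random walk: y_t = y_{t-1} + noise

a <- 1; b <- 1                                       # prior hyper-parameters
diffs <- diff(y)
a_post <- a + (T_len - 1) / 2
b_post <- b + sum(diffs^2) / 2

c(posterior_mean_tau = a_post / b_post)              # close to the true precision 4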
51. Exploratory analysis
§ Exploratory analysis helps us in testing model assumptions.
§ It also improves the modeler's knowledge about the problem and its nature.
§ Dynamic Bayesian networks aim at modeling complex time correlations.
52. Sample correlogram
§ Let x_1, ..., x_T be a univariate time series. The sample autocorrelation coefficient at lag v is given by

  ρ̂_v = Σ_{t=1}^{T−v} (x_t − x̄)(x_{t+v} − x̄) / Σ_{t=1}^{T} (x_t − x̄)²,

  where x̄ is the sample mean and T is the total length of the considered data.
§ It represents Pearson's correlation coefficient between the series {x_t}_{t ∈ {1,...,T}} and {x_{t+v}}_{t+v ∈ {1,...,T}}.
§ The sample correlogram is the plot of the sample autocorrelation ρ̂_v versus v, for v = 1, ..., M, for some maximum lag M.
§ Sample correlograms can be interpreted as a way to measure the strength of the unconditional dependences X_t ⊥̸ X_{t+v} for some lag v ≥ 1: when ρ̂_v is close to zero there is a strong unconditional independence between X_t and X_{t+v}, whereas values close to 1 or −1 indicate a strong dependence. (See the R sketch below.)
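A minimal sketch of the sample correlogram in base R with acf(), on i.i.d. data and on an autocorrelated series (an AR(1) process simulated with arima.sim); the AR coefficient is an illustrative assumption.

set.seed(9)
x_iid <- rnorm(500)
x_ts  <- arima.sim(model = list(ar = 0.8), n = 500)

op <- par(mfrow = c(1, 2))
acf(x_iid, lag.max = 30, main = "Correlogram, i.i.d. data")
acf(x_ts,  lag.max = 30, main = "Correlogram, AR(1) data")
par(op)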
53. Sample correlogram for independent data
[Figure (a): correlogram for i.i.d. data.]
54. Sample correlogram for time correlated data
[Figure (b): correlogram for a time series data.]
55. Sample partial correlogram
§ Consider the regression model

  X_t = a_0 + a_1 X_{t−1} + a_2 X_{t−2} + ... + a_{v−1} X_{t−v+1}

§ Let e_{t,v} denote the residuals of this regression problem (i.e., the errors made when estimating X_t using a linear combination of the v − 1 previous observations).
§ The sample partial auto-correlation coefficient of lag v, denoted θ̂_v, is the standard sample auto-correlation between the series {x_{t−v}}_{t−v ∈ {1,...,T}} and {e_{t,v}}_{t ∈ {1,...,T}}.
§ It can be seen as the correlation between X_t and X_{t−v} after having removed the common linear effect of the data in between. (See the pacf() sketch below.)
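A minimal sketch of sample partial correlograms in base R with pacf(), on the same kind of i.i.d. and AR(1) series used in the correlogram sketch above.

set.seed(9)
x_iid <- rnorm(500)
x_ts  <- arima.sim(model = list(ar = 0.8), n = 500)

op <- par(mfrow = c(1, 2))
pacf(x_iid, lag.max = 30, main = "Partial correlogram, i.i.d. data")
pacf(x_ts,  lag.max = 30, main = "Partial correlogram, AR(1) data")
par(op)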
56. Sample partial correlogram for independent data
[Figure (c): partial correlogram for i.i.d. data.]
57. Sample partial correlogram for time correlated data
[Figure (d): partial correlogram for a time series data.]
58. Bivariate contour plots
The bivariate contour plot for the time series data shows that X_t and X_{t+1} seem to be distributed according to a bivariate normal with a covariance matrix that displays a strong degree of correlation. In the case of i.i.d. data, the bivariate contour plot does not reveal any temporal dependence between X_t and X_{t−1}.
[Figure 3.9: Bivariate contour plots for a set of i.i.d. and time series data.]
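A minimal sketch of such a plot in R: a bivariate contour plot of the (X_t, X_{t+1}) pairs of a simulated AR(1) series, using a 2D kernel density estimate from the MASS package (shipped with R).

library(MASS)

set.seed(9)
x_ts <- as.numeric(arima.sim(model = list(ar = 0.8), n = 2000))
pairs_lag1 <- cbind(xt = head(x_ts, -1), xt1 = tail(x_ts, -1))

dens <- kde2d(pairs_lag1[, "xt"], pairs_lag1[, "xt1"], n = 50)
contour(dens, xlab = "X_t", ylab = "X_t+1",
        main = "Bivariate contour plot, AR(1) data")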
59. The R statistical software
§ R has become a successful tool for data analysis.
§ Well known in the Statistics, Machine Learning and Data Science communities.
§ "Free software environment for statistical computing and graphics."
§ http://www.cran.r-project.org
61. The R statistical software
• Exploratory analysis demo using R
• LaTeX document generation from R using Sweave (a minimal example follows below)
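A minimal sketch of a Sweave (.Rnw) file, i.e. LaTeX with embedded R chunks, compiled with Sweave("report.Rnw") followed by pdflatex; the file name and the data generated inside the chunk are illustrative assumptions.

% report.Rnw -- LaTeX with embedded R code chunks
\documentclass{article}
\begin{document}
\section{Exploratory analysis}

<<correlogram, fig=TRUE, echo=FALSE>>=
x <- arima.sim(model = list(ar = 0.8), n = 500)
acf(x, main = "Sample correlogram")
@

The series has \Sexpr{length(x)} observations.
\end{document}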
63. The Ramidst package
§ The package provides an interface for using the AMIDST toolbox functionality from R.
§ The interaction is actually carried out through the rJava package.
§ So far Ramidst provides functions for inference in static networks and concept drift detection using DBNs.
§ Extensive extra functionality is available thanks to the HUGIN link.
64. The AMIDST toolbox
• Scalable framework for data stream processing.
• Based on probabilistic graphical models.
• Unique FP7 project for data stream mining using PGMs.
• Open source software (Apache Software License 2.0).
65. The AMIDST toolbox official website
http://amidst.github.io/toolbox/
66. Available for download at GitHub
§ Download:
  :> git clone https://github.com/amidst/toolbox.git
§ Compile:
  :> ./compile.sh
§ Run:
  :> ./run.sh <class-name>
67. Please give our project a “star”!
68. Processing data streams in R
§ RMOA
  § MOA is a state-of-the-art tool for data stream mining.
  § RMOA provides functionality for accessing MOA from R.
  § Several static models are available.
  § They can be learnt from streams.
  § Streams can be created from csv files or from different R objects (a chunked-reading sketch in plain R is shown below).
§ http://moa.cms.waikato.ac.nz
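The following is a minimal sketch in plain base R (not the RMOA API) of treating a csv file as a stream by processing it in fixed-size chunks through a file connection; "sensors.csv" and its "temperature" column are hypothetical names.

con <- file("sensors.csv", open = "r")
first <- read.csv(con, nrows = 1000, header = TRUE)   # first chunk, with header
n <- nrow(first); s <- sum(first$temperature)

repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = 1000, header = FALSE, col.names = names(first)),
    error = function(e) NULL)                          # end of file
  if (is.null(chunk) || nrow(chunk) == 0) break
  n <- n + nrow(chunk)
  s <- s + sum(chunk$temperature)
}
close(con)
s / n        # running mean of the hypothetical "temperature" column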
69. The Ramidst package
Inference and concept drift demo using Ramidst
70. This project has received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no 619209.