Today, omnipresent sensors continuously provide streaming data about the environments in which they operate. For instance, a typical monitoring and analysis system may use streaming data generated by sensors to monitor the status of a particular device and to make predictions about its future behaviour, or to diagnostically infer the most likely system configuration that has produced the observed data. Sources of streaming data with even a modest updating frequency can produce extremely large volumes of data, making efficient and accurate data analysis and prediction difficult. One of the main challenges is handling uncertainty in the data: principled methods and algorithms for dealing with uncertainty in massive data applications are required. Probabilistic graphical models (PGMs) provide a well-founded and principled approach for performing inference and belief updating in complex domains endowed with uncertainty. The ongoing EU-FP7 research project AMIDST (Analysis of MassIve Data STreams, http://www.amidst.eu) aims at producing scalable methods, based on Bayesian network technology, for handling massive data streams. All of the developed methods will be made available through a software suite composed of the HUGIN software (http://amidst.hugin.com) and the open source AMIDST Toolbox. Meanwhile, the R statistical package (http://www.cran.r-project.org) has become a widespread standard for data manipulation and statistical analysis.
The main goal of the tutorial is to show how R and the AMIDST toolbox can be linked to assist in the complete lifecycle of data stream processing, from exploratory analysis to probabilistic inference. To achieve this goal, several existing R packages will be used, and the Ramidst package will be introduced to the community.
More info: http://simd.albacete.org/caepia15/conferencia/tutoriales/analysis-of-massive-data-streams-using-r/
Analysis of Massive Data Streams Using R (CAEPIA 2015)
1. Analysis of Massive Data Streams Using R
Antonio Salmerón (1), Helge Langseth (2), Anders L. Madsen (3,4), Thomas D. Nielsen (4)
(1) Dept. Mathematics, University of Almería, Spain
(2) Dept. Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway
(3) Hugin Expert A/S, Aalborg, Denmark
(4) Dept. Computer Science, Aalborg University, Denmark
CAEPIA 2015, Albacete, November 9, 2015
2. Outline
1. Introduction
   o Data streams
   o Challenges when processing data streams
   o Why Bayesian networks?
   o The AMIDST project
2. Bayesian networks
   o Static and dynamic models
   o Inference and learning
3. Exploratory analysis
   o Exploratory time series analysis in R
   o Report generation: LaTeX + R
4. The Ramidst package
   o The AMIDST toolbox
   o Using the AMIDST toolbox from R
4. Data Streams everywhere
• Unbounded flows of data are generated daily:
  • Social networks
  • Network monitoring
  • Financial/banking industry
  • …
5. Data Stream Processing
• Processing data streams is challenging:
  – They do not fit in main memory
  – Continuous model updating
  – Continuous inference/prediction
  – Concept drift
6. Processing Massive Data Streams
• Scalability is a main issue:
  • Scalable computing infrastructure
  • Scalable models and algorithms
7. Why Bayesian networks?
§ Example:
  § Stream of sensor measurements about temperature and smoke presence in a given geographical area.
  § The stream is analysed to detect the presence of fire (an event detection problem).
8. Why Bayesian networks?
§ The problem can be approached as an anomaly detection task (outliers).
§ A commonly used method is streaming k-means (a minimal sketch of the idea is given below).
[Figure: sensor readings over time with one point labelled "Anomaly".]
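The following is a minimal sketch, in base R, of the idea behind streaming k-means for anomaly detection; it is not the MOA/AMIDST implementation, and the distance threshold, initialisation and step-size schedule are illustrative assumptions.

# Online (MacQueen-style) k-means update over a stream, flagging points
# that are far from every centroid as anomalies.
streaming_kmeans <- function(stream, k = 2, threshold = 3) {
  centroids <- stream[1:k, , drop = FALSE]   # initialise with the first k points
  counts <- rep(1, k)
  anomalies <- integer(0)
  for (i in (k + 1):nrow(stream)) {
    x <- stream[i, ]
    d <- sqrt(colSums((t(centroids) - x)^2)) # distance to each centroid
    j <- which.min(d)
    if (d[j] > threshold) anomalies <- c(anomalies, i)  # flag outlier
    counts[j] <- counts[j] + 1
    centroids[j, ] <- centroids[j, ] + (x - centroids[j, ]) / counts[j]  # online update
  }
  list(centroids = centroids, anomalies = anomalies)
}

set.seed(1)
sensor <- cbind(temp = rnorm(500, 20, 1), smoke = rnorm(500, 0.1, 0.05))
sensor[250, ] <- c(45, 0.9)                  # injected anomaly
streaming_kmeans(sensor, k = 2)$anomalies

Each point updates the nearest centroid with a decreasing step size, so the whole stream never needs to be kept in main memory.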
9. Why Bayesian networks?
§ Often, data streams are handled using black-box models:
  § Pros:
    § No need to understand the problem
  § Cons:
    § Hyper-parameters to be tuned
    § Black-box models can seldom explain away
[Diagram: Stream → Black-box model → Predictions]
10. Why Bayesian networks?
§ Bayesian networks:
  § Open-box models
  § Encode prior knowledge.
  § Continuous and discrete variables (CLG networks).
§ Example:
[Figure: network over Fire, Temp, Smoke and the sensor readings T1, T2, T3, S1, used to compute p(Fire = true | t1, t2, t3, s1).]
13. The AMIDST project
§ FP7-funded EU project
§ Large number of variables
§ Data arriving in streams
§ Based on hybrid Bayesian networks
§ Open source toolbox with learning and inference capabilities
§ Two use cases provided by industrial partners:
  § Prediction of maneuvers in highway traffic (Daimler)
  § Risk prediction in credit operations and customer profiling (BCC)
§ http://www.amidst.eu
[Poster excerpt (Daimler use case, in Spanish; text partially truncated): modelling with hybrid dynamic Bayesian networks; results obtained in the prediction of traffic maneuvers; two-time-slice dynamic Bayesian networks and Markov models; analysing the trend of the probability makes it possible to predict maneuvers earlier than with other methods; dynamic Bayesian networks with approximate inference are a suitable tool for this problem, and the AMIDST toolbox supports the required real-time analysis.]
15. Definition
§ Formally, a Bayesian network consists of
  § a directed acyclic graph (DAG) where each node is a random variable, and
  § a set of conditional probability distributions, one for each variable conditional on its parents in the DAG.
§ For a set of variables X = {X_1, ..., X_N}, the joint distribution factorizes as

  p(X) = ∏_{i=1}^{N} p(X_i | Pa(X_i)),

  where Pa(X_i) ⊂ X \ {X_i} represents the so-called parent variables of X_i, i.e., the nodes with an arc pointing to X_i in the DAG.
§ The factorization allows local computations (a toy example in R follows below).
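As a toy illustration of the factorization, the following base R sketch evaluates the joint of a hypothetical two-node network Fire → Smoke (the probability values are made up) and obtains a posterior by conditioning the factorized joint.

# Minimal sketch: p(Fire, Smoke) = p(Fire) * p(Smoke | Fire) for a toy network.
p_fire  <- c(yes = 0.01, no = 0.99)                    # p(Fire)
p_smoke <- rbind(yes = c(yes = 0.90, no = 0.10),       # p(Smoke | Fire = yes)
                 no  = c(yes = 0.05, no = 0.95))       # p(Smoke | Fire = no)

joint <- sweep(p_smoke, 1, p_fire, "*")                # joint[fire, smoke]

# Posterior p(Fire | Smoke = yes), computed locally from the factorized joint
joint[, "yes"] / sum(joint[, "yes"])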
16. Reading independencies
Independence relations can be read off from the structure.
There are three types of connections:
§ Serial: A → B → C
§ Diverging: A ← B → C
§ Converging: A → B ← C
17. Reading independencies. Example
[Figure: network over Fire, Temp, Smoke and the sensor readings T1, T2, T3, S1.]
• Knowing the temperature with certainty makes the temperature sensor readings and the event of fire independent.
• The smoke sensor reading is also irrelevant to the event of fire if Smoke is known for sure.
18. Reading independencies. Example
[Figure: same network, extended with a Sun variable.]
• Knowing the temperature with certainty makes the temperature sensor readings and the event of fire independent.
• The smoke sensor reading is also irrelevant to the event of fire if Smoke is known for sure.
• If there is no info about Temp or the sensor readings, Sun and Fire are independent.
19. Hybrid Bayesian networks
• In a hybrid Bayesian network, discrete and continuous variables coexist.
• Mixtures of truncated basis functions (MoTBFs) have been successfully used in this context (Langseth et al. 2012):
  • Mixtures of truncated exponentials (MTEs)
  • Mixtures of polynomials (MoPs)
• MoTBFs support efficient inference and learning in a static setting.
• Learning from streams is more problematic: the reason is that MoTBFs do not belong to the exponential family.
20. The exponential family
• A family of probability (density or mass) functions {f(x; θ) : θ ∈ Θ ⊆ R^k} belongs to the k-parametric exponential family if it can be expressed as

  f(x; θ) = exp{ Σ_{i=1}^{k} Q_i(θ) T_i(x) + D(θ) + S(x) },

  or, equivalently, f(x; θ) = H(x) C(θ) exp{Q(θ) T(x)}.
• The T_i functions are the sufficient statistics for the unknown parameters, i.e., they contain all the information in the sample that is relevant for estimating the parameters.
• They have dimension 1.
• We can therefore "compress" all the information in the stream seen so far into a single number for each parameter (see the sketch below).
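As a simple illustration of this "compression", the following base R sketch keeps only the count and the sufficient statistics sum(x) and sum(x^2) of a simulated Gaussian stream, and re-estimates the parameters from those three numbers at any time; the raw stream itself is never stored.

# Minimal sketch: streaming sufficient statistics for a Gaussian.
suff <- c(n = 0, s = 0, s2 = 0)

update_suff <- function(suff, x) {
  suff + c(n = length(x), s = sum(x), s2 = sum(x^2))
}

estimate <- function(suff) {
  m <- suff[["s"]] / suff[["n"]]
  v <- suff[["s2"]] / suff[["n"]] - m^2    # ML estimate of the variance
  c(mean = m, var = v)
}

set.seed(7)
for (batch in 1:100) {                     # simulate 100 incoming mini-batches
  suff <- update_suff(suff, rnorm(1000, mean = 20, sd = 2))
}
estimate(suff)                             # close to mean 20, variance 4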
21. Hybrid Bayesian networks. CLGs
Conditional Linear Gaussian networks
A Conditional Linear Gaussian (CLG) network is a hybrid Bayesian network where
• the conditional distribution of each discrete variable X_D given its parents is a multinomial, and
• the conditional distribution of each continuous variable Z with discrete parents X_D and continuous parents X_C is

  p(z | X_D = x_D, X_C = x_C) = N(z; α(x_D) + β(x_D)ᵀ x_C, σ(x_D))

  for all x_D and x_C, where α and β are the coefficients of a linear regression model of Z given X_C, potentially different for each configuration of X_D.
CLGs belong to the exponential family.
22. CLGs: Example
[Figure: network over the discrete variables Y, S and the continuous variables W, T, U.]
P(Y) = (0.5, 0.5)
P(S) = (0.1, 0.9)
f(w | Y = 0) = N(w; 1, 1)
f(w | Y = 1) = N(w; 2, 1)
f(t | w, S = 0) = N(t; w, 1)
f(t | w, S = 1) = N(t; w, 1)
f(u | w) = N(u; w, 1)
(A forward-sampling sketch in R follows below.)
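A minimal forward-sampling sketch of this example in base R, taking the densities exactly as written above (note that both S-configurations of f(t | w, S) share the same parameters in the extracted slide); reading N(x; m, 1) as a normal with mean m and unit standard deviation, and P(S) = (0.1, 0.9) as P(S = 0) = 0.1, P(S = 1) = 0.9, are assumptions.

simulate_clg <- function(n) {
  Y <- rbinom(n, 1, 0.5)                       # P(Y) = (0.5, 0.5)
  S <- rbinom(n, 1, 0.9)                       # assumption: P(S = 1) = 0.9
  W <- rnorm(n, mean = ifelse(Y == 0, 1, 2), sd = 1)
  T <- rnorm(n, mean = W, sd = 1)              # same regression for S = 0 and S = 1
  U <- rnorm(n, mean = W, sd = 1)
  data.frame(Y, S, W, T, U)
}

set.seed(42)
head(simulate_clg(5))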
23. CLGs: Example
[Same network and distributions as in the previous slide.]
§ Limitation: discrete nodes are not allowed to have continuous parents.
§ This is not a big problem for Bayesian classifiers.
24. Bayesian network classifiers
§ The structure is usually restricted.
§ There is a distinguished (discrete) variable called the class, while the rest are called features.
§ Examples:
[Figure 1: Structure of naive Bayes (a) and tree-augmented network (TAN) (b) classifiers.]
In general, there are several possible TAN structures for a given set of variables. The way to choose among them is to construct a maximum weight spanning tree containing the features, where the weight of each edge is the conditional mutual information between the two features given the class.
25. Bayesian network classifiers
§ The class value is determined as follows: an object with observed features x_1, ..., x_n is classified as belonging to the class

  c* = arg max_{c ∈ Ω_C} p(c | x_1, ..., x_n),

  where Ω_C denotes the set of all possible values of C.
§ Since p(c | x_1, ..., x_n) is proportional to p(c) × p(x_1, ..., x_n | c), an n-dimensional distribution would in principle have to be specified, whose number of parameters is, in the worst case, exponential in the number of variables. Networks with fixed or restricted structures are therefore used for classification tasks.
§ In the case of Naïve Bayes, all feature variables are conditionally independent given C, so that

  p(c | x_1, ..., x_n) ∝ p(c) ∏_{i=1}^{n} p(x_i | c),

  which means that, instead of one n-dimensional conditional density, only n one-dimensional conditional densities must be estimated. In TAN models, more dependencies are allowed. (A small R sketch follows below.)
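The following base R sketch is a hand-rolled Gaussian naive Bayes illustrating p(c | x_1, ..., x_n) ∝ p(c) ∏ p(x_i | c); it is not the AMIDST implementation, and it simply uses the built-in iris data set for illustration.

train_nb <- function(x, y) {
  list(prior = table(y) / length(y),
       mean  = sapply(split(x, y), colMeans),
       sd    = sapply(split(x, y), function(d) apply(d, 2, sd)))
}

predict_nb <- function(model, xnew) {
  scores <- sapply(colnames(model$mean), function(cl) {
    log(model$prior[[cl]]) +
      sum(dnorm(unlist(xnew), model$mean[, cl], model$sd[, cl], log = TRUE))
  })
  names(which.max(scores))    # arg max over classes of log p(c) + sum_i log p(x_i | c)
}

fit <- train_nb(iris[, 1:4], iris$Species)
predict_nb(fit, iris[15, 1:4])   # a setosa observation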
26. Reasoning over time: Dynamic Bayesian networks
§ Temporal reasoning can be accommodated within BNs.
§ Variables are indexed over time, giving rise to dynamic Bayesian networks.
§ We have to model the joint distribution over time.
§ Dynamic BNs reduce the factorization complexity by adopting the Markov assumption.

Similarly to static BNs, we model our problem/system using a set of stochastic random variables, denoted X_t, with the main difference that variables are indexed here by a discrete time index t; we always assume that the system is described at a fixed frequency and use X_{a:b} ≡ X_a, X_{a+1}, ..., X_b to denote the set of variables between two time points a and b. For reasoning over time, we need to model the joint probability p(X_{1:T}), which has the following natural cascade decomposition:

  p(X_{1:T}) = ∏_{t=1}^{T} p(X_t | X_{1:t−1}),

where p(X_t | X_{1:t−1}) is equal to p(X_1) for t = 1. As t increases, the conditional probability p(X_t | X_{1:t−1}) becomes intractable. Similarly to static BNs, dynamic BNs use a more compact factorization of the above joint probability. The first kind of conditional independence assumption encoded by DBNs to reduce the factorization complexity is the well-known Markov assumption: the current state is independent of the past given a finite number of previous steps, and the resulting models are referred to as Markov chains. A Markov chain can be defined over either discrete or continuous variables X_{1:T} and exploits the following equality:

  p(X_t | X_{1:t−1}) = p(X_t | X_{t−V:t−1}),

where V ≥ 1 is the order of the Markov chain.
27. Reasoning over time: Dynamic Bayesian networks
§ DBN assuming a third-order Markov assumption
§ DBN assuming a first-order Markov assumption
[Figure 3.3: An example of DBNs assuming a third-order (above) and a first-order (below) Markov property.]
A first-order Markov assumption can be unrealistic in some problems, leading to poor approximations of the distribution; the Markov order can be increased to improve the approximation.
28. Particular cases of Dynamic Bayesian networks
§ Hidden Markov models
§ The joint distribution of the hidden (X) and observed (Y) variables is

  P(X_{1:T}, Y_{1:T}) = ∏_{t=1}^{T} P(X_t | X_{t−1}) P(Y_t | X_t).

[Figure 3.4: An example of a BN structure corresponding to a HMM.]
Although most of our models will fit into this description of observed and hidden (state) variables, there will be cases in which the transition model takes place in the observed variables, which in general simplifies the learning and inference processes of the problem. (A simulation sketch in R follows below.)
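A minimal sketch of forward simulation from a two-state HMM with Gaussian emissions, following P(X_{1:T}, Y_{1:T}) = ∏_t P(X_t | X_{t−1}) P(Y_t | X_t); the transition matrix and emission parameters are illustrative assumptions.

simulate_hmm <- function(T_len, A, init, means, sds) {
  x <- numeric(T_len); y <- numeric(T_len)
  x[1] <- sample(1:2, 1, prob = init)
  y[1] <- rnorm(1, means[x[1]], sds[x[1]])
  for (t in 2:T_len) {
    x[t] <- sample(1:2, 1, prob = A[x[t - 1], ])   # P(X_t | X_{t-1})
    y[t] <- rnorm(1, means[x[t]], sds[x[t]])       # P(Y_t | X_t)
  }
  data.frame(state = x, obs = y)
}

A <- rbind(c(0.95, 0.05),
           c(0.10, 0.90))
set.seed(3)
head(simulate_hmm(200, A, init = c(0.5, 0.5), means = c(0, 3), sds = c(1, 1)))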
29. Particular cases of Dynamic Bayesian networks
§ Input-output hidden Markov models
§ Linear dynamic systems: switching Kalman filter

An extension of the HMM is the so-called input-output hidden Markov model (IOHMM), shown in Figure 3.5. The IOHMM incorporates an extra top layer of input variables Y'_{1:T}, which can be either continuous or discrete; the existing HMM layer of observed variables, Y_{1:T}, is referred to as the output set of variables.

[Figure 3.5: An example of a BN structure corresponding to an IO-HMM.]

The IOHMM is usually employed in supervised classification problems, where both input and output variables are known during training but only the former is known during testing; inference is then performed to predict the output variables at each time step. In AMIDST we use this model in a different way: both sets of input and output variables are always known, so inference is only performed to predict the latent variables, and the input variables Y'_{1:T} are introduced as a way to "relax" the stationarity assumption by explicitly introducing a dependency on some observed information at each time slice.

Similar to the extension of the static BN model to hybrid domains, DBNs have likewise been extended to continuous and hybrid domains. In purely continuous domains, where the continuous variables follow linear Gaussian distributions, the DBN corresponds to (a factorized version of) a Kalman filter (KF). The structure of a KF is exactly the same as the one displayed in Figure 3.4 for the HMM, with the restriction that all variables are continuous; the state variables can be a combination of continuous variables with different dependences, and the dynamics of the process are assumed to be linear. When modelling non-linear domains, the dynamics and observational distributions are often approximated through, e.g., the extended Kalman filter, which models the system as locally linear in the mean of the current state distribution. Another type of model ensuring non-linear predictions with a more expressive representation is the switching Kalman filter (SKF): the type of SKF considered here includes an extra discrete state variable that is able to use a weighted combination of the linear sub-models, i.e., the discrete state variable assigns a probability to each linear term in the mixture, hence representing the belief state as a mixture of Gaussians. In this way it can deal, to some extent, with violations of both the assumption of linearity and of Gaussian noise.

[Figure 3.6: An example of a switching Kalman filter. Z_t represents the discrete state.]
30. Two-time slice Dynamic Bayesian networks (2T-DBN)
§ They constitute the main dynamic model in AMIDST.
§ The transition distribution is

  p(X_{t+1} | X_t) = ∏_{X_{t+1} ∈ X_{t+1}} p(X_{t+1} | Pa(X_{t+1})),

  where Pa(X_{t+1}) refers to the set of parents of the variable X_{t+1} in the transition model.

In general, DBNs can model arbitrary distributions over time. However, in AMIDST we especially focus on the so-called two-time slice DBNs (2T-DBNs), characterised by an initial model representing the initial joint distribution of the process and a transition model representing a standard BN repeated over time. This kind of DBN model satisfies both the first-order Markov assumption and the stationarity assumption.

[Figure 3.7: An example of a BN structure corresponding to a 2T-DBN.]
31. Inference in CLG networks
§ There are three ways of querying a BN:
  § Belief updating (probability propagation)
  § Maximum a posteriori (MAP)
  § Most probable explanation (MPE)
32. Inference in CLG networks
Querying a Bayesian network (I)
• Probabilistic inference: computing the posterior distribution of a target variable:

  p(x_i | x_E) = ( Σ_{x_D} ∫_{x_C} p(x, x_E) dx_C ) / ( Σ_{x_{D_i}} ∫_{x_{C_i}} p(x, x_E) dx_{C_i} )
33. Inference in CLG networks
Querying a Bayesian network (II)
• Maximum a posteriori (MAP): for a set of target variables X_I, the goal is to compute

  x*_I = arg max_{x_I} p(x_I | X_E = x_E),

  where p(x_I | X_E = x_E) is obtained by first marginalizing out from p(x) the variables not in X_I and not in X_E.
• Most probable explanation (MPE): a particular case of MAP where X_I includes all the unobserved variables.
34. Probability propagation in CLG networks: Importance sampling
• Let θ denote the numerator of the posterior for the target variable, i.e. θ = ∫_a^b h(x_i) dx_i, with

  h(x_i) = Σ_{x_D ∈ Ω_{X_D}} ∫_{x_C ∈ Ω_{X_C}} p(x; x_E) dx_C.

• Then, we can write θ as

  θ = ∫_a^b h(x_i) dx_i = ∫_a^b ( h(x_i) / p*(x_i) ) p*(x_i) dx_i = E_{p*}[ h(X*_i) / p*(X*_i) ],

  where p* is a probability density function on (a, b) called the sampling distribution, and X*_i is a random variable with density p*.
• Therefore, we have transformed the problem of probability propagation into estimating the expected value of a random variable from a sample drawn from a distribution of our own choice.
35. Probability propagation in CLG networks: Importance sampling
• The expected value can be estimated using a sample mean estimator. Let X*_i^(1), ..., X*_i^(m) be a sample drawn from p*. Then a consistent, unbiased estimator of θ is given by

  θ̂_1 = (1/m) Σ_{j=1}^{m} h(X*_i^(j)) / p*(X*_i^(j)).

• Since θ̂_1 is unbiased, the error of the estimation is determined by its variance:

  Var(θ̂_1) = Var( (1/m) Σ_{j=1}^{m} h(X*_i^(j)) / p*(X*_i^(j)) ) = (1/m²) Σ_{j=1}^{m} Var( h(X*_i^(j)) / p*(X*_i^(j)) ).

• In AMIDST, the sampling distribution is formed by the conditional distributions in the network (evidence weighting). (A small numerical sketch follows below.)
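The following base R sketch illustrates the estimator θ̂_1 on a simple one-dimensional integral; the target function h and the uniform sampling distribution are arbitrary illustrative choices, not AMIDST network quantities.

# Minimal sketch: importance sampling for theta = integral of h over (a, b).
a <- 0; b <- 3
h <- function(x) exp(-x^2 / 2)          # unnormalised target
m <- 1e5

xs <- runif(m, a, b)                    # sample from p*(x) = 1 / (b - a)
w  <- h(xs) / dunif(xs, a, b)           # h(X*) / p*(X*)
mean(w)                                 # the estimator theta_hat_1
sqrt(var(w) / m)                        # Monte Carlo standard error
integrate(h, a, b)$value                # reference value for comparison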
36. Probability propagation in CLG networks: Importance sampling
[Diagram: the input stream and the sampling distribution are distributed (map phase) across computational units (C.U.) performing sample generation; the sufficient statistics are combined in the reduce phase.]
37. Probability propagation in CLG networks: Importance sampling
[Chart: response for an input stream with a network of 500 variables.]
38. Probability propagation in CLG networks: Importance sampling
[Chart: response for an input stream with a network of 10 variables.]
39. MAP in CLG networks
MAP is similar to probability propagation but:
• First marginalize out by sum/integral (sum phase)
• Then maximize (max phase)
Constrained order → higher complexity
40. MAP in CLG networks
MAP in the AMIDST Toolbox:
• Hill climbing (global and local change)
• Simulated annealing (global and local change)
• Sampling
41. MAP in CLG networks
[Diagram: the stream and sampling distribution are distributed (map phase) across computational units (C.U.) started from multiple starting points; the local solutions are combined in the reduce phase.]
42. Inference in dynamic networks
Inference in DBNs faces the problem of entanglement: all variables used to encode the belief state at time t = 2 become dependent after observing {e0, e1, e2}.
43. Inference in dynamic networks
• Inference in DBNs is approached following a Bayesian formulation + Variational Bayes.
• Variational message passing is based on the variational approximation to a posterior distribution p(x_I), which is defined as

  q*(x_I) = arg min_{q ∈ Q} D(q(x_I) || p(x_I)),

  where D(q || p) is the KL divergence from q to p. (A small numerical example of the KL objective follows below.)
• An alternative is to focus on D(p(x_I) || q(x_I)), which corresponds to expectation propagation.
• The optimal variational distribution is computed iteratively.
• Factored frontier, which assumes independence of the nodes connecting to the past given the observations.
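As a small numerical illustration of the KL objective D(q || p) (not of the AMIDST message-passing scheme itself), the sketch below evaluates the closed-form KL divergence between two univariate Gaussians and shows that it vanishes when q equals p.

# Minimal sketch: D(q || p) for q = N(m_q, s_q^2), p = N(m_p, s_p^2), using
# the closed form log(s_p/s_q) + (s_q^2 + (m_q - m_p)^2) / (2 s_p^2) - 1/2.
kl_gauss <- function(m_q, s_q, m_p, s_p) {
  log(s_p / s_q) + (s_q^2 + (m_q - m_p)^2) / (2 * s_p^2) - 0.5
}

kl_gauss(0, 1, 0, 1)       # identical distributions: 0
kl_gauss(0, 1, 2, 1.5)     # a positive divergence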
44. Learning CLG networks from data
§ Learning the structure
  § Methods based on conditional independence tests
  § Score based techniques
§ Estimating the parameters
  § Bayesian approach
  § Frequentist approach (maximum likelihood)
45. Learning CLG networks from data
§ Bayesian parameter learning:
  § Parameters are considered random variables rather than fixed quantities.
  § A prior distribution is assigned to the parameters, representing the state of knowledge before observing the data.
  § The prior is updated in the light of new data.
§ The Bayesian framework naturally deals with data streams (a small streaming example follows below):

  p(θ | d_1, ..., d_n, d_{n+1}) ∝ p(d_{n+1} | θ) p(θ | d_1, ..., d_n)
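A minimal sketch of the recursive update on a stream, using a conjugate Beta prior on the success probability of Bernoulli observations (the data are simulated for illustration): the posterior after each data point becomes the prior for the next one.

a <- 1; b <- 1                       # Beta(1, 1) prior
set.seed(11)
stream <- rbinom(10000, 1, 0.3)      # incoming 0/1 observations

for (d in stream) {                  # p(theta | d_1..d_{n+1}) ∝ p(d_{n+1} | theta) p(theta | d_1..d_n)
  a <- a + d
  b <- b + (1 - d)
}
c(posterior_mean = a / (a + b))      # close to the true value 0.3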
46. Learning CLG networks from data
Parameter learning by inference. Simple example:
• Random walk over Y1, Y2, ...
• f(y_t | y_{t−1}) ~ N(y_{t−1}, τ⁻¹)
• The precision τ is unknown.
[Figure: chain Y1 → Y2 → Y3 → Y4 → Y5.]
47. Learning CLG networks from data
Parameter learning by inference. Simple example:
• Random walk over Y1, Y2, ..., with f(y_t | y_{t−1}) ~ N(y_{t−1}, 1/τ) and precision τ unknown.
[Figure: chain Y1 → … → Y5 with the precision τ (and its hyperparameters) as an additional parent.]
The Bayesian solution:
• Model unknown parameters as random variables.
• Use Bayes' formula with "clever" distribution families:

  f(τ | y_{1:T}, a, b) = f(τ | a, b) ∏_{t=1}^{T} f(y_t | y_{t−1}, τ) / f(y_{1:T} | a, b).

Efficient inference leads to efficient learning!
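A minimal sketch of this example in base R, assuming (as a "clever" conjugate choice, not stated explicitly on the slide) a Gamma(a, b) prior on τ with shape a and rate b; the posterior is then Gamma(a + (T − 1)/2, b + Σ (y_t − y_{t−1})² / 2).

set.seed(5)
true_tau <- 4                                        # i.e. sd = 0.5
T_len <- 5000
y <- cumsum(rnorm(T_len, sd = sqrt(1 / true_tau)))   # random walk: y_t = y_{t-1} + noise

a <- 1; b <- 1                                       # prior hyper-parameters
diffs <- diff(y)
a_post <- a + (T_len - 1) / 2
b_post <- b + sum(diffs^2) / 2

c(posterior_mean_tau = a_post / b_post)              # close to the true precision 4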
51. Exploratory analysis
§ Exploratory analysis helps us in testing model assumptions.
§ It also improves the modeler's knowledge about the problem and its nature.
§ Dynamic Bayesian networks aim at modeling complex time correlations.
52. Sample correlogram
§ Let x_1, ..., x_T be a univariate time series. The sample autocorrelation coefficient at lag v is given by

  ρ̂_v = Σ_{t=1}^{T−v} (x_t − x̄)(x_{t+v} − x̄) / Σ_{t=1}^{T} (x_t − x̄)²,

  where x̄ is the sample mean and T is the total length of the considered data.
§ It represents Pearson's correlation coefficient between the series {x_t}_{t ∈ {1,...,T}} and {x_{t+v}}_{t+v ∈ {1,...,T}}.
§ The sample correlogram is the plot of the sample autocorrelation ρ̂_v versus v, for v = 1, ..., M, for some maximum lag M.
§ Sample correlograms can be interpreted as a way to measure the strength of the unconditional dependences X_t ⊥̸ X_{t+v} for some lag v ≥ 1: when ρ̂_v is close to zero there is a strong unconditional independence between X_t and X_{t+v}, whereas values close to 1 or −1 indicate a strong dependence. (See the R sketch below.)
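A minimal sketch of the sample correlogram in base R with acf(), on i.i.d. data and on an autocorrelated series (an AR(1) process simulated with arima.sim); the AR coefficient is an illustrative assumption.

set.seed(9)
x_iid <- rnorm(500)
x_ts  <- arima.sim(model = list(ar = 0.8), n = 500)

op <- par(mfrow = c(1, 2))
acf(x_iid, lag.max = 30, main = "Correlogram, i.i.d. data")
acf(x_ts,  lag.max = 30, main = "Correlogram, AR(1) data")
par(op)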
53. Sample correlogram for independent data
[Figure (a): correlogram for i.i.d. data.]
54. Sample correlogram for time correlated data
[Figure (b): correlogram for a time series data.]
55. Sample partial correlogram
§ Consider the regression model

  X_t = a_0 + a_1 X_{t−1} + a_2 X_{t−2} + ... + a_{v−1} X_{t−v+1}

§ Let e_{t,v} denote the residuals of this regression problem (i.e., the errors made when estimating X_t using a linear combination of the v − 1 previous observations).
§ The sample partial auto-correlation coefficient of lag v, denoted θ̂_v, is the standard sample auto-correlation between the series {x_{t−v}}_{t−v ∈ {1,...,T}} and {e_{t,v}}_{t ∈ {1,...,T}}.
§ It can be seen as the correlation between X_t and X_{t−v} after having removed the common linear effect of the data in between. (See the pacf() sketch below.)
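A minimal sketch of sample partial correlograms in base R with pacf(), on the same kind of i.i.d. and AR(1) series used in the correlogram sketch above.

set.seed(9)
x_iid <- rnorm(500)
x_ts  <- arima.sim(model = list(ar = 0.8), n = 500)

op <- par(mfrow = c(1, 2))
pacf(x_iid, lag.max = 30, main = "Partial correlogram, i.i.d. data")
pacf(x_ts,  lag.max = 30, main = "Partial correlogram, AR(1) data")
par(op)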
56. Sample partial correlogram for independent data
[Figure (c): partial correlogram for i.i.d. data.]
57. Sample partial correlogram for time correlated data
[Figure (d): partial correlogram for a time series data.]
58. Bivariate contour plots
The bivariate contour plot for the time series data shows that X_t and X_{t+1} seem to be distributed according to a bivariate normal with a covariance matrix that displays a strong degree of correlation. In the case of i.i.d. data, the bivariate contour plot does not reveal any temporal dependence between X_t and X_{t−1}.
[Figure 3.9: Bivariate contour plots for a set of i.i.d. and time series data.]
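A minimal sketch of such a plot in R: a bivariate contour plot of the (X_t, X_{t+1}) pairs of a simulated AR(1) series, using a 2D kernel density estimate from the MASS package (shipped with R).

library(MASS)

set.seed(9)
x_ts <- as.numeric(arima.sim(model = list(ar = 0.8), n = 2000))
pairs_lag1 <- cbind(xt = head(x_ts, -1), xt1 = tail(x_ts, -1))

dens <- kde2d(pairs_lag1[, "xt"], pairs_lag1[, "xt1"], n = 50)
contour(dens, xlab = "X_t", ylab = "X_t+1",
        main = "Bivariate contour plot, AR(1) data")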
59. The R statistical software
§ R has become a successful tool for data analysis.
§ Well known in the Statistics, Machine Learning and Data Science communities.
§ "Free software environment for statistical computing and graphics."
§ http://www.cran.r-project.org
61. The R statistical software
• Exploratory analysis demo using R
• LaTeX document generation from R using Sweave (a minimal example follows below)
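A minimal sketch of a Sweave (.Rnw) file, i.e. LaTeX with embedded R chunks, compiled with Sweave("report.Rnw") followed by pdflatex; the file name and the data generated inside the chunk are illustrative assumptions.

% report.Rnw -- LaTeX with embedded R code chunks
\documentclass{article}
\begin{document}
\section{Exploratory analysis}

<<correlogram, fig=TRUE, echo=FALSE>>=
x <- arima.sim(model = list(ar = 0.8), n = 500)
acf(x, main = "Sample correlogram")
@

The series has \Sexpr{length(x)} observations.
\end{document}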
63. The Ramidst package
§ The package provides an interface for using the AMIDST toolbox functionality from R.
§ The interaction is actually carried out through the rJava package.
§ So far Ramidst provides functions for inference in static networks and concept drift detection using DBNs.
§ Extensive extra functionality is available thanks to the HUGIN link.
64. The AMIDST toolbox
• Scalable framework for data stream processing.
• Based on probabilistic graphical models.
• Unique FP7 project for data stream mining using PGMs.
• Open source software (Apache Software License 2.0).
65. The AMIDST toolbox official website
http://amidst.github.io/toolbox/
66. Available for download at GitHub
§ Download:
  :> git clone https://github.com/amidst/toolbox.git
§ Compile:
  :> ./compile.sh
§ Run:
  :> ./run.sh <class-name>
67. Please give our project a “star”!
68. Processing data streams in R
§ RMOA
  § MOA is a state-of-the-art tool for data stream mining.
  § RMOA provides functionality for accessing MOA from R.
  § Several static models are available.
  § They can be learnt from streams.
  § Streams can be created from csv files or from different R objects (a chunked-reading sketch in plain R is shown below).
§ http://moa.cms.waikato.ac.nz
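The following is a minimal sketch in plain base R (not the RMOA API) of treating a csv file as a stream by processing it in fixed-size chunks through a file connection; "sensors.csv" and its "temperature" column are hypothetical names.

con <- file("sensors.csv", open = "r")
first <- read.csv(con, nrows = 1000, header = TRUE)   # first chunk, with header
n <- nrow(first); s <- sum(first$temperature)

repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = 1000, header = FALSE, col.names = names(first)),
    error = function(e) NULL)                          # end of file
  if (is.null(chunk) || nrow(chunk) == 0) break
  n <- n + nrow(chunk)
  s <- s + sum(chunk$temperature)
}
close(con)
s / n        # running mean of the hypothetical "temperature" column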
69. The Ramidst package
Inference and concept drift demo using Ramidst
70. This project has received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no 619209.