# Bayesian networks

## Introduction

Directed graphical models, popularly known as Bayesian networks, are an important family of probabilistic graphical models. They are a convenience method to express complicated relationships among random variables. Being graphical, Bayesian networks offer a simple way to visualize probabilistic dependencies between variables. Moreover, complex computations such as those required to perform learning and inference can be expressed in terms of graphical operations.

The undirected counterparts to Bayesian networks are known as Markov random fields, or simply undirected graphical models. We cover those separately.

## Prerequisites

To understand Bayesian networks, we recommend familiarity with the concepts in

• Probability: A sound understanding of conditional and marginal probabilities and Bayes Theorem is desirable.
• Introduction to machine learning: An introduction to basic concepts in machine learning such as classification, training instances, features, and feature types.

Follow the above links to first get acquainted with the corresponding concepts.

## Motivation

Consider the random variables $a, b, c, d, e$. Their joint distribution factorizes as follows

\begin{align} p(a, b, c, d, e) &= p(a|b,c,d,e)p(b,c,d,e) \\\\ &= p(a|b,c,d,e)p(b|c,d,e)p(c,d,e) \\\\ &= \cdots \\\\ &= p(a|b,c,d,e)p(b|c,d,e)p(c|d,e)p(d|e)p(e) \end{align}

Now, suppose we have been provided additional information

• The variable $d$ is independent of the variable $e$. Succinctly, $d \independent e$.
• The variable $a$ is independent of all other variables, given $b$. In shorthand, $a \independent \set{c,d,e} | b$.

How does this change our joint distribution formulation? Here's how.

\begin{align} p(a, b, c, d, e) &= p(a|b,c,d,e)p(b|c,d,e)p(c|d,e)p(d|e)p(e) \\\\ &= p(a|b)p(b|c,d,e)p(c|d,e)p(d)p(e) \end{align}

This additional information significantly simplifies the joint distribution. In a collection of many random variables, we need a mechanism to encode such information and the corresponding independence relationships in a principled way. The encoding of such conditional independence relationships is the primary benefit of directed graphical models.

## Distribution represented by a directed graph

Bayesian networks are a directed graph $G=(V,E)$, where $V$ denotes the set of nodes and $E$ denotes the set of directed edges between nodes. The nodes represent random variables and a directed arrow implies the dependence relationship between the connected variables.

• A directed arrow from the node representing random variable $p$ to the node representing random variable $q$ means the variable $p$ is the parent of the variable $q$.
• A variable is never independent of its parent. In other words, a directed arrow implies the terminating node is not independent of the starting node. Succinctly, $x \notindependent \sP_x$, where $\sP_x$ is the set of all parent variables of the variable $x$.
• A variable is independent of all other variables in the network, conditioned on its parents. In other words, missing connections in the graph imply conditional independence. In shorthand, $x \independent y | \sP_x$, where, the variable $y \in \left[ V\setminus \left( x \cup \sP_x \right)\right]$.

As a result, the joint distribution of all variables in the graph consisting of variables $\vx = \set{x_1,\ldots,x_K}$ can be written as

$$p(\vx) = \prod_{k=1}^K p(x_k|\sP_{x_k})$$

## D-separation

A crucial concept in the context of probabilistic graphical models is that D-separation refnum. This concept states the graph properties under which certain conditional independence statements hold for a particular graph.

We will look at specific path types in the graph to understand this concept.

Consider a graph $a \rightarrow c \rightarrow b$. The head of the arrow $a \rightarrow c$ reaches $c$. The tail of the arrow $c \rightarrow b$ starts at $c$. In other words, the path is head-to-tail at $c$.

For this graph, is $b$ independent of $a$ given $c$?

Yes. By the definition of the directed probabilistic model, the node $b$ is independent of $a$ given the node $c$. In short, $b \independent a | c$.

Why? The node $c$ blocked the path from $a$ to $b$. By knowing the value of $c$, we do not need any information about $a$ to infer the value of $b$.

The concept of D-separation extends this idea to subsets of nodes of the graph $\sA, \sB, \sC$.

A set of nodes $\sC$ block paths between sets of nodes $\sA$ and $\sB$, if all paths from the nodes in $\sA$ to the nodes in $\sB$ pass through the nodes in $\sC$ in a head-to-tail manner. This implies the conditional independence $\sB \independent \sA | \sC$.

Note that tail-to-head is just head-to-tail in an opposite direction and the same reasoning applies.

## Tail-to-tail paths

Now imagine the graph $a \leftarrow c \rightarrow b$. In this case, the node $c$ is the parent of both the nodes $a$ and $b$. The tails of both arrows start at $c$. In other words, the path at node $c$ is tail-to-tail.

For this graph, is $b$ independent of $a$ given $c$?

Yes again! $c$ is the parent of $b$ and of $a$. By definition, $b \independent a | c$.

Even in the tail-to-tail case, the node $c$ blocked information from node $a$ to node $b$.

The concept of D-separation extends this idea to subsets of nodes of the graph $\sA, \sB, \sC$.

A set of nodes $\sC$ block paths between sets of nodes $\sA$ and $\sB$, if all paths from the nodes in $\sA$ to the nodes in $\sB$ pass through the nodes in $\sC$ in a tail-to-tail manner. This implies the conditional independence $\sB \independent \sA | \sC$.

Finally, consider the graph $a \rightarrow c \leftarrow b$. In this case, the node $c$ is the child of both the nodes $a$ and $b$. The heads of both arrows end at $c$. In other words, the path at node $c$ is head-to-head.

For this graph, is $b$ independent of $a$ given $c$?

NO! Why? $c$ is the child of both $b$ and $a$. This means, if $c$ is known, then $b$'s value will depend on that of $a$. For example, imagine these three nodes were representing the following binary physical phenomena:

• $a$: Rainfall. If $a=1$, then it has rained. If $a=0$, then it has not rained.
• $b$: Sprinklers. If $b=1$, then sprinklers were ON. If $b=0$, then sprinklers were OFF.
• $c$: Grass wetness. If $c=1$, then the grass is wet. If $c=0$, then the grass is dry.

Clearly, this situation correctly reflects our example head-to-head graph $a \rightarrow c \leftarrow b$.

If we do not know the condition of the grass, then clearly the rainfall and sprinklers are independent events.

Now, consider the wet grass situation: In this situation, $a=0$ implies $b=1$ and $b=0$ implies $a=1$, because one of them (rainfall or sprinklers) are the culprit in wetting the grass. This means, $a \notindependent b | c$.

Thus, if $c$ is in a head-to-head path between $a$ and $b$, then it does not block information flow.

The concept of D-separation extends this idea to subsets of nodes of the graph $\sA, \sB, \sC$.

A set of nodes $\sC$ does not block paths between sets of nodes $\sA$ and $\sB$, if all paths from the nodes in $\sA$ to the nodes in $\sB$ pass through the nodes in $\sC$ in a head-to-head manner. This condition also holds if the paths pass through a node that is a descendent of a node in $\sC$. This implies the conditional independence $\sB \notindependent \sA | \sC$.

## Markov blanket

In a probabilistic graphical model, the set of nodes consisting of the parents, the children, and the co-parents is known as the Markov blanket or Markov boundary of a node. Co-parents are other nodes that share a child with the node.

Why is this concept useful? Let's look at an example.

Consider a joint distribution of $N$ random variables $x_1, \ldots, x_N$. The conditional distribution of a node $x_i$ conditioned on all the other nodes in the graph $x_{j\ne i}$ is

\begin{align} p(x_i | x_{j\ne i}) &= \frac{ p(x_1,\ldots, x_N)}{\sum_{x_i} p(x_1,\ldots,x_N)} \\\\ &= \frac{ \prod_{n=1}^N p(x_n|\sP_{x_n})}{\sum_{x_i} \prod_{n=1}^N p(x_n|\sP_{x_n})} \end{align}

Here, we have again denoted the set of parents of node $x$ with $\sP_x$.

Now imagine terms in the denominator that do not contain $x_i$. All such terms can be taken outside the sum in the denominator and will cancel with the corresponding terms in the numerator.

What terms will remain? Those that contain $x_i$. And these terms involve the Markov blanket of the variable $x_i$ — its parents, children, and co-parents.