Information Theory/Different Entropy Measures of Two-Dimensional Random Variables

{{Header
|Untermenü=Mutual Information Between Two Discrete Random Variables
|Vorherige Seite=Einige Vorbemerkungen zu zweidimensionalen Zufallsgrößen
|Nächste Seite=Anwendung auf die Digitalsignalübertragung
}}

==Definition of entropy using supp(<i>P<sub>XY</sub></i>)==
<br>
We briefly summarise the results of the last chapter again, assuming the two-dimensional random variable&nbsp; $XY$&nbsp; with the probability mass function&nbsp; $P_{XY}(X,\ Y)$.&nbsp; At the same time we use the notation

:$${\rm supp} (P_{XY}) = \big \{ \hspace{0.05cm}(x,\ y) \in XY \hspace{0.05cm},
\hspace{0.3cm} {\rm where} \hspace{0.15cm} P_{XY}(X,\ Y) \ne 0 \hspace{0.05cm} \big \} \hspace{0.05cm}.$$

{{BlaueBox|TEXT=
$\text{Summarising the last chapter:}$&nbsp; With this subset&nbsp; $\text{supp}(P_{XY}) ⊂ P_{XY}$,&nbsp; the following holds for
*the&nbsp; '''joint entropy''':

:$$H(XY) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{XY}(X, Y)}\right ] =\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.05cm})}
  \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{XY}(x, y)} \hspace{0.05cm}.$$

*the&nbsp; '''entropies of the one-dimensional random variables'''&nbsp; $X$&nbsp; and&nbsp; $Y$:

:$$H(X) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{X}(X)}\right ] =\hspace{-0.2cm} \sum_{x \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{X})}
  \hspace{-0.2cm} P_{X}(x) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{X}(x)} \hspace{0.05cm},$$

:$$H(Y) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{Y}(Y)}\right ] =\hspace{-0.2cm} \sum_{y \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{Y})}
  \hspace{-0.2cm} P_{Y}(y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{Y}(y)} \hspace{0.05cm}.$$
}}
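These definitions translate directly into a few lines of code.&nbsp; The following is a minimal numerical sketch (an illustration only:&nbsp; storing the PMFs as Python dictionaries and the helper names&nbsp; <code>entropy</code>&nbsp; and&nbsp; <code>marginal</code>&nbsp; are assumptions of this sketch, not part of the article):

<pre>
from math import log2

def entropy(pmf):
    """Entropy in bit; the sum runs only over supp(pmf), i.e. entries with p > 0."""
    return sum(p * log2(1.0 / p) for p in pmf.values() if p > 0)

def marginal(joint, axis):
    """Marginal PMF of component 'axis' (0 or 1) of a two-dimensional joint PMF."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

# Dice experiment: red die R and blue die B, independent and fair.
P_RB = {(r, b): 1.0 / 36 for r in range(1, 7) for b in range(1, 7)}

print(entropy(P_RB))               # H(RB) = log2(36) ≈ 5.170 bit
print(entropy(marginal(P_RB, 0)))  # H(R)  = log2(6)  ≈ 2.585 bit
print(entropy(marginal(P_RB, 1)))  # H(B)  = log2(6)  ≈ 2.585 bit
</pre>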
  
  
{{GraueBox|TEXT=
$\text{Example 1:}$&nbsp; We refer again to the examples on the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|joint probability and joint entropy]]&nbsp; in the last chapter.

For the two-dimensional probability mass function&nbsp; $P_{RB}(R, B)$&nbsp; in&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|$\text{Example 5}$]]&nbsp; with the parameters
*$R$ &nbsp; &rArr; &nbsp;  points of the red die,
*$B$ &nbsp; &rArr; &nbsp;  points of the blue die,

the sets&nbsp; $P_{RB}$&nbsp; and&nbsp; $\text{supp}(P_{RB})$&nbsp; are identical.&nbsp; Here, all&nbsp; $6^2 = 36$&nbsp; squares are occupied by non-zero values.

For the two-dimensional probability mass function&nbsp; $P_{RS}(R, S)$&nbsp; in&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|$\text{Example 6}$]]&nbsp; with the parameters
*$R$ &nbsp; &rArr; &nbsp;  points of the red die,
*$S = R + B$ &nbsp; &rArr; &nbsp; sum of both dice,

there are&nbsp; $6 · 11 = 66$&nbsp; squares, many of which, however, are empty, i.e. stand for the probability&nbsp; "0".
*The subset&nbsp; $\text{supp}(P_{RS})$,&nbsp; on the other hand, contains only the&nbsp; $36$&nbsp; shaded squares with non-zero probabilities.
*The entropy remains the same no matter whether one averages over all elements of&nbsp; $P_{RS}$&nbsp; or only over the elements of&nbsp; $\text{supp}(P_{RS})$,&nbsp; since for&nbsp; $x → 0$&nbsp; the limit is&nbsp; $x · \log_2 ({1}/{x}) = 0$.}}
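A short sketch of this point (the construction of&nbsp; $P_{RS}$&nbsp; as a Python dictionary is our own choice):&nbsp; of the&nbsp; $66$&nbsp; cells only&nbsp; $36$&nbsp; are non-zero, and the entropy is the same whether the zero cells are skipped or treated via the limit&nbsp; $x · \log_2(1/x) → 0$.

<pre>
from math import log2

# Build P_RS on the full 6x11 grid, with zeros stored explicitly.
P_RS = {(r, s): 0.0 for r in range(1, 7) for s in range(2, 13)}
for r in range(1, 7):
    for b in range(1, 7):
        P_RS[(r, r + b)] += 1.0 / 36

supp = {key: p for key, p in P_RS.items() if p > 0}
print(len(P_RS), len(supp))        # 66 cells in total, 36 of them in supp(P_RS)

def plogp(p):
    """Contribution p * log2(1/p), with the limit value 0 for p = 0."""
    return p * log2(1.0 / p) if p > 0 else 0.0

print(sum(plogp(p) for p in P_RS.values()))   # averaging over all 66 cells
print(sum(plogp(p) for p in supp.values()))   # averaging over supp only: same value, log2(36)
</pre>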
  
==Conditional probability and conditional entropy ==
<br>
In the book&nbsp; "Theory of Stochastic Signals"&nbsp; the following&nbsp; [[Theory_of_Stochastic_Signals/Statistical_Dependence_and_Independence#Conditional_Probability|conditional probabilities]]&nbsp; were given for the case of two events&nbsp; $X$&nbsp; and&nbsp; $Y$ &nbsp; ⇒ &nbsp; '''Bayes' theorem''':

:$${\rm Pr} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y)  = \frac{{\rm Pr} (X \cap  Y)}{{\rm Pr} (Y)} \hspace{0.05cm}, \hspace{0.5cm}
{\rm Pr} (Y \hspace{-0.05cm}\mid \hspace{-0.05cm} X)  = \frac{{\rm Pr} (X \cap  Y)}{{\rm Pr} (X)} \hspace{0.05cm}.$$

Applied to probability mass functions, one thus obtains:

:$$P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y)  = \frac{P_{XY}(X, Y)}{P_{Y}(Y)} \hspace{0.05cm}, \hspace{0.5cm}
P_{\hspace{0.03cm}Y \mid \hspace{0.03cm} X} (Y \hspace{-0.05cm}\mid \hspace{-0.05cm} X)  =  \frac{P_{XY}(X, Y)}{P_{X}(X)} \hspace{0.05cm}.$$

Analogous to the&nbsp; [[Information_Theory/Verschiedene_Entropien_zweidimensionaler_Zufallsgrößen#Definition_of_entropy_using_supp.28PXY.29|joint entropy]]&nbsp; $H(XY)$,&nbsp; the following entropy functions can be derived here:

{{BlaueBox|TEXT=
$\text{Definitions:}$&nbsp;
*The&nbsp; '''conditional entropy'''&nbsp; of the random variable&nbsp; $X$&nbsp; under condition&nbsp; $Y$&nbsp; is:

:$$H(X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm}\frac{1}{P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y)}\right ] = \hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})}
  \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (x \hspace{-0.05cm}\mid \hspace{-0.05cm} y)}=\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})}
  \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{P_{Y}(y)}{P_{XY}(x, y)}
  \hspace{0.05cm}.$$

*Similarly, for the&nbsp; '''second conditional entropy'''&nbsp; we obtain:

:$$H(Y \hspace{-0.1cm}\mid \hspace{-0.05cm} X) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm}\frac{1}{P_{\hspace{0.03cm}Y\hspace{0.03cm} \mid \hspace{0.01cm} X} (Y \hspace{-0.08cm}\mid \hspace{-0.05cm}X)}\right ] =\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})}
  \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{\hspace{0.03cm}Y\hspace{-0.03cm} \mid \hspace{-0.01cm} X} (y \hspace{-0.05cm}\mid \hspace{-0.05cm} x)}=\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})}
  \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{P_{X}(x)}{P_{XY}(x, y)}
  \hspace{0.05cm}.$$}}

In the argument of the logarithm function there is always a conditional probability mass function &nbsp; ⇒ &nbsp; $P_{X\hspace{0.03cm}\vert \hspace{0.03cm}Y}(·)$&nbsp; or&nbsp; $P_{Y\hspace{0.03cm}\vert\hspace{0.03cm}X}(·)$,&nbsp; respectively, while the joint probability &nbsp; ⇒ &nbsp; $P_{XY}(·)$&nbsp; is needed to form the expected value.

For the conditional entropies, there are the following bounds:
*Both&nbsp; $H(X \vert Y)$&nbsp; and&nbsp; $H(Y \vert X)$&nbsp; are always greater than or equal to zero.&nbsp; From&nbsp; $H(X \vert Y) = 0$&nbsp; it follows directly that&nbsp; $H(Y \vert X) = 0$.&nbsp; <br>Both are only possible for&nbsp; [[Theory_of_Stochastic_Signals/Mengentheoretische_Grundlagen#Disjunkte_Mengen|"disjoint sets"]]&nbsp; $X$&nbsp; and&nbsp; $Y$.
*$H(X \vert Y) ≤ H(X)$&nbsp; and&nbsp; $H(Y \vert X) ≤ H(Y)$&nbsp; always apply.&nbsp; These statements are plausible if one realises that&nbsp; "uncertainty"&nbsp; can be used synonymously with&nbsp; "entropy":&nbsp; the uncertainty with respect to the quantity&nbsp; $X$&nbsp; cannot be increased by knowing&nbsp; $Y$.
*Except in the case of statistical independence &nbsp; ⇒ &nbsp; $H(X \vert Y) = H(X)$, &nbsp; $H(X \vert Y) < H(X)$&nbsp; always holds.&nbsp; Because of&nbsp; $H(X) ≤ H(XY)$&nbsp; and&nbsp; $H(Y) ≤ H(XY)$,&nbsp; also&nbsp; $H(X \vert Y) ≤ H(XY)$&nbsp; and&nbsp; $H(Y \vert X) ≤ H(XY)$&nbsp; hold.&nbsp; Thus, '''a conditional entropy can never become larger than the joint entropy'''.
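The two conditional entropies can be evaluated with the last form of the definitions, i.e. via&nbsp; $P_Y(y)/P_{XY}(x,y)$&nbsp; and&nbsp; $P_X(x)/P_{XY}(x,y)$.&nbsp; A minimal sketch (the helper names and the small test PMF are our own), which also checks the bounds&nbsp; $H(X \vert Y) ≤ H(X)$&nbsp; and&nbsp; $H(Y \vert X) ≤ H(Y)$&nbsp; numerically:

<pre>
from math import log2

def marginal(joint, axis):
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def entropy(pmf):
    return sum(p * log2(1.0 / p) for p in pmf.values() if p > 0)

def cond_entropy(joint, given_axis):
    """H(X|Y) for given_axis=1, H(Y|X) for given_axis=0, via P_XY*log2(P_given/P_XY) over supp."""
    marg = marginal(joint, given_axis)
    return sum(p * log2(marg[key[given_axis]] / p) for key, p in joint.items() if p > 0)

# A small dependent example: P_XY(x, y) with X, Y in {0, 1}.
P_XY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

H_X, H_Y = entropy(marginal(P_XY, 0)), entropy(marginal(P_XY, 1))
print(cond_entropy(P_XY, 1), "<=", H_X)   # H(X|Y) <= H(X)
print(cond_entropy(P_XY, 0), "<=", H_Y)   # H(Y|X) <= H(Y)
</pre>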
  
{{GraueBox|TEXT=
$\text{Example 2:}$&nbsp; We consider the joint probabilities&nbsp; $P_{RS}(·)$&nbsp; of our dice experiment, which were determined in the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Conditional_probability_and_conditional_entropy|last chapter]]&nbsp; as&nbsp; $\text{Example 6}$.&nbsp; The corresponding&nbsp; $P_{RS}(·)$&nbsp; is given again in the middle of the following graph.

[[File:P_ID2764__Inf_T_3_2_S3.png|right|frame|Joint probabilities&nbsp; $P_{RS}$&nbsp; and conditional probabilities&nbsp;  $P_{S \vert R}$&nbsp; and&nbsp; $P_{R \vert S}$]]

The two conditional probability functions are drawn on the outside:

$\text{On the left}$&nbsp; you see the conditional probability mass function
:$$P_{S \vert R}(⋅) = P_{SR}(⋅)/P_R(⋅).$$
*Because of&nbsp; $P_R(R) = \big [1/6, \ 1/6, \ 1/6, \ 1/6, \ 1/6, \ 1/6 \big ]$,&nbsp; the probability&nbsp; $1/6$&nbsp; is in all shaded fields.
*That means: &nbsp; $\text{supp}(P_{S\vert R}) = \text{supp}(P_{R\vert S})$.
*From this follows for the conditional entropy:

:$$H(S \hspace{-0.1cm}\mid \hspace{-0.13cm} R) = \hspace{-0.2cm} \sum_{(r, s) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{RS})}
  \hspace{-0.6cm} P_{RS}(r, s) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{\hspace{0.03cm}S \hspace{0.03cm} \mid \hspace{0.03cm} R} (s \hspace{-0.05cm}\mid \hspace{-0.05cm} r)} $$
:$$\Rightarrow \hspace{0.3cm}H(S \hspace{-0.1cm}\mid \hspace{-0.13cm} R) =
36 \cdot \frac{1}{36} \cdot {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit}
\hspace{0.05cm}.$$

$\text{On the right}$,&nbsp; $P_{R\vert S}(⋅) = P_{RS}(⋅)/P_S(⋅)$&nbsp; is given with&nbsp; $P_S(⋅)$&nbsp; according to&nbsp; $\text{Example 6}$.
*$\text{supp}(P_{R\vert S}) = \text{supp}(P_{S\vert R})$ &nbsp; ⇒ &nbsp; the same non-zero fields result.
*However, the probability values now increase continuously from the centre&nbsp; $(1/6)$&nbsp; towards the edges, up to&nbsp; $1$&nbsp; in the corners.
*It follows that:

:$$H(R \hspace{-0.1cm}\mid \hspace{-0.13cm} S)  = \frac{6}{36} \cdot {\rm log}_2 \hspace{0.1cm} (6) +
\frac{2}{36} \cdot  \sum_{i=1}^5 \big [ i \cdot {\rm log}_2 \hspace{0.1cm} (i) \big ]= 1.896\ {\rm bit} \hspace{0.05cm}.$$

On the other hand, for the conditional probabilities of the two-dimensional random variable&nbsp; $RB$&nbsp; according to&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|$\text{Example 5}$]],&nbsp; one obtains, because of&nbsp; $P_{RB}(⋅) = P_R(⋅) · P_B(⋅)$:

:$$\begin{align*}H(B \hspace{-0.1cm}\mid \hspace{-0.13cm} R)  \hspace{-0.15cm} & =  \hspace{-0.15cm} H(B) = {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit} \hspace{0.05cm},\\
H(R \hspace{-0.1cm}\mid \hspace{-0.13cm} B)  \hspace{-0.15cm} & = \hspace{-0.15cm} H(R) = {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit} \hspace{0.05cm}.\end{align*}$$}}
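The two numerical values of this example can be reproduced with the same kind of sketch as above (again a dictionary-based illustration; nothing here is prescribed by the article):

<pre>
from math import log2

def marginal(joint, axis):
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def cond_entropy(joint, given_axis):
    """Conditional entropy (in bit) of the other component given component 'given_axis'."""
    marg = marginal(joint, given_axis)
    return sum(p * log2(marg[key[given_axis]] / p) for key, p in joint.items() if p > 0)

# Joint PMF P_RS of the red die R and the sum S = R + B.
P_RS = {}
for r in range(1, 7):
    for b in range(1, 7):
        P_RS[(r, r + b)] = P_RS.get((r, r + b), 0.0) + 1.0 / 36

print(cond_entropy(P_RS, 0))   # H(S|R) = log2(6) ≈ 2.585 bit
print(cond_entropy(P_RS, 1))   # H(R|S) ≈ 1.896 bit
</pre>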
  
 
 
 
 
  
==Mutual information between two random variables==
<br>
We consider the two-dimensional random variable&nbsp; $XY$&nbsp; with PMF&nbsp; $P_{XY}(X, Y)$.&nbsp; Let the one-dimensional functions&nbsp; $P_X(X)$&nbsp; and&nbsp; $P_Y(Y)$&nbsp; also be known.

Now the following questions arise:
*How does the knowledge of the random variable&nbsp; $Y$&nbsp; reduce the uncertainty with respect to&nbsp; $X$?
*How does the knowledge of the random variable&nbsp; $X$&nbsp; reduce the uncertainty with respect to&nbsp; $Y$?

To answer these questions, we need a definition that is substantial for information theory:

{{BlaueBox|TEXT=
$\text{Definition:}$&nbsp; The&nbsp; '''mutual information'''&nbsp; between the random variables&nbsp; $X$&nbsp; and&nbsp; $Y$ – both over the same alphabet – is given as follows:

:$$I(X;\ Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.08cm} \frac{P_{XY}(X, Y)}
{P_{X}(X) \cdot P_{Y}(Y) }\right ] =\hspace{-0.25cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY})}
  \hspace{-0.8cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.08cm} \frac{P_{XY}(x, y)}
{P_{X}(x) \cdot P_{Y}(y) } \hspace{0.01cm}.$$

A comparison with the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Introductory_example_on_the_statistical_dependence_of_random_variables|last chapter]]&nbsp; shows that the mutual information can also be written as a&nbsp; [[Information_Theory/Some_Preliminary_Remarks_on_Two-Dimensional_Random_Variables#Informational_divergence_-_Kullback-Leibler_distance|Kullback–Leibler distance]]&nbsp; between the two-dimensional probability mass function&nbsp; $P_{XY}$&nbsp; and the product&nbsp; $P_X · P_Y$:

:$$I(X;Y) = D(P_{XY} \hspace{0.05cm}\vert \vert \hspace{0.05cm} P_X \cdot P_Y) \hspace{0.05cm}.$$

It is thus obvious that&nbsp; $I(X;\ Y) ≥ 0$&nbsp; always holds.&nbsp; Because of the symmetry, &nbsp; $I(Y;\ X) = I(X;\ Y)$&nbsp; is also true.}}
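The Kullback–Leibler form can be coded directly:&nbsp; sum&nbsp; $P_{XY} · \log_2\big(P_{XY}/(P_X · P_Y)\big)$&nbsp; over&nbsp; $\text{supp}(P_{XY})$.&nbsp; A small sketch with two extreme test cases of our own choosing&nbsp; $($fully dependent&nbsp; $Y = X$,&nbsp; and independent&nbsp; $X$, $Y)$:

<pre>
from math import log2

def marginal(joint, axis):
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def mutual_information(joint):
    """I(X;Y) in bit as the Kullback-Leibler distance D(P_XY || P_X * P_Y)."""
    P_X, P_Y = marginal(joint, 0), marginal(joint, 1)
    return sum(p * log2(p / (P_X[x] * P_Y[y]))
               for (x, y), p in joint.items() if p > 0)

# Fully dependent case Y = X: I(X;Y) = H(X) = 1 bit.
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))                       # 1.0

# Independent case: I(X;Y) = 0 bit.
print(mutual_information({(x, y): 0.25 for x in (0, 1) for y in (0, 1)}))   # 0.0
</pre>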
  
  
By splitting the&nbsp; $\log_2$&nbsp; argument according to

:$$I(X;Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm} \frac{1}
{P_{X}(X)  }\right ] - {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm} \frac
{P_{Y}(Y) }{P_{XY}(X, Y)} \right ] $$

one obtains, using&nbsp; $P_{X \vert Y}(\cdot) = P_{XY}(\cdot)/P_Y(Y)$:

:$$I(X;Y) = H(X) - H(X \hspace{-0.1cm}\mid \hspace{-0.1cm} Y) \hspace{0.05cm}.$$

*This means: &nbsp; The uncertainty regarding the random variable&nbsp; $X$ &nbsp; ⇒ &nbsp; entropy&nbsp; $H(X)$&nbsp; decreases by the amount&nbsp; $H(X \vert Y)$&nbsp; when&nbsp; $Y$&nbsp; is known.&nbsp; The remainder is the mutual information&nbsp; $I(X; Y)$.
*With a different splitting, one arrives at the result
:$$I(X;Y) = H(Y) - H(Y \hspace{-0.1cm}\mid \hspace{-0.1cm} X) \hspace{0.05cm}.$$
*Ergo: &nbsp; The mutual information&nbsp; $I(X; Y)$&nbsp; is symmetrical &nbsp; ⇒ &nbsp; $X$&nbsp; says just as much about&nbsp; $Y$&nbsp; as&nbsp; $Y$&nbsp; says about&nbsp; $X$ &nbsp; ⇒ &nbsp; "mutual".&nbsp; The semicolon indicates the equal standing of the two random variables.

{{BlaueBox|TEXT=
$\text{Conclusion:}$&nbsp;
Often the equations mentioned here are clarified by a diagram, as in the following examples.&nbsp; <br>From this you can see that the following equations also apply:

:$$I(X;\ Y) = H(X) + H(Y) - H(XY) \hspace{0.05cm},$$
:$$I(X;\ Y) = H(XY) -
H(X \hspace{-0.1cm}\mid \hspace{-0.1cm} Y) - H(Y \hspace{-0.1cm}\mid \hspace{-0.1cm} X)
\hspace{0.05cm}.$$}}
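These identities are easy to check numerically.&nbsp; The following sketch (the joint PMF is an arbitrary test case of our own) evaluates&nbsp; $I(X;Y)$&nbsp; in four different ways and confirms that all results coincide:

<pre>
from math import log2

P_XY = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}   # any joint PMF

def marginal(joint, axis):
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def entropy(pmf):
    return sum(p * log2(1.0 / p) for p in pmf.values() if p > 0)

P_X, P_Y = marginal(P_XY, 0), marginal(P_XY, 1)
H_X, H_Y, H_XY = entropy(P_X), entropy(P_Y), entropy(P_XY)
H_X_given_Y = sum(p * log2(P_Y[y] / p) for (x, y), p in P_XY.items() if p > 0)
H_Y_given_X = sum(p * log2(P_X[x] / p) for (x, y), p in P_XY.items() if p > 0)

print(H_X + H_Y - H_XY)                    # I(X;Y) from H(X) + H(Y) - H(XY)
print(H_X - H_X_given_Y)                   # I(X;Y) = H(X) - H(X|Y)
print(H_Y - H_Y_given_X)                   # I(X;Y) = H(Y) - H(Y|X)
print(H_XY - H_X_given_Y - H_Y_given_X)    # I(X;Y) from the last equation
</pre>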
  
{{GraueBox|TEXT=
$\text{Example 3:}$&nbsp; We return&nbsp; (for the last time)&nbsp; to the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Introductory_example_on_the_statistical_dependence_of_random_variables|dice experiment]]&nbsp; with the red&nbsp; $(R)$&nbsp; and the blue&nbsp; $(B)$&nbsp; die.&nbsp; The random variable&nbsp; $S$&nbsp; gives the sum of the two dice: &nbsp; $S = R + B$.&nbsp; Here we consider the two-dimensional random variable&nbsp; $RS$.

In earlier examples we calculated
*the entropies&nbsp; $H(R) = 2.585 \ \rm  bit$&nbsp; and&nbsp; $H(S) = 3.274 \ \rm bit$ &nbsp; ⇒  &nbsp;[[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|Example 6]]&nbsp; in the last chapter,
*the joint entropy&nbsp; $H(RS) = 5.170 \ \rm bit$ &nbsp; ⇒  &nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|Example 6]]&nbsp; in the last chapter,
*the conditional entropies&nbsp; $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R) = 2.585 \ \rm bit$&nbsp; and&nbsp; $H(R \hspace{0.05cm} \vert \hspace{0.05cm} S) = 1.896 \ \rm bit$ &nbsp; ⇒  &nbsp;  [[Information_Theory/Verschiedene_Entropien_zweidimensionaler_Zufallsgrößen#Conditional_probability_and_conditional_entropy|Example 2]]&nbsp; in the previous section.

[[File:P_ID2765__Inf_T_3_2_S3_neu.png|frame|Diagram of all entropies of the "dice experiment"]]

<br>These quantities are compiled in the graph, where the random variable&nbsp; $R$&nbsp; is marked by the basic colour&nbsp; "red"&nbsp; and the sum&nbsp; $S$&nbsp; by the basic colour&nbsp; "green".&nbsp; Conditional entropies are shaded.
One can see from this representation:
*The entropy&nbsp; $H(R) = \log_2 (6) = 2.585\ \rm bit$&nbsp; is exactly half as large as the joint entropy&nbsp; $H(RS)$.&nbsp; Because:&nbsp; If one knows&nbsp; $R$,&nbsp; then&nbsp; $S$&nbsp; provides exactly the same information as the random variable&nbsp; $B$,&nbsp; namely&nbsp; $H(S \hspace{0.05cm} \vert \hspace{0.05cm}  R) = H(B) = \log_2 (6) = 2.585\ \rm bit$.
*'''Note''': &nbsp; $H(R) = H(S \hspace{0.05cm} \vert \hspace{0.05cm}  R)$&nbsp; '''only applies in this example, not in general'''.
*As expected, here the entropy&nbsp; $H(S) = 3.274 \ \rm bit$&nbsp; is greater than&nbsp; $H(R)= 2.585\ \rm bit$.&nbsp; Because of&nbsp; $H(S) + H(R \hspace{0.05cm} \vert \hspace{0.05cm}  S) = H(R) + H(S \hspace{0.05cm} \vert \hspace{0.05cm}  R)$,&nbsp; $H(R \hspace{0.05cm} \vert \hspace{0.05cm}  S)$&nbsp; must be smaller than&nbsp; $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R)$&nbsp; by the same amount&nbsp; $I(R;\ S) = 0.689 \ \rm bit$&nbsp; by which&nbsp; $H(R)$&nbsp; is smaller than&nbsp; $H(S)$.
*The mutual information between the random variables&nbsp; $R$&nbsp; and&nbsp; $S$&nbsp; also results from the equation
:$$I(R;\ S) = H(R) + H(S) - H(RS) =  2.585\ {\rm bit} + 3.274\ {\rm bit} - 5.170\ {\rm bit} = 0.689\ {\rm bit} \hspace{0.05cm}. $$}}
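All quantities in the diagram can be reproduced with a few lines (a sketch under the same dictionary-PMF assumption as above):

<pre>
from math import log2

def entropy(pmf):
    return sum(p * log2(1.0 / p) for p in pmf.values() if p > 0)

def marginal(joint, axis):
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

# Joint PMF of R (red die) and S = R + B.
P_RS = {}
for r in range(1, 7):
    for b in range(1, 7):
        P_RS[(r, r + b)] = P_RS.get((r, r + b), 0.0) + 1.0 / 36

H_R  = entropy(marginal(P_RS, 0))   # 2.585 bit
H_S  = entropy(marginal(P_RS, 1))   # 3.274 bit
H_RS = entropy(P_RS)                # 5.170 bit
I_RS = H_R + H_S - H_RS             # 0.689 bit
print(H_R, H_S, H_RS, I_RS)
print(H_RS - H_R, H_RS - H_S)       # H(S|R) = 2.585 bit, H(R|S) = 1.896 bit
</pre>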
  
==Conditional mutual information  ==
<br>
We now consider three random variables&nbsp; $X$,&nbsp; $Y$&nbsp; and&nbsp; $Z$&nbsp; that can be related to each other.

{{BlaueBox|TEXT=
$\text{Definition:}$&nbsp; The&nbsp; '''conditional mutual information'''&nbsp; between the random variables&nbsp; $X$&nbsp; and&nbsp; $Y$&nbsp; '''for a given'''&nbsp; $Z = z$&nbsp; is as follows:

:$$I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z = z) =  H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z = z) - H(X\vert\hspace{0.05cm}Y ,\hspace{0.05cm} Z = z) \hspace{0.05cm}.$$

In contrast, the&nbsp; '''conditional mutual information'''&nbsp; between the random variables&nbsp; $X$&nbsp; and&nbsp; $Y$&nbsp; '''for the random variable'''&nbsp; $Z$&nbsp; is obtained <br>after averaging over all&nbsp; $z \in Z$:

:$$I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z ) =  H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z ) - H(X\vert\hspace{0.05cm}Y  Z )= \hspace{-0.3cm}
\sum_{z \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{Z})} \hspace{-0.25cm} P_{Z}(z) \cdot
I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z = z)
\hspace{0.05cm}.$$

$P_Z(Z)$&nbsp; is the probability mass function&nbsp; $\rm  (PMF)$&nbsp; of the random variable&nbsp; $Z$&nbsp; and&nbsp; $P_Z(z)$&nbsp; is the&nbsp; '''probability'''&nbsp; of the realisation&nbsp; $Z = z$.}}


{{BlaueBox|TEXT=
$\text{Please note:}$&nbsp;
*For the conditional entropy, as is well known, the relation &nbsp; $H(X\hspace{0.05cm}\vert\hspace{0.05cm}Z) ≤ H(X)$&nbsp; holds.
*For the mutual information, this relation does not necessarily hold: <br> &nbsp; &nbsp; $I(X; Y\hspace{0.05cm}\vert\hspace{0.05cm}Z)$&nbsp; can be&nbsp; '''smaller than, equal to, or even larger than'''&nbsp; $I(X; Y)$.}}
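The averaging definition can be implemented directly from a three-dimensional PMF&nbsp; $P_{XYZ}$:&nbsp; for every&nbsp; $z$&nbsp; in&nbsp; $\text{supp}(P_Z)$,&nbsp; compute&nbsp; $I(X;Y \vert Z=z)$&nbsp; from the conditional joint PMF and weight it with&nbsp; $P_Z(z)$.&nbsp; A minimal sketch (all names and the check case are our own):

<pre>
from math import log2

def marginal(joint, axes):
    """Marginal PMF over the given tuple of axes of a multi-dimensional PMF."""
    out = {}
    for key, p in joint.items():
        k = tuple(key[a] for a in axes)
        out[k] = out.get(k, 0.0) + p
    return out

def mutual_information(joint):
    """I between the two components of a 2D PMF {(x, y): p}."""
    P_X, P_Y = marginal(joint, (0,)), marginal(joint, (1,))
    return sum(p * log2(p / (P_X[(x,)] * P_Y[(y,)]))
               for (x, y), p in joint.items() if p > 0)

def cond_mutual_information(P_XYZ):
    """I(X;Y|Z) = sum over supp(P_Z) of P_Z(z) * I(X;Y|Z=z) for a PMF {(x, y, z): p}."""
    P_Z = marginal(P_XYZ, (2,))
    result = 0.0
    for (z,), pz in P_Z.items():
        if pz == 0:
            continue
        # Conditional joint PMF of (X, Y) given Z = z.
        P_XY_given_z = {(x, y): p / pz for (x, y, zz), p in P_XYZ.items() if zz == z}
        result += pz * mutual_information(P_XY_given_z)
    return result

# Check: with Y = X and Z an independent fair coin, I(X;Y|Z) = H(X) = 1 bit.
P_XYZ = {(x, x, z): 0.25 for x in (0, 1) for z in (0, 1)}
print(cond_mutual_information(P_XYZ))   # 1.0
</pre>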
 
  
[[File:P_ID2824__Inf_T_3_2_S4a.png|right|frame|2D PMF&nbsp; $P_{XZ}$ ]]
{{GraueBox|TEXT=
$\text{Example 4:}$&nbsp;
We consider the binary random variables&nbsp; $X$,&nbsp; $Y$&nbsp; and&nbsp; $Z$&nbsp; with the following properties:
* Let&nbsp; $X$&nbsp; and&nbsp; $Y$&nbsp; be statistically independent, with the probability mass functions
:$$P_X(X) = \big [1/2, \ 1/2 \big],  \hspace{0.2cm} P_Y(Y) = \big[1-p, \ p \big] \  ⇒  \  H(X) = 1\ {\rm bit},  \hspace{0.2cm}  H(Y) = H_{\rm bin}(p).$$
* $Z$&nbsp; is the modulo-2 sum of&nbsp; $X$&nbsp; and&nbsp; $Y$: &nbsp;  $Z = X ⊕ Y$.

From the joint probability mass function&nbsp; $P_{XZ}$&nbsp; according to the upper graph, it follows:
*Summing the column probabilities gives&nbsp; <br> &nbsp; &nbsp; $P_Z(Z) = \big [1/2, \  1/2 \big ]$ &nbsp;  ⇒ &nbsp; $H(Z) = 1\ {\rm bit}.$
* $X$&nbsp; and&nbsp; $Z$&nbsp; are also statistically independent, since for the 2D PMF&nbsp; <br> &nbsp; &nbsp; $P_{XZ}(X, Z) = P_X(X) · P_Z(Z)$&nbsp; holds.
[[File:P_ID2826__Inf_T_3_2_S4b.png|right|frame|Conditional  2D PMF $P_{X\hspace{0.05cm}\vert\hspace{0.05cm}YZ}$]]
*It follows that: <br> &nbsp; &nbsp; $H(Z\hspace{0.05cm}\vert\hspace{0.05cm}  X) = H(Z),\hspace{0.5cm}H(X \hspace{0.05cm}\vert\hspace{0.05cm}  Z) = H(X),\hspace{0.5cm} I(X; Z) = 0.$
<br>From the conditional probability mass function&nbsp; $P_{X\vert YZ}$&nbsp; according to the graph below, we can calculate:
* $H(X\hspace{0.05cm}\vert\hspace{0.05cm} YZ) = 0$,&nbsp; since all&nbsp; $P_{X\hspace{0.05cm}\vert\hspace{0.05cm} YZ}$&nbsp; entries are&nbsp; $0$&nbsp; or&nbsp; $1$  &nbsp;  ⇒ &nbsp;  "conditional entropy",
* $I(X; YZ) = H(X) - H(X\hspace{0.05cm}\vert\hspace{0.05cm} YZ) = H(X)= 1 \ {\rm bit}$ &nbsp;  ⇒ &nbsp;   "mutual information",
* $I(X; Y\vert Z) = H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z) =H(X)=1 \ {\rm bit} $ &nbsp;  ⇒ &nbsp;  "conditional mutual information".

In the present example:
'''The conditional mutual information'''&nbsp; $I(X; Y\hspace{0.05cm}\vert\hspace{0.05cm} Z) = 1\ {\rm bit}$&nbsp; '''is greater than the conventional mutual information''' &nbsp;$I(X; Y) = 0$. }}
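The statements of this example can be verified numerically;&nbsp; the following sketch uses the concrete choice&nbsp; $p = 1/2$&nbsp; (our assumption for this illustration) and computes every mutual information as&nbsp; $H(A) + H(B) - H(AB)$:

<pre>
from math import log2

# Example 4 with the concrete choice p = 1/2 (assumption of this sketch).
p = 0.5
P_XYZ = {}
for x, px in ((0, 0.5), (1, 0.5)):          # X uniform
    for y, py in ((0, 1 - p), (1, p)):      # Y independent of X
        P_XYZ[(x, y, x ^ y)] = px * py      # Z = X xor Y

def marginal(joint, axes):
    out = {}
    for key, prob in joint.items():
        k = tuple(key[a] for a in axes)
        out[k] = out.get(k, 0.0) + prob
    return out

def entropy(pmf):
    return sum(q * log2(1.0 / q) for q in pmf.values() if q > 0)

def mi(joint, axes_a, axes_b):
    """I(A;B) between the component groups axes_a and axes_b of a multi-dimensional PMF."""
    H = lambda axes: entropy(marginal(joint, axes))
    return H(axes_a) + H(axes_b) - H(axes_a + axes_b)

print(mi(P_XYZ, (0,), (1,)))                              # I(X;Y)  = 0
print(mi(P_XYZ, (0,), (2,)))                              # I(X;Z)  = 0
print(mi(P_XYZ, (0,), (1, 2)))                            # I(X;YZ) = 1 bit
print(mi(P_XYZ, (0,), (1, 2)) - mi(P_XYZ, (0,), (2,)))    # I(X;Y|Z) = 1 bit > I(X;Y)
</pre>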
 
  
 
 
 
 
 
 
 
 
==Chain rule of the mutual information ==
<br>
So far we have only considered the mutual information between two one-dimensional random variables.&nbsp; Now we extend the definition to a total of&nbsp; $n + 1$&nbsp; random variables, which, purely for reasons of notation, we denote by&nbsp; $X_1$,&nbsp; ... ,&nbsp; $X_n$&nbsp; and&nbsp; $Z$.&nbsp; Then the following applies:

{{BlaueBox|TEXT=
$\text{Chain rule of mutual information:}$&nbsp;

The mutual information between the&nbsp; $n$–dimensional random variable&nbsp; $X_1 X_2  \hspace{0.05cm}\text{...} \hspace{0.05cm}  X_n$&nbsp; and the random variable&nbsp; $Z$&nbsp; can be represented and calculated as follows:

:$$I(X_1\hspace{0.05cm}X_2\hspace{0.05cm}\text{...} \hspace{0.1cm}X_n;Z) =
I(X_1;Z) + I(X_2;Z \vert X_1) + \hspace{0.05cm}\text{...} \hspace{0.1cm}+
I(X_n;Z\vert X_1\hspace{0.05cm}X_2\hspace{0.05cm}\text{...} \hspace{0.1cm}X_{n-1}) = \sum_{i = 1}^{n}
I(X_i;Z \vert X_1\hspace{0.05cm}X_2\hspace{0.05cm}\text{...} \hspace{0.1cm}X_{i-1})
\hspace{0.05cm}.$$

$\text{Proof:}$&nbsp;
We restrict ourselves here to the case&nbsp; $n = 2$, i.e. to a total of three random variables, and replace&nbsp; $X_1$&nbsp; by&nbsp; $X$&nbsp; and&nbsp; $X_2$&nbsp; by&nbsp; $Y$.&nbsp; Then we obtain:

:$$\begin{align*}I(X\hspace{0.05cm}Y;Z)  & = H(XY) - H(XY\hspace{0.05cm} \vert \hspace{0.05cm}Z) = \\
& =  \big [  H(X)+ H(Y\hspace{0.05cm} \vert \hspace{0.05cm} X)\big ]  - \big [ H(X\hspace{0.05cm} \vert \hspace{0.05cm} Z) + H(Y\hspace{0.05cm} \vert \hspace{0.05cm} XZ)\big ]  =\\
& =   \big [ H(X)- H(X\hspace{0.05cm} \vert \hspace{0.05cm} Z)\big ]  + \big [  H(Y\hspace{0.05cm} \vert \hspace{0.05cm} X) - H(Y\hspace{0.05cm} \vert \hspace{0.05cm}XZ)\big ]=\\
& =  I(X;Z) + I(Y;Z \hspace{0.05cm} \vert \hspace{0.05cm} X) \hspace{0.05cm}.\end{align*}$$}}


*From this equation one can see that the relation &nbsp;$I(X Y; Z) ≥ I(X; Z)$&nbsp; always holds.
*Equality results for the conditional mutual information&nbsp; $I(Y; Z \hspace{0.05cm} \vert  \hspace{0.05cm} X) = 0$,&nbsp; i.e. when the random variables&nbsp; $Y$&nbsp; and&nbsp; $Z$&nbsp; are statistically independent for a given&nbsp; $X$.
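The chain rule for&nbsp; $n = 2$&nbsp; can be checked numerically on any three-dimensional PMF.&nbsp; A sketch (the random test PMF and all helper names are our own):

<pre>
from math import log2
import random

def marginal(joint, axes):
    out = {}
    for key, prob in joint.items():
        k = tuple(key[a] for a in axes)
        out[k] = out.get(k, 0.0) + prob
    return out

def entropy(pmf):
    return sum(q * log2(1.0 / q) for q in pmf.values() if q > 0)

def mi(joint, a, b):
    """I(A;B) = H(A) + H(B) - H(AB) for component groups a and b."""
    return entropy(marginal(joint, a)) + entropy(marginal(joint, b)) - entropy(marginal(joint, a + b))

def cmi(joint, a, b, c):
    """I(A;B|C) = H(AC) + H(BC) - H(ABC) - H(C)."""
    return (entropy(marginal(joint, a + c)) + entropy(marginal(joint, b + c))
            - entropy(marginal(joint, a + b + c)) - entropy(marginal(joint, c)))

# Random joint PMF of (X1, X2, Z), each component binary.
random.seed(1)
keys = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
weights = [random.random() for _ in keys]
total = sum(weights)
P = {key: w / total for key, w in zip(keys, weights)}

lhs = mi(P, (0, 1), (2,))                            # I(X1 X2; Z)
rhs = mi(P, (0,), (2,)) + cmi(P, (1,), (2,), (0,))   # I(X1;Z) + I(X2;Z|X1)
print(lhs, rhs)                                      # identical up to rounding
</pre>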
  
  
{{GraueBox|TEXT=
$\text{Example 5:}$&nbsp;  We consider the&nbsp; [[Theory_of_Stochastic_Signals/Markovketten|Markov chain]] &nbsp; $X → Y → Z$.&nbsp; For such a constellation, the&nbsp; "Data Processing Theorem"&nbsp; always holds, with the following consequence, which can be derived from the chain rule of mutual information:

:$$I(X;Z) \hspace{-0.05cm} \le  \hspace{-0.05cm}I(X;Y ) \hspace{0.05cm},$$
:$$I(X;Z) \hspace{-0.05cm}  \le \hspace{-0.05cm} I(Y;Z ) \hspace{0.05cm}.$$

The theorem thus states:
*One cannot gain any additional information about the input&nbsp; $X$&nbsp; by processing the data&nbsp; $Y$&nbsp; into&nbsp; $Z$ &nbsp; ⇒ &nbsp; $Y → Z$.
*Data processing&nbsp; $Y → Z$&nbsp; $($by a second processor$)$&nbsp; only serves the purpose of making the information about&nbsp; $X$&nbsp; more visible.

For more information on the&nbsp; "Data Processing Theorem"&nbsp; see&nbsp; [[Aufgaben:Aufgabe_3.15:_Data_Processing_Theorem|Exercise 3.15]].}}
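The Data Processing Theorem can also be checked numerically for a simple Markov chain&nbsp; $X → Y → Z$,&nbsp; e.g. with&nbsp; $Y$&nbsp; a noisy copy of&nbsp; $X$&nbsp; and&nbsp; $Z$&nbsp; a noisy copy of&nbsp; $Y$;&nbsp; the two error probabilities below are our own test values:

<pre>
from math import log2

def marginal(joint, axes):
    out = {}
    for key, prob in joint.items():
        k = tuple(key[a] for a in axes)
        out[k] = out.get(k, 0.0) + prob
    return out

def entropy(pmf):
    return sum(q * log2(1.0 / q) for q in pmf.values() if q > 0)

def mi(joint, a, b):
    return entropy(marginal(joint, a)) + entropy(marginal(joint, b)) - entropy(marginal(joint, a + b))

# Markov chain X -> Y -> Z: Y flips X with probability e1, Z flips Y with probability e2.
e1, e2 = 0.1, 0.2
P = {}
for x in (0, 1):
    for y in (0, 1):
        for z in (0, 1):
            p_y_given_x = 1 - e1 if y == x else e1
            p_z_given_y = 1 - e2 if z == y else e2
            P[(x, y, z)] = 0.5 * p_y_given_x * p_z_given_y

print(mi(P, (0,), (2,)), "<=", mi(P, (0,), (1,)))   # I(X;Z) <= I(X;Y)
print(mi(P, (0,), (2,)), "<=", mi(P, (1,), (2,)))   # I(X;Z) <= I(Y;Z)
</pre>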
  
==Exercises for the chapter==
<br>
[[Aufgaben:Exercise_3.7:_Some_Entropy_Calculations|Exercise 3.7: Some Entropy Calculations]]

[[Aufgaben:Exercise_3.8:_Once_more_Mutual_Information|Exercise 3.8: Once more Mutual Information]]

[[Aufgaben:Exercise_3.8Z:_Tuples_from_Ternary_Random_Variables|Exercise 3.8Z: Tuples from Ternary Random Variables]]

[[Aufgaben:Exercise_3.9:_Conditional_Mutual_Information|Exercise 3.9: Conditional Mutual Information]]

{{Display}}
Revision as of 14:04, 21 July 2021


Definition of entropy using supp(PXY)


We briefly summarise the results of the last chapter again, assuming the two-dimensional random variable  $XY$  with the probability mass function  $P_{XY}(X,\ Y)$ .  At the same time we use the notation

$${\rm supp} (P_{XY}) = \big \{ \hspace{0.05cm}(x,\ y) \in XY \hspace{0.05cm}, \hspace{0.3cm} {\rm where} \hspace{0.15cm} P_{XY}(X,\ Y) \ne 0 \hspace{0.05cm} \big \} \hspace{0.05cm};$$

$\text{Summarising the last chapter:}$  With this subset  $\text{supp}(P_{XY}) ⊂ P_{XY}$,  the following holds for

  • the  joint entropy :
$$H(XY) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{XY}(X, Y)}\right ] =\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.05cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{XY}(x, y)} \hspace{0.05cm}.$$
  • the  entropies of the one-dimensional random variables  $X$  and  $Y$:
$$H(X) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{X}(X)}\right ] =\hspace{-0.2cm} \sum_{x \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{X})} \hspace{-0.2cm} P_{X}(x) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{X}(x)} \hspace{0.05cm},$$
$$H(Y) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{Y}(Y)}\right ] =\hspace{-0.2cm} \sum_{y \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{Y})} \hspace{-0.2cm} P_{Y}(y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{Y}(y)} \hspace{0.05cm}.$$


$\text{Example 1:}$  We refer again to the examples on the  joint probability and joint entropy  in the last chapter. 

For the two-dimensional probability mass function  $P_{RB}(R, B)$  in  $\text{Example 5}$  with the parameters

  • $R$   ⇒   points of the red cube,
  • $B$   ⇒   points of the blue cube,


the sets  $P_{RB}$  and  $\text{supp}(P_{RB})$  are identical.  Here, all  $6^2 = 36$  squares are occupied by non-zero values.

For the two-dimensional probability mass function  $P_{RS}(R, S)$  in  $\text{Example 6}$  mit den Parametern

  • $R$   ⇒   points of the red cube,
  • $S = R + B$   ⇒   sum of both cubes,


there are  $6 · 11 = 66$ squares, many of which, however, are empty, i.e. stand for the probability  "0" .

  • The subset  $\text{supp}(P_{RS})$ , on the other hand, contains only the  $36$  shaded squares with non-zero probabilities.
  • The entropy remains the same no matter whether one averages over all elements of  $P_{RS}$  or only over the elements of   $\text{supp}(P_{RS})$  since for  $x → 0$  the limit is  $x · \log_2 ({1}/{x}) = 0$.


Conditional probability and conditional entropy


In the book  "Theory of Stochastic Signals"  the following   conditional probabilities  were given for the case of two events  $X$  and  $Y$  ⇒   Bayes' theorem:

$${\rm Pr} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y) = \frac{{\rm Pr} (X \cap Y)}{{\rm Pr} (Y)} \hspace{0.05cm}, \hspace{0.5cm} {\rm Pr} (Y \hspace{-0.05cm}\mid \hspace{-0.05cm} X) = \frac{{\rm Pr} (X \cap Y)}{{\rm Pr} (X)} \hspace{0.05cm}.$$

Applied to probability mass functions, one thus obtains:

$$P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y) = \frac{P_{XY}(X, Y)}{P_{Y}(Y)} \hspace{0.05cm}, \hspace{0.5cm} P_{\hspace{0.03cm}Y \mid \hspace{0.03cm} X} (Y \hspace{-0.05cm}\mid \hspace{-0.05cm} X) = \frac{P_{XY}(X, Y)}{P_{X}(X)} \hspace{0.05cm}.$$

Analogous to the  joint entropy  $H(XY)$ , the following entropy functions can be derived here:

$\text{Definitions:}$ 

  • The  conditional entropy of the random variable  $X$  under condition  $Y$  is:
$$H(X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm}\frac{1}{P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y)}\right ] = \hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (x \hspace{-0.05cm}\mid \hspace{-0.05cm} y)}=\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{P_{Y}(y)}{P_{XY}(x, y)} \hspace{0.05cm}.$$
  • Similarly, for the  second conditional entropy we obtain:
$$H(Y \hspace{-0.1cm}\mid \hspace{-0.05cm} X) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm}\frac{1}{P_{\hspace{0.03cm}Y\hspace{0.03cm} \mid \hspace{0.01cm} X} (Y \hspace{-0.08cm}\mid \hspace{-0.05cm}X)}\right ] =\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{\hspace{0.03cm}Y\hspace{-0.03cm} \mid \hspace{-0.01cm} X} (y \hspace{-0.05cm}\mid \hspace{-0.05cm} x)}=\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{P_{X}(x)}{P_{XY}(x, y)} \hspace{0.05cm}.$$


In the argument of the logarithm function there is always a conditional probability function   ⇒   $P_{X\hspace{0.03cm}| \hspace{0.03cm}Y}(·)$  or  $P_{Y\hspace{0.03cm}|\hspace{0.03cm}X}(·)$  resp.,  while the joint probability   ⇒   $P_{XY}(·)$ is needed for the expectation value formation.

For the conditional entropies, there are the following limitations:

  • Both  $H(X|Y)$  and  $H(Y|X)$  are always greater than or equal to zero.  From  $H(X|Y) = 0$  it follows directly  $H(Y|X) = 0$. 
    Both are only possible for   "disjoint sets"  $X$  and  $Y$.
  • $H(X|Y) ≤ H(X)$  and  $H(Y|X) ≤ H(Y)$ always apply.  These statements are plausible if one realises that one can also use  "uncertainty"  synonymously for  "entropy".  For:   The uncertainty with respect to the quantity  $X$  cannot be increased by knowing  $Y$. 
  • Except in the case of statistical independence   ⇒   $H(X|Y) = H(X)$ ,   $H(X|Y) < H(X)$ always holds.  Because of  $H(X) ≤ H(XY)$  and  $H(Y) ≤ H(XY)$ ,  therefore also  $H(X|Y) ≤ H(XY)$  and  $H(Y|X) ≤ H(XY)$  hold.  Thus, a conditional entropy can never become larger than the joint entropy.


$\text{Example 2:}$  We consider the joint probabilities  $P_{RS}(·)$  of our dice experiment, which were determined in the  last chapter  as  $\text{Example 6}$.  The corresponding  $P_{RS}(·)$  is given again in the middle of the following graph.

Joint probabilities  $P_{RS}$  and conditional probabilities  $P_{S \vert R}$  and  $P_{R \vert S}$

The two conditional probability functions are drawn on the outside:

$\text{On the left}$  you see the conditional probability mass function 

$$P_{S \vert R}(⋅) = P_{SR}(⋅)/P_R(⋅).$$
  • Because of  $P_R(R) = \big [1/6, \ 1/6, \ 1/6, \ 1/6, \ 1/6, \ 1/6 \big ]$  the probability  $1/6$  is in all shaded fields  
  • That means:   $\text{supp}(P_{S\vert R}) = \text{supp}(P_{R\vert S})$  . 
  • From this follows for the conditional entropy:
$$H(S \hspace{-0.1cm}\mid \hspace{-0.13cm} R) = \hspace{-0.2cm} \sum_{(r, s) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{RS})} \hspace{-0.6cm} P_{RS}(r, s) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{\hspace{0.03cm}S \hspace{0.03cm} \mid \hspace{0.03cm} R} (s \hspace{-0.05cm}\mid \hspace{-0.05cm} r)} $$
$$\Rightarrow \hspace{0.3cm}H(S \hspace{-0.1cm}\mid \hspace{-0.13cm} R) = 36 \cdot \frac{1}{36} \cdot {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit} \hspace{0.05cm}.$$

$\text{On the right}$,  $P_{R\vert S}(⋅) = P_{RS}(⋅)/P_S(⋅)$  is given with  $P_S(⋅)$  according to  $\text{Example 6}$. 

  • $\text{supp}(P_{R\vert S}) = \text{supp}(P_{S\vert R})$   ⇒  same non-zero fields result.
  • However, the probability values now increase continuously from the centre  $(1/6)$  towards the edges up to  $1$  in the corners.
  • It follows that:
$$H(R \hspace{-0.1cm}\mid \hspace{-0.13cm} S) = \frac{1}{36} \cdot {\rm log}_2 \hspace{0.1cm} (6) + \frac{2}{36} \cdot \sum_{i=1}^5 \big [ i \cdot {\rm log}_2 \hspace{0.1cm} (i) \big ]= 1.896\ {\rm bit} \hspace{0.05cm}.$$

On the other hand, for the conditional probabilities of the 2D random variable  $RB$  according to  $\text{Example 5}$,  one obtains because of  $P_{RB}(⋅) = P_R(⋅) · P_B(⋅)$:

$$\begin{align*}H(B \hspace{-0.1cm}\mid \hspace{-0.13cm} R) \hspace{-0.15cm} & = \hspace{-0.15cm} H(B) = {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit} \hspace{0.05cm},\\ H(R \hspace{-0.1cm}\mid \hspace{-0.13cm} B) \hspace{-0.15cm} & = \hspace{-0.15cm} H(R) = {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit} \hspace{0.05cm}.\end{align*}$$


Mutual information between two random variables


We consider the two-dimensional random variable  $XY$  with PMF  $P_{XY}(X, Y)$. Let the one-dimensional functions  $P_X(X)$  and  $P_Y(Y)$ also be known.

Now the following questions arise:

  • How does the knowledge of the random variable  $Y$  reduce the uncertainty with respect to  $X$?
  • How does the knowledge of the random variable  $X$  reduce the uncertainty with respect to  $Y$?


To answer this question, we need a definition that is substantial for information theory:

$\text{Definition:}$  The  mutual information between the random variables  $X$  and  $Y$ – both over the same alphabet – is given as follows:

$$I(X;\ Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.08cm} \frac{P_{XY}(X, Y)} {P_{X}(X) \cdot P_{Y}(Y) }\right ] =\hspace{-0.25cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY})} \hspace{-0.8cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.08cm} \frac{P_{XY}(x, y)} {P_{X}(x) \cdot P_{Y}(y) } \hspace{0.01cm}.$$

A comparison with the  last chapter  shows that the mutual information can also be written as a  Kullback–Leibler distance  between the two-dimensional probability mass function  $P_{XY}$  and the product  $P_X · P_Y$  :

$$I(X;Y) = D(P_{XY} \hspace{0.05cm}\vert \vert \hspace{0.05cm} P_X \cdot P_Y) \hspace{0.05cm}.$$

It is thus obvious that  $I(X;\ Y) ≥ 0$  always holds.  Because of the symmetry,   $I(Y;\ X)$ = $I(X;\ Y)$ is also true.


By splitting the  $\log_2$ argument according to

$$I(X;Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm} \frac{1} {P_{X}(X) }\right ] - {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm} \frac {P_{Y}(Y) }{P_{XY}(X, Y)} \right ] $$

is obtained using  $P_{X|Y}(\cdot) = P_{XY}(\cdot)/P_Y(Y)$:

$$I(X;Y) = H(X) - H(X \hspace{-0.1cm}\mid \hspace{-0.1cm} Y) \hspace{0.05cm}.$$
  • This means:   The uncertainty regarding the random quantity  $X$   ⇒   entropy  $H(X)$  decreases by the amount  $H(X|Y)$  when  $Y$ is known.  The remainder is the mutual information  $I(X; Y)$.
  • With a different splitting, one arrives at the result
$$I(X;Y) = H(Y) - H(Y \hspace{-0.1cm}\mid \hspace{-0.1cm} X) \hspace{0.05cm}.$$
  • Ergo:   The mutual information  $I(X; Y)$  is symmetrical   ⇒   $X$  says just as much about  $Y$  as  $Y$  says about  $X$   ⇒   "mutual".  The semicolon indicates equality.


$\text{Conclusion:}$  Often the equations mentioned here are clarified by a diagram, as in the following examples. 
From this you can see that the following equations also apply:

$$I(X;\ Y) = H(X) + H(Y) - H(XY) \hspace{0.05cm},$$
$$I(X;\ Y) = H(XY) - H(X \hspace{-0.1cm}\mid \hspace{-0.1cm} Y) - H(Y \hspace{-0.1cm}\mid \hspace{-0.1cm} X) \hspace{0.05cm}.$$


$\text{Example 3:}$  We return  (for the last time)  to the  dice experiment  with the red  $(R)$  and blue  $(B)$  cube.  The random variable  $S$  gives the sum of the two dice:  $S = R + B$.  Here we consider the 2D random variable  $RS$. 

In earlier examples we calculated

  • the entropies  $H(R) = 2.585 \ \rm bit$  and  $H(S) = 3.274 \ \rm bit$   ⇒  Example 6  in the last chapter,
  • the join entropies  $H(RS) = 5.170 \ \rm bit$   ⇒   Example 6  in the last chapter,
  • the conditional entropies  $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R) = 2.585 \ \rm bit$  and  $H(R \hspace{0.05cm} \vert \hspace{0.05cm} S) = 1.896 \ \rm bit$   ⇒   Example 2  in the previous section.
Diagram of all entropies of the „dice experiment”


These quantities are compiled in the graph, with the random quantity  $R$  marked by the basic colour „red” and the sum  $S$  marked by the basic colour „green” .  Conditional entropies are shaded.  One can see from this representation:

  • The entropy  $H(R) = \log_2 (6) = 2.585\ \rm bit$  is exactly half as large as the joint entropy  $H(RS)$.  Because:  If one knows  $R$,  then  $S$  provides exactly the same information as the random quantity  $B$, namely  $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R) = H(B) = \log_2 (6) = 2.585\ \rm bit$. 
  • Note:   $H(R)$ = $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R)$  only applies in this example, not in general.
  • As expected, the entropy  $H(S) = 3.274 \ \rm bit$  is greater than  $H(R)= 2.585\ \rm bit$  here.  Because of  $H(S) + H(R \hspace{0.05cm} \vert \hspace{0.05cm} S) = H(R) + H(S \hspace{0.05cm} \vert \hspace{0.05cm} R)$,  $H(R \hspace{0.05cm} \vert \hspace{0.05cm} S)$  must be smaller than  $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R)$  by exactly the amount by which  $H(R)$  is smaller than  $H(S)$.  Since  $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R) = H(R)$  holds here, this difference equals the mutual information  $I(R;\ S) = 0.689 \ \rm bit$.
  • The mutual information between the random variables  $R$  and  $S$  also results from the equation
$$I(R;\ S) = H(R) + H(S) - H(RS) = 2.585\ {\rm bit} + 3.274\ {\rm bit} - 5.170\ {\rm bit} = 0.689\ {\rm bit} \hspace{0.05cm}. $$
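
All numerical values of this example can be reproduced with the following sketch  (Python/NumPy assumed;  the code is our own illustration, not part of the original text):

<pre>
import numpy as np
from collections import Counter

# joint PMF P_RS of the red die R and the sum S = R + B  (two fair dice)
P_rs = Counter()
for r in range(1, 7):
    for b in range(1, 7):
        P_rs[(r, r + b)] += 1 / 36

def H(probs):
    """Entropy in bit of a collection of probabilities."""
    return -sum(p * np.log2(p) for p in probs if p > 0)

P_r, P_s = Counter(), Counter()
for (r, s), p in P_rs.items():               # marginal PMFs P_R and P_S
    P_r[r] += p
    P_s[s] += p

H_R, H_S, H_RS = H(P_r.values()), H(P_s.values()), H(P_rs.values())
print(H_R, H_S, H_RS)                        # 2.585 bit,  3.274 bit,  5.170 bit
print(H_RS - H_R)                            # H(S|R) = 2.585 bit
print(H_RS - H_S)                            # H(R|S) = 1.896 bit
print(H_R + H_S - H_RS)                      # I(R;S) = 0.689 bit
</pre>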


Conditional mutual information


We now consider three random variables  $X$,  $Y$  and  $Z$, which may be mutually dependent.

$\text{Definition:}$  The   conditional mutual information   between the random variables  $X$  and  $Y$  for a given  $Z = z$  is defined as follows:

$$I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z = z) = H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z = z) - H(X\hspace{0.05cm}\vert\hspace{0.05cm}Y,\hspace{0.05cm} Z = z) \hspace{0.05cm}.$$

The  conditional mutual information  between the random variables  $X$  and  $Y$  for the random variable  $Z$  is obtained in general by averaging over all  $z \in Z$:

$$I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z ) = H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z ) - H(X\hspace{0.05cm}\vert\hspace{0.05cm}YZ )= \hspace{-0.3cm} \sum_{z \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{Z})} \hspace{-0.25cm} P_{Z}(z) \cdot I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z = z) \hspace{0.05cm}.$$

Here,  $P_Z(Z)$  denotes the probability mass function  $\rm (PMF)$  of the random variable  $Z$,  and  $P_Z(z)$  denotes the probability of the realisation  $Z = z$.
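
A possible implementation of this averaging  (a sketch only;  the dictionary representation of  $P_{XYZ}$  and the function name are our own assumptions)  uses the identity  $P_{XY\hspace{0.03cm}\vert\hspace{0.03cm}Z}/(P_{X\hspace{0.03cm}\vert\hspace{0.03cm}Z} \cdot P_{Y\hspace{0.03cm}\vert\hspace{0.03cm}Z}) = P_{XYZ} \cdot P_Z/(P_{XZ} \cdot P_{YZ})$:

<pre>
import numpy as np

def cond_mutual_information(P_xyz):
    """I(X;Y|Z) in bit;  P_xyz is a dict mapping (x, y, z) to its probability."""
    P_z, P_xz, P_yz = {}, {}, {}
    for (x, y, z), p in P_xyz.items():       # accumulate the required marginals
        P_z[z]       = P_z.get(z, 0)       + p
        P_xz[(x, z)] = P_xz.get((x, z), 0) + p
        P_yz[(y, z)] = P_yz.get((y, z), 0) + p
    I = 0.0
    for (x, y, z), p in P_xyz.items():
        if p > 0:                            # sum only over supp(P_XYZ)
            I += p * np.log2(p * P_z[z] / (P_xz[(x, z)] * P_yz[(y, z)]))
    return I
</pre>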


$\text{Please note:}$ 

  • For the conditional entropy, as is well known, the relation   $H(X\hspace{0.05cm}\vert\hspace{0.05cm}Z) ≤ H(X)$  holds.
  • For the mutual information, this relation does not necessarily hold:
        $I(X;\ Y\hspace{0.05cm}\vert\hspace{0.05cm}Z)$  can be  smaller than,  equal to,  or larger than  $I(X;\ Y)$.


[Figure:  2D PMF  $P_{XZ}$]

$\text{Example 4:}$  We consider the binary random variables  $X$,  $Y$  and  $Z$  with the following properties:

  • Let  $X$  and  $Y$  be statistically independent, with the probability mass functions
$$P_X(X) = \big [1/2, \ 1/2 \big], \hspace{0.2cm} P_Y(Y) = \big[1- p, \ p \big] \ ⇒ \ H(X) = 1\ {\rm bit}, \hspace{0.2cm} H(Y) = H_{\rm bin}(p).$$
  • $Z$  is the modulo-2 sum of  $X$  and  $Y$:   $Z = X ⊕ Y$.


From the joint probability mass function  $P_{XZ}$  according to the upper graph, it follows:

  • Summing the column probabilities gives 
        $P_Z(Z) = \big [1/2, \ 1/2 \big ]$   ⇒   $H(Z) = 1\ {\rm bit}.$
  • $X$  and  $Z$  are also statistically independent, since the 2D PMF satisfies 
        $P_{XZ}(X,\ Z) = P_X(X) · P_Z(Z)$.
  • It follows that:
        $H(Z\hspace{0.05cm}\vert\hspace{0.05cm} X) = H(Z),\hspace{0.5cm}H(X \hspace{0.05cm}\vert\hspace{0.05cm} Z) = H(X),\hspace{0.5cm} I(X;\ Z) = 0.$


[Figure:  Conditional 2D PMF  $P_{X\hspace{0.05cm}\vert\hspace{0.05cm}YZ}$]


From the conditional probability mass function  $P_{X\vert YZ}$  according to the graph below, we can calculate:

  • $H(X\hspace{0.05cm}\vert\hspace{0.05cm} YZ) = 0$,  since all  $P_{X\hspace{0.05cm}\vert\hspace{0.05cm} YZ}$ entries are  $0$  or  $1$   ⇒   "conditional entropy",
  • $I(X; YZ) = H(X) - H(X\hspace{0.05cm}\vert\hspace{0.05cm} YZ) = H(X)= 1 \ {\rm bit}$   ⇒   "mutual information",
  • $I(X; Y\vert Z) = H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z) =H(X)=1 \ {\rm bit} $   ⇒   "conditional mutual information".


In the present example, the conditional mutual information  $I(X;\ Y\hspace{0.05cm}\vert\hspace{0.05cm} Z) = 1\ {\rm bit}$  is thus greater than the conventional mutual information  $I(X;\ Y) = 0$.
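
The values of this example can be reproduced with the following sketch  (an illustration only;  Python and the helper functions are our own constructs, and the symmetric case  $p = 1/2$  is assumed so that the factorisation  $P_{XZ} = P_X · P_Z$  used above holds exactly):

<pre>
import numpy as np

p = 0.5                                        # assumed value:  uniform P_Y = [1-p, p]
P_xyz = {}                                     # joint PMF of (X, Y, Z) with Z = X xor Y
for x in (0, 1):
    for y in (0, 1):
        P_xyz[(x, y, x ^ y)] = 0.5 * ((1 - p) if y == 0 else p)

def marg(P, keep):
    """Marginal PMF, keeping the tuple positions listed in 'keep'."""
    out = {}
    for k, pr in P.items():
        kk = tuple(k[i] for i in keep)
        out[kk] = out.get(kk, 0) + pr
    return out

def H(P):
    return -sum(pr * np.log2(pr) for pr in P.values() if pr > 0)

def I(P, a, b):
    """Mutual information between the tuple positions a and b."""
    return H(marg(P, a)) + H(marg(P, b)) - H(marg(P, a + b))

print(I(P_xyz, [0], [1]))                      # I(X;Y)  = 0
print(I(P_xyz, [0], [2]))                      # I(X;Z)  = 0
print(I(P_xyz, [0], [1, 2]))                   # I(X;YZ) = 1 bit
print(I(P_xyz, [0], [1, 2]) - I(P_xyz, [0], [2]))   # I(X;Y|Z) = I(X;YZ) - I(X;Z) = 1 bit
</pre>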


Chain rule of the mutual information


So far we have only considered the mutual information between two one-dimensional random variables.  Now we extend the definition to a total of  $n + 1$  random variables, which, purely for reasons of notation, we denote by  $X_1$,  ... ,  $X_n$  and  $Z$.  Then the following holds:

$\text{Chain rule of mutual information:}$ 

The mutual information between the  $n$–dimensional random variable  $X_1 X_2 \hspace{0.05cm}\text{...} \hspace{0.05cm} X_n$  and the random variable  $Z$  can be represented and calculated as follows:

$$I(X_1\hspace{0.05cm}X_2\hspace{0.05cm}\text{...} \hspace{0.1cm}X_n;Z) = I(X_1;Z) + I(X_2;Z \vert X_1) + \hspace{0.05cm}\text{...} \hspace{0.1cm}+ I(X_n;Z\vert X_1\hspace{0.05cm}X_2\hspace{0.05cm}\text{...} \hspace{0.1cm}X_{n-1}) = \sum_{i = 1}^{n} I(X_i;Z \vert X_1\hspace{0.05cm}X_2\hspace{0.05cm}\text{...} \hspace{0.1cm}X_{i-1}) \hspace{0.05cm}.$$

$\text{Proof:}$  We restrict ourselves here to the case  $n = 2$, i.e. to a total of three random variables, and replace  $X_1$  by $X$ and  $X_2$  by  $Y$.  Then we obtain:

$$\begin{align*}I(X\hspace{0.05cm}Y;Z) & = H(XY) - H(XY\hspace{0.05cm} \vert \hspace{0.05cm}Z) = \\ & = \big [ H(X)+ H(Y\hspace{0.05cm} \vert \hspace{0.05cm} X)\big ] - \big [ H(X\hspace{0.05cm} \vert \hspace{0.05cm} Z) + H(Y\hspace{0.05cm} \vert \hspace{0.05cm} XZ)\big ] =\\ & = \big [ H(X)- H(X\hspace{0.05cm} \vert \hspace{0.05cm} Z)\big ] + \big [ H(Y\hspace{0.05cm} \vert \hspace{0.05cm} X) - H(Y\hspace{0.05cm} \vert \hspace{0.05cm}XZ)\big ]=\\ & = I(X;Z) + I(Y;Z \hspace{0.05cm} \vert \hspace{0.05cm} X) \hspace{0.05cm}.\end{align*}$$


  • From this equation one can see that the relation  $I(X Y;\ Z) ≥ I(X;\ Z)$  always holds.
  • Equality holds if the conditional mutual information  $I(Y;\ Z \hspace{0.05cm} \vert \hspace{0.05cm} X) = 0$,  i.e. if the random variables  $Y$  and  $Z$  are statistically independent for a given  $X$.
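
The chain rule can also be checked numerically.  The following sketch  (our own illustration;  binary alphabets and a randomly drawn joint PMF  $P_{XYZ}$  are arbitrary assumptions)  verifies  $I(XY;\ Z) = I(X;\ Z) + I(Y;\ Z \hspace{0.05cm}\vert\hspace{0.05cm} X)$:

<pre>
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((2, 2, 2)); P /= P.sum()        # joint PMF P_XYZ;  axes: x, y, z

def H(P):
    P = P[P > 0]
    return -np.sum(P * np.log2(P))

H_xyz = H(P)
H_xy, H_xz = H(P.sum(2)), H(P.sum(1))          # H(XY), H(XZ)
H_x,  H_z  = H(P.sum((1, 2))), H(P.sum((0, 1)))   # H(X), H(Z)

I_xy_z = H_xy + H_z - H_xyz                    # I(XY;Z)
I_x_z  = H_x + H_z - H_xz                      # I(X;Z)
# I(Y;Z|X) = H(Y|X) - H(Y|XZ) = [H(XY) - H(X)] - [H(XYZ) - H(XZ)]
I_y_z_given_x = (H_xy - H_x) - (H_xyz - H_xz)

print(I_xy_z, I_x_z + I_y_z_given_x)           # both values agree
</pre>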


$\text{Example 5:}$  We consider the  Markov chain   $X → Y → Z$.  For such a constellation, the  "Data Processing Theorem"  always holds with the following consequence, which can be derived from the chain rule of mutual information:

$$I(X;Z) \hspace{-0.05cm} \le \hspace{-0.05cm}I(X;Y ) \hspace{0.05cm},$$
$$I(X;Z) \hspace{-0.05cm} \le \hspace{-0.05cm} I(Y;Z ) \hspace{0.05cm}.$$

The theorem thus states:

  • One cannot gain any additional information about the input  $X$  by processing the data  $Y$   ⇒   processing  $Y → Z$.
  • Data processing  $Y → Z$  $($by a second processor$)$  can at best serve to make the information about  $X$  more accessible.


For more information on the  "Data Processing Theorem"  see  Exercise 3.15.
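
As a numerical illustration of the Data Processing Theorem  (a sketch under our own assumptions:  two cascaded binary symmetric channels with arbitrarily chosen error probabilities form the Markov chain  $X → Y → Z$):

<pre>
import numpy as np

def H(P):
    P = P[P > 0]
    return -np.sum(P * np.log2(P))

def mutual_information(P_xy):
    P_x, P_y = P_xy.sum(1), P_xy.sum(0)
    return H(P_x) + H(P_y) - H(P_xy)

p1, p2 = 0.1, 0.2                               # assumed error probabilities
BSC1 = np.array([[1 - p1, p1], [p1, 1 - p1]])   # transition matrix P_{Y|X}
BSC2 = np.array([[1 - p2, p2], [p2, 1 - p2]])   # transition matrix P_{Z|Y}

P_x  = np.array([0.5, 0.5])                     # uniform input
P_xy = P_x[:, None] * BSC1                      # joint PMF of (X, Y)
P_xz = P_x[:, None] * (BSC1 @ BSC2)             # joint PMF of (X, Z)

print(mutual_information(P_xy), mutual_information(P_xz))   # I(X;Y) >= I(X;Z)
</pre>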


Exercises for the chapter


Exercise 3.7: Some Entropy Calculations

Exercise 3.8: Once more Mutual Information

Exercise 3.8Z: Tuples from Ternary Random Variables

Exercise 3.9: Conditional Mutual Information