Difference between revisions of "Information Theory/Different Entropy Measures of Two-Dimensional Random Variables"

Revision as of 20:29, 3 April 2021

Definition of entropy using supp(P_XY)

We briefly summarise the results of the last chapter again, assuming the two-dimensional random variable $XY$ with the probability function $P_{XY}(X,\ Y)$ . At the same time we use the notation

$${\rm supp} (P_{XY}) = \big \{ \hspace{0.05cm}(x,\ y) \in XY \hspace{0.05cm}, \hspace{0.3cm} {\rm where} \hspace{0.15cm} P_{XY}(x,\ y) \ne 0 \hspace{0.05cm} \big \} \hspace{0.05cm}.$$

$\text{Summarising the last chapter:}$

With this subset $\text{supp}(P_{XY}) ⊂ P_{XY}$ , the following holds for

the joint entropy :

$$H(XY) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{XY}(X, Y)}\right ] =\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.05cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{XY}(x, y)} \hspace{0.05cm}.$$

the entropies of the 1D random variables $X$ and $Y$:

$$H(X) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{X}(X)}\right ] =\hspace{-0.2cm} \sum_{x \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{X})} \hspace{-0.2cm} P_{X}(x) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{X}(x)} \hspace{0.05cm},$$

$$H(Y) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{Y}(Y)}\right ] =\hspace{-0.2cm} \sum_{y \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{Y})} \hspace{-0.2cm} P_{Y}(y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{Y}(y)} \hspace{0.05cm}.$$

$\text{Example 1:}$ We refer again to the examples on the joint probability and joint entropy in the last chapter. For the 2D probability function $P_{RB}(R, B)$ in $\text{example 5}$ there with the parameters

$R$ ⇒ number of the red die and
$B$ ⇒ number of the blue die

the sets $P_{RB}$ and $\text{supp}(P_{RB})$ are identical. Here, all $6^2 = 36$ squares are occupied by non-zero values.

For the 2D probability function $P_{RS}(R, S)$ in $\text{example 6}$ with the parameters

$R$ ⇒ number of the red die
$S = R + B$ ⇒ sum of both dice

there are $6 · 11 = 66$ squares, many of which, however, are empty, i.e. stand for the probability "0".

The subset $\text{supp}(P_{RS})$ , on the other hand, contains only the $36$ shaded squares with non-zero probabilities.
The entropy remains the same no matter whether one averages over all elements of $P_{RS}$ or only over the elements of $\text{supp}(P_{RS})$, since $x · \log_2 ({1}/{x}) → 0$ for $x → 0$.
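The statement of Example 1 can be checked numerically. The following is a minimal sketch, assuming two fair dice; the names `P_RS`, `support` and `term` are illustrative and not part of the original text. It builds all $66$ squares of $P_{RS}$, extracts the support, and sums the entropy contributions both ways.

```python
from fractions import Fraction
from math import log2

# All 6 * 11 = 66 squares (r, s) of P_RS, initially empty (assumption: fair dice).
P_RS = {(r, s): Fraction(0) for r in range(1, 7) for s in range(2, 13)}
for r in range(1, 7):
    for b in range(1, 7):
        P_RS[(r, r + b)] += Fraction(1, 36)  # S = R + B

# supp(P_RS): only the squares with non-zero probability.
support = {key: p for key, p in P_RS.items() if p != 0}

def term(p):
    """Contribution p * log2(1/p) in bit; empty squares contribute 0 (limit x -> 0)."""
    return 0.0 if p == 0 else float(p) * log2(1 / float(p))

H_all = sum(term(p) for p in P_RS.values())      # average over all 66 squares
H_supp = sum(term(p) for p in support.values())  # average over the support only

print(len(P_RS), len(support), round(H_supp, 3))
```

Both sums agree, because the empty squares add nothing to the total.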

Conditional probability and conditional entropy

In the book "Theory of Stochastic Signals" the following conditional probabilities were given for the case of two events $X$ and $Y$ ⇒ Bayes' theorem:

$${\rm Pr} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y) = \frac{{\rm Pr} (X \cap Y)}{{\rm Pr} (Y)} \hspace{0.05cm}, \hspace{0.5cm} {\rm Pr} (Y \hspace{-0.05cm}\mid \hspace{-0.05cm} X) = \frac{{\rm Pr} (X \cap Y)}{{\rm Pr} (X)} \hspace{0.05cm}.$$

Applied to probability functions, one thus obtains:

$$P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y) = \frac{P_{XY}(X, Y)}{P_{Y}(Y)} \hspace{0.05cm}, \hspace{0.5cm} P_{\hspace{0.03cm}Y \mid \hspace{0.03cm} X} (Y \hspace{-0.05cm}\mid \hspace{-0.05cm} X) = \frac{P_{XY}(X, Y)}{P_{X}(X)} \hspace{0.05cm}.$$

Analogous to the joint entropy $H(XY)$ , the following entropy functions can be derived here:

$\text{Definitions:}$

The conditional entropy of the random variable $X$ under condition $Y$ is:

$$H(X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm}\frac{1}{P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y)}\right ] = \hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (x \hspace{-0.05cm}\mid \hspace{-0.05cm} y)}=\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{P_{Y}(y)}{P_{XY}(x, y)} \hspace{0.05cm}.$$

Similarly, for the second conditional entropy we obtain:

$$H(Y \hspace{-0.1cm}\mid \hspace{-0.05cm} X) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm}\frac{1}{P_{\hspace{0.03cm}Y\hspace{0.03cm} \mid \hspace{0.01cm} X} (Y \hspace{-0.08cm}\mid \hspace{-0.05cm}X)}\right ] =\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{\hspace{0.03cm}Y\hspace{-0.03cm} \mid \hspace{-0.01cm} X} (y \hspace{-0.05cm}\mid \hspace{-0.05cm} x)}=\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{P_{X}(x)}{P_{XY}(x, y)} \hspace{0.05cm}.$$

In the argument of the logarithm function there is always a conditional probability function ⇒ $P_{X\hspace{0.03cm}| \hspace{0.03cm}Y}(·)$ or $P_{Y\hspace{0.03cm}|\hspace{0.03cm}X}(·)$, while the joint probability ⇒ $P_{XY}(·)$ is needed to form the expected value.

For the conditional entropies, there are the following limitations:

Both $H(X|Y)$ and $H(Y|X)$ are always greater than or equal to zero. $H(X|Y) = 0$ holds exactly when $X$ is completely determined by $Y$; this does not by itself imply $H(Y|X) = 0$. Both conditional entropies are zero only if $X$ and $Y$ determine each other uniquely.
$H(X|Y) ≤ H(X)$ and $H(Y|X) ≤ H(Y)$ always apply. These statements are plausible if one realises that "uncertainty" can be used synonymously for "entropy": the uncertainty with respect to the quantity $X$ cannot be increased by knowing $Y$.
Except in the case of statistical independence ⇒ $H(X|Y) = H(X)$, $H(X|Y) < H(X)$ always holds. Because of $H(X) ≤ H(XY)$ and $H(Y) ≤ H(XY)$, $H(X|Y) ≤ H(XY)$ and $H(Y|X) ≤ H(XY)$ therefore also hold. Thus, a conditional entropy can never become larger than the joint entropy.

$\text{Example 2:}$ We consider the joint probabilities $P_{RS}(·)$ of our dice experiment, which were determined in the last chapter as $\text{example 6}$ . $P_{RS}(·)$ is given again in the middle of the following graph.

Joint probabilities $P_{RS}$ and conditional probabilities $P_{S \vert R}$ and $P_{R \vert S}$

The two conditional probability functions are drawn on the outside:

Given on the left is the conditional probability function $P_{S \vert R}(⋅) = P_{RS}(⋅)/P_R(⋅)$. Because of $P_R(R) = \big [1/6, \ 1/6, \ 1/6, \ 1/6, \ 1/6, \ 1/6 \big ]$, all shaded fields here contain the same probability value $1/6$ ⇒ $\text{supp}(P_{S\vert R}) = \text{supp}(P_{R\vert S})$ . From this follows for the conditional entropy:

$$H(S \hspace{-0.1cm}\mid \hspace{-0.13cm} R) = \hspace{-0.2cm} \sum_{(r, s) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{RS})} \hspace{-0.6cm} P_{RS}(r, s) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{\hspace{0.03cm}S \hspace{0.03cm} \mid \hspace{0.03cm} R} (s \hspace{-0.05cm}\mid \hspace{-0.05cm} r)} = 36 \cdot \frac{1}{36} \cdot {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit} \hspace{0.05cm}.$$

On the right, the conditional probability function $P_{R\vert S}(⋅) = P_{RS}(⋅)/P_S(⋅)$ is given with $P_S(⋅)$ according to $\text{example 6}$ . The same non-zero fields result ⇒ $\text{supp}(P_{R\vert S}) = \text{supp}(P_{S\vert R})$. However, the probability values now increase continuously from the centre $(1/6)$ towards the edges up to probability $1$ in the corners. It follows that:

$$H(R \hspace{-0.1cm}\mid \hspace{-0.13cm} S) = \frac{6}{36} \cdot {\rm log}_2 \hspace{0.1cm} (6) + \frac{2}{36} \cdot \sum_{i=1}^5 \big [ i \cdot {\rm log}_2 \hspace{0.1cm} (i) \big ]= 1.896\ {\rm bit} \hspace{0.05cm}.$$

On the other hand, for the conditional probabilities of the 2D random variable $RB$ according to $\text{example 5}$ , one obtains because of $P_{RB}(⋅) = P_R(⋅) · P_B(⋅)$:

$$\begin{align*}H(B \hspace{-0.1cm}\mid \hspace{-0.13cm} R) \hspace{-0.15cm} & = \hspace{-0.15cm} H(B) = {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit} \hspace{0.05cm},\\ H(R \hspace{-0.1cm}\mid \hspace{-0.13cm} B) \hspace{-0.15cm} & = \hspace{-0.15cm} H(R) = {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit} \hspace{0.05cm}.\end{align*}$$
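The conditional entropies of Example 2 can be reproduced directly from the defining sum over $\text{supp}(P_{XY})$. The sketch below assumes two fair dice; the helper names `marginal` and `cond_entropy` are hypothetical and chosen only for this illustration.

```python
from math import log2

def marginal(pxy, axis):
    """1D PMF of component `axis` (0 or 1) of a 2D PMF given as {(a, b): p}."""
    out = {}
    for key, p in pxy.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def cond_entropy(pxy, given):
    """Entropy (in bit) of the other component, conditioned on component `given`.

    Implements sum over supp(P_XY) of P_XY(x, y) * log2(P_given / P_XY(x, y)).
    """
    p_given = marginal(pxy, given)
    return sum(p * log2(p_given[key[given]] / p) for key, p in pxy.items() if p > 0)

# Joint PMFs of the dice experiment (assumption: fair dice).
P_RB, P_RS = {}, {}
for r in range(1, 7):
    for b in range(1, 7):
        P_RB[(r, b)] = 1 / 36
        P_RS[(r, r + b)] = 1 / 36

H_S_given_R = cond_entropy(P_RS, given=0)  # H(S|R)
H_R_given_S = cond_entropy(P_RS, given=1)  # H(R|S)
H_B_given_R = cond_entropy(P_RB, given=0)  # H(B|R)

print(round(H_S_given_R, 3), round(H_R_given_S, 3), round(H_B_given_R, 3))
```

Only the joint PMF is passed in; the marginals are derived from it, which mirrors the remark that the joint probability is needed for the expectation while the conditional probability appears inside the logarithm.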

Mutual information between two random variables

We consider the random variable $XY$ with the 2D probability function $P_{XY}(X, Y)$. Let the 1D functions $P_X(X)$ and $P_Y(Y)$ also be known.

Now the following questions arise:

How does knowledge of the random variable $Y$ reduce the uncertainty with respect to $X$?
How does knowledge of the random variable $X$ reduce the uncertainty with respect to $Y$?

To answer these questions, we need a definition that is fundamental to information theory:

$\text{Definition:}$ The mutual information between the random variables $X$ and $Y$ – both over the same alphabet – is given as follows:

$$I(X;\ Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.08cm} \frac{P_{XY}(X, Y)} {P_{X}(X) \cdot P_{Y}(Y) }\right ] =\hspace{-0.25cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY})} \hspace{-0.8cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.08cm} \frac{P_{XY}(x, y)} {P_{X}(x) \cdot P_{Y}(y) } \hspace{0.01cm}.$$

A comparison with the last chapter shows that the mutual information can also be written as a Kullback–Leibler distance between the 2D PMF $P_{XY}$ and the product $P_X · P_Y$ :

$$I(X;Y) = D(P_{XY} \hspace{0.05cm}\vert \vert \hspace{0.05cm} P_X \cdot P_Y) \hspace{0.05cm}.$$

It is thus obvious that $I(X;\ Y) ≥ 0$ always holds. Because of the symmetry, $I(Y;\ X)$ = $I(X;\ Y)$ is also true.

By splitting the $\log_2$ argument according to

$$I(X;Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm} \frac{1} {P_{X}(X) }\right ] - {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm} \frac {P_{Y}(Y) }{P_{XY}(X, Y)} \right ] $$

one obtains, using $P_{X|Y}(\cdot) = P_{XY}(\cdot)/P_Y(\cdot)$:

$$I(X;Y) = H(X) - H(X \hspace{-0.1cm}\mid \hspace{-0.1cm} Y) \hspace{0.05cm}.$$

This means: Knowing $Y$ reduces the uncertainty regarding the random quantity $X$ from the entropy $H(X)$ to the remaining uncertainty $H(X|Y)$. The reduction is the mutual information $I(X; Y)$.
With a different splitting, one arrives at the result

$$I(X;Y) = H(Y) - H(Y \hspace{-0.1cm}\mid \hspace{-0.1cm} X) \hspace{0.05cm}.$$

Ergo: The mutual information $I(X; Y)$ is symmetrical ⇒ $X$ says just as much about $Y$ as $Y$ says about $X$ ⇒ mutual information. The semicolon indicates that both random variables have equal standing.

$\text{Conclusion:}$ Often the equations mentioned here are clarified by a diagram, as in the following examples. From this you can see that the following equations also apply:

$$I(X;\ Y) = H(X) + H(Y) - H(XY) \hspace{0.05cm},$$

$$I(X;\ Y) = H(XY) - H(X \hspace{-0.1cm}\mid \hspace{-0.1cm} Y) - H(Y \hspace{-0.1cm}\mid \hspace{-0.1cm} X) \hspace{0.05cm}.$$
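Both relations can also be verified algebraically rather than read off a diagram. The following short derivation is a sketch that uses only the definition of $I(X; Y)$ and the identity $H(XY) = H(Y) + H(X \vert Y)$, which follows from the definition of the conditional entropy:

$$\begin{align*} I(X;Y) & = {\rm E} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{X}(X)}\right ] + {\rm E} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{Y}(Y)}\right ] - {\rm E} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{XY}(X, Y)}\right ] = H(X) + H(Y) - H(XY) \hspace{0.05cm},\\ H(XY) - H(X \mid Y) - H(Y \mid X) & = \big [ H(Y) + H(X \mid Y) \big ] - H(X \mid Y) - H(Y \mid X) = H(Y) - H(Y \mid X) = I(X;Y) \hspace{0.05cm}.\end{align*}$$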

$\text{Example 3:}$ We return (for the last time) to the dice experiment with the red $(R)$ and blue $(B)$ dice. The random variable $S$ gives the sum of the two dice: $S = R + B$. Here we consider the 2D random variable $RS$. In earlier examples we calculated

the entropies $H(R) = 2.585 \ \rm bit$ and $H(S) = 3.274 \ \rm bit$ ⇒ example 6 in the last chapter,
the joint entropy $H(RS) = 5.170 \ \rm bit$ ⇒ example 6 in the last chapter,
the conditional entropies $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R) = 2.585 \ \rm bit$ and $H(R \hspace{0.05cm} \vert \hspace{0.05cm} S) = 1.896 \ \rm bit$ ⇒ example 2 in the previous section.

Diagram of all entropies of the "dice experiment"

These quantities are compiled in the graph, with the random variable $R$ marked by the basic colour "red" and the sum $S$ by the basic colour "green". Conditional entropies are hatched. One can see from this representation:

The entropy $H(R) = \log_2 (6) = 2.585\ \rm bit$ is exactly half as large as the joint entropy $H(RS)$. The reason: if one knows $R$, then $S$ provides exactly the same information as the random variable $B$, namely $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R) = H(B) = \log_2 (6) = 2.585\ \rm bit$. Note: $H(R)$ = $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R)$ holds only in this example, not in general.
As expected, the entropy $H(S) = 3.274 \ \rm bit$ is greater than $H(R)= 2.585\ \rm bit$ in the present example. Because of $H(S) + H(R \hspace{0.05cm} \vert \hspace{0.05cm} S) = H(R) + H(S \hspace{0.05cm} \vert \hspace{0.05cm} R)$, $H(R \hspace{0.05cm} \vert \hspace{0.05cm} S)$ must therefore be smaller than $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R)$ by the same amount $I(R;\ S) = 0.689 \ \rm bit$ by which $H(R)$ is smaller than $H(S)$.
The mutual information between the random variables $R$ and $S$ also results from the equation

$$I(R;\ S) = H(R) + H(S) - H(RS) = 2.585\ {\rm bit} + 3.274\ {\rm bit} - 5.170\ {\rm bit} = 0.689\ {\rm bit} \hspace{0.05cm}. $$
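The value $I(R;\ S) = 0.689 \ \rm bit$ can be cross-checked by evaluating the mutual information in two ways: as the Kullback–Leibler distance between $P_{RS}$ and $P_R · P_S$, and as $H(R) + H(S) - H(RS)$. This is a minimal sketch assuming fair dice; the function names are illustrative only.

```python
from math import log2

def marginal(pxy, axis):
    """1D PMF of component `axis` of a 2D PMF given as {(a, b): p}."""
    out = {}
    for key, p in pxy.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def entropy(pmf):
    """Entropy in bit, summed over the support only."""
    return sum(p * log2(1 / p) for p in pmf.values() if p > 0)

# Joint PMF of red die R and sum S = R + B (assumption: fair dice).
P_RS = {}
for r in range(1, 7):
    for b in range(1, 7):
        P_RS[(r, r + b)] = 1 / 36

P_R = marginal(P_RS, 0)
P_S = marginal(P_RS, 1)

# Definition: I(R;S) = D(P_RS || P_R * P_S)
I_kl = sum(p * log2(p / (P_R[r] * P_S[s])) for (r, s), p in P_RS.items() if p > 0)

# Diagram relation: I(R;S) = H(R) + H(S) - H(RS)
I_ent = entropy(P_R) + entropy(P_S) - entropy(P_RS)

print(round(I_kl, 3), round(I_ent, 3))
```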

Conditional mutual information

We now consider three random variables $X$, $Y$ and $Z$ that can be related to each other.

$\text{Definition:}$ The conditional mutual information between the random variables $X$ and $Y$ for a given $Z = z$ is:

$$I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z = z) = H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z = z) - H(X\vert\hspace{0.05cm}Y ,\hspace{0.05cm} Z = z) \hspace{0.05cm}.$$

Averaging over all $z \in Z$ gives the conditional mutual information between the random variables $X$ and $Y$ for a given random variable $Z$:

$$I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z ) = H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z ) - H(X\vert\hspace{0.05cm}Y Z )= \hspace{-0.3cm} \sum_{z \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{Z})} \hspace{-0.25cm} P_{Z}(z) \cdot I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z = z) \hspace{0.05cm}.$$

$P_Z(Z)$ is the probability mass function (PMF) of the random variable $Z$ and $P_Z(z)$ is the probability of the realisation $Z = z$.

$\text{Please note:}$

For the conditional entropy, the relation $H(X\hspace{0.05cm}\vert\hspace{0.05cm}Z) ≤ H(X)$ is known to hold.
For the mutual information, this relation does not necessarily hold:
$I(X; Y\hspace{0.05cm}\vert\hspace{0.05cm}Z)$ can be smaller than, equal to, or even larger than $I(X; Y)$.

2D PMF $P_{XZ}$

$\text{Example 4:}$ We consider the binary random variables $X$, $Y$ and $Z$ with the following properties:

Let $X$ and $Y$ be statistically independent, with the probability functions

$$P_X(X) = \big [1/2, \ 1/2 \big], \hspace{0.2cm} P_Y(Y) = \big[1 - p, \ p \big] \ ⇒ \ H(X) = 1\ {\rm bit}, \hspace{0.2cm} H(Y) = H_{\rm bin}(p).$$

$Z$ is the modulo-2 sum of $X$ and $Y$: $Z = X ⊕ Y$.

From the joint probability function $P_{XZ}$ according to the upper graph, it follows:

Summing the column probabilities gives $P_Z(Z) = \big [1/2, \ 1/2 \big ]$ ⇒ $H(Z) = 1\ {\rm bit}$.
Because of $Z = X ⊕ Y$, the 2D PMF is $P_{XZ}(x, z) = P_X(x) · P_Y(x ⊕ z)$. This equals $P_X(x) · P_Z(z)$ only for $p = 1/2$; in that case $X$ and $Z$ are also statistically independent.
For $p = 1/2$ it follows that $H(Z\hspace{0.05cm}\vert\hspace{0.05cm} X) = H(Z)$ and $H(X \hspace{0.05cm}\vert\hspace{0.05cm} Z) = H(X)$ as well as $I(X; Z) = 0$. In general, $H(X \hspace{0.05cm}\vert\hspace{0.05cm} Z) = H(Z\hspace{0.05cm}\vert\hspace{0.05cm} X) = H_{\rm bin}(p)$.

Conditional 2D PMF $P_{X\hspace{0.05cm}\vert\hspace{0.05cm}YZ}$

From the conditional probability function $P_{X\vert YZ}$ according to the lower graph, one can calculate:

$H(X\hspace{0.05cm}\vert\hspace{0.05cm} YZ) = 0$, since all $P_{X\hspace{0.05cm}\vert\hspace{0.05cm} YZ}$ entries are either $0$ or $1$ ⇒ conditional entropy,
$I(X; YZ) = H(X) - H(X\hspace{0.05cm}\vert\hspace{0.05cm} YZ) = H(X)= 1 \ {\rm bit}$ ⇒ mutual information,
$I(X; Y\vert Z) = H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z) - H(X\hspace{0.05cm}\vert\hspace{0.05cm} YZ) = H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z) = H_{\rm bin}(p)$, which equals $H(X)=1 \ {\rm bit}$ for $p = 1/2$ ⇒ conditional mutual information.

In the present example with $p = 1/2$,

the conditional mutual information $I(X; Y\hspace{0.05cm}\vert\hspace{0.05cm} Z) = 1 \ {\rm bit}$
is therefore greater than the conventional mutual information $I(X; Y) = 0$.
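The quantities of this example can be evaluated numerically from the three-dimensional PMF $P_{XYZ}$. The sketch below assumes $p = 1/2$ as one concrete choice of the parameter; the helpers `joint_xyz` and `H` are hypothetical names introduced only for this illustration. All conditional quantities are expressed through joint entropies, e.g. $H(X \vert Z) = H(XZ) - H(Z)$.

```python
from itertools import product
from math import log2

def joint_xyz(p):
    """P_XYZ for X uniform, Y with P(Y=1) = p, X and Y independent, Z = X xor Y."""
    pmf = {}
    for x, y in product((0, 1), repeat=2):
        p_y = p if y == 1 else 1 - p
        pmf[(x, y, x ^ y)] = 0.5 * p_y
    return pmf

def H(pmf, keep):
    """Joint entropy (bit) of the components listed in `keep`."""
    marg = {}
    for key, prob in pmf.items():
        sub = tuple(key[i] for i in keep)
        marg[sub] = marg.get(sub, 0.0) + prob
    return sum(q * log2(1 / q) for q in marg.values() if q > 0)

X, Y, Z = 0, 1, 2
P = joint_xyz(0.5)  # assumption: p = 1/2

# I(X;Y) = H(X) + H(Y) - H(XY)
I_XY = H(P, [X]) + H(P, [Y]) - H(P, [X, Y])
# H(X|YZ) = H(XYZ) - H(YZ)
H_X_given_YZ = H(P, [X, Y, Z]) - H(P, [Y, Z])
# I(X;Y|Z) = H(X|Z) - H(X|YZ) = [H(XZ) - H(Z)] - [H(XYZ) - H(YZ)]
I_XY_given_Z = (H(P, [X, Z]) - H(P, [Z])) - H_X_given_YZ

print(I_XY, H_X_given_YZ, I_XY_given_Z)
```

Other values of `p` can be passed to `joint_xyz` to explore how the conditional mutual information depends on the PMF of $Y$.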

Chain rule of mutual information

So far we have considered the mutual information only between two one-dimensional random variables. Now we extend the definition to a total of $n + 1$ random variables, which, for reasons of presentation, we denote by $X_1$, ... , $X_n$ and $Z$. Then the following applies:

$\text{Chain rule of mutual information:}$

The mutual information between the $n$-dimensional random variable $X_1 X_2 \hspace{0.05cm}\text{...} \hspace{0.05cm} X_n$ and the random variable $Z$ can be represented and calculated as follows:

$$I(X_1\hspace{0.05cm}X_2\hspace{0.05cm}\text{...} \hspace{0.1cm}X_n;Z) = I(X_1;Z) + I(X_2;Z \vert X_1) + \hspace{0.05cm}\text{...} \hspace{0.1cm}+ I(X_n;Z\vert X_1\hspace{0.05cm}X_2\hspace{0.05cm}\text{...} \hspace{0.1cm}X_{n-1}) = \sum_{i = 1}^{n} I(X_i;Z \vert X_1\hspace{0.05cm}X_2\hspace{0.05cm}\text{...} \hspace{0.1cm}X_{i-1}) \hspace{0.05cm}.$$

$\text{Proof:}$ We restrict ourselves here to the case $n = 2$, i.e. to a total of three random variables, and replace $X_1$ by $X$ and $X_2$ by $Y$. Then we obtain:

$$\begin{align*}I(X\hspace{0.05cm}Y;Z) & = H(XY) - H(XY\hspace{0.05cm} \vert \hspace{0.05cm}Z) = \\ & = \big [ H(X)+ H(Y\hspace{0.05cm} \vert \hspace{0.05cm} X)\big ] - \big [ H(X\hspace{0.05cm} \vert \hspace{0.05cm} Z) + H(Y\hspace{0.05cm} \vert \hspace{0.05cm} XZ)\big ] =\\ & = \big [ H(X)- H(X\hspace{0.05cm} \vert \hspace{0.05cm} Z)\big ] + \big [ H(Y\hspace{0.05cm} \vert \hspace{0.05cm} X) - H(Y\hspace{0.05cm} \vert \hspace{0.05cm}XZ)\big ]=\\ & = I(X;Z) + I(Y;Z \hspace{0.05cm} \vert \hspace{0.05cm} X) \hspace{0.05cm}.\end{align*}$$

From this equation one can see that the relation $I(X Y; Z) ≥ I(X; Z)$ always holds.

Equality results for the conditional mutual information $I(Y; Z \hspace{0.05cm} \vert \hspace{0.05cm} X) = 0$,
that is, when the random variables $Y$ and $Z$ are statistically independent for a given $X$.
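The chain rule for $n = 2$ can be checked numerically on any three-dimensional PMF. The sketch below uses an arbitrary, purely hypothetical PMF over binary $(x, y, z)$ whose weights are chosen only so that they sum to one; the helper `H` is an illustrative name. Each mutual information is written in terms of joint entropies.

```python
from math import log2

def H(pmf, keep):
    """Joint entropy (bit) of the components listed in `keep`."""
    marg = {}
    for key, prob in pmf.items():
        sub = tuple(key[i] for i in keep)
        marg[sub] = marg.get(sub, 0.0) + prob
    return sum(q * log2(1 / q) for q in marg.values() if q > 0)

# Hypothetical joint PMF P_XYZ: weights 1..8 normalised by their sum 36.
keys = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
P = {key: weight / 36 for key, weight in zip(keys, range(1, 9))}

X, Y, Z = 0, 1, 2

# I(XY;Z) = H(XY) + H(Z) - H(XYZ)
I_XY_Z = H(P, [X, Y]) + H(P, [Z]) - H(P, [X, Y, Z])
# I(X;Z) = H(X) + H(Z) - H(XZ)
I_X_Z = H(P, [X]) + H(P, [Z]) - H(P, [X, Z])
# I(Y;Z|X) = H(Y|X) - H(Y|XZ) = [H(XY) - H(X)] - [H(XYZ) - H(XZ)]
I_Y_Z_given_X = (H(P, [X, Y]) - H(P, [X])) - (H(P, [X, Y, Z]) - H(P, [X, Z]))

print(round(I_XY_Z, 6), round(I_X_Z + I_Y_Z_given_X, 6))
```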

$\text{Example 5:}$ We consider the Markov chain $X → Y → Z$. For such a constellation, the Data Processing Theorem always holds, with the following consequence, which can be derived from the chain rule of mutual information:

$$I(X;Z) \hspace{-0.05cm} \le \hspace{-0.05cm}I(X;Y ) \hspace{0.05cm},$$

$$I(X;Z) \hspace{-0.05cm} \le \hspace{-0.05cm} I(Y;Z ) \hspace{0.05cm}.$$

The theorem thus states:

No additional information about the input $X$ can be gained by manipulating the data $Y$ $($processing $Z)$.
The data processing $Y → Z$ $($by a second processor$)$ only serves the purpose of making the information about $X$ more visible.

For further information on the Data Processing Theorem, see Exercise 3.15.
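The two inequalities of the Data Processing Theorem can be illustrated with a simple Markov chain. The sketch below is an assumption made for illustration, not taken from the text: $X$ is a uniform binary input, and $X → Y$ and $Y → Z$ are modelled as two cascaded binary symmetric channels with hypothetical crossover probabilities $0.1$ and $0.2$.

```python
from math import log2

def mutual_information(pab):
    """I(A;B) in bit for a 2D PMF given as {(a, b): p}."""
    pa, pb = {}, {}
    for (a, b), p in pab.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b])) for (a, b), p in pab.items() if p > 0)

def bsc(eps):
    """Transition probabilities P(out | in) of a binary symmetric channel."""
    return {(i, o): (1 - eps if i == o else eps) for i in (0, 1) for o in (0, 1)}

def pair(pxyz, i, j):
    """2D marginal of components i and j of a 3D PMF."""
    out = {}
    for key, p in pxyz.items():
        sub = (key[i], key[j])
        out[sub] = out.get(sub, 0.0) + p
    return out

# Markov chain X -> Y -> Z: Z depends on X only through Y.
P_X = {0: 0.5, 1: 0.5}
ch1, ch2 = bsc(0.1), bsc(0.2)  # hypothetical crossover probabilities
P_XYZ = {
    (x, y, z): P_X[x] * ch1[(x, y)] * ch2[(y, z)]
    for x in (0, 1) for y in (0, 1) for z in (0, 1)
}

I_XY = mutual_information(pair(P_XYZ, 0, 1))
I_YZ = mutual_information(pair(P_XYZ, 1, 2))
I_XZ = mutual_information(pair(P_XYZ, 0, 2))

print(round(I_XZ, 3), round(I_XY, 3), round(I_YZ, 3))
```

In this sketch the second channel only degrades what $Y$ carries about $X$, so $I(X;Z)$ stays below both $I(X;Y)$ and $I(Y;Z)$.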

Exercises for the chapter

Exercise 3.7: Some Entropy Calculations

Exercise 3.8: Once more Mutual Information

Exercise 3.8Z: Tuples from Ternary Random Variables

Exercise 3.9: Conditional Mutual Information

@@ Line 33: / Line 33: @@
 {{GraueBox|TEXT=
 $\text{Example 1:}$&nbsp; We refer again to the examples on the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|joint probability and joint entropy]]&nbsp; in the last chapter.
-For the 2D probability function&nbsp; $P_{RB}(R, B)$&nbsp; in&nbsp;  [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Verbundwahrscheinlichkeit_und_Verbundentropie|$\text{example 5}$]]&nbsp; there with the parameters
+For the 2D probability function&nbsp; $P_{RB}(R, B)$&nbsp; in&nbsp;  [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|$\text{example 5}$]]&nbsp; there with the parameters
 *$R$ &nbsp; &rArr; &nbsp;  numbers of the red die and
 *$B$ &nbsp; &rArr; &nbsp;  number of the blue die
@@ Line 40: / Line 40: @@
 the sets&nbsp; $P_{RB}$&nbsp; and&nbsp; $\text{supp}(P_{RB})$&nbsp; are identical.&nbsp; Here, all&nbsp; $6^2 = 36$&nbsp; squares are occupied by non-zero values.
-For the 2D probability function&nbsp; $P_{RS}(R, S)$&nbsp;  in&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Verbundwahrscheinlichkeit_und_Verbundentropie|$\text{example 6}$]]&nbsp; mit den Parametern
+For the 2D probability function&nbsp; $P_{RS}(R, S)$&nbsp;  in&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|$\text{example 6}$]]&nbsp; mit den Parametern
 *$R$ &nbsp; &rArr; &nbsp;  numbers of the red die
 *$S = R + B$ &nbsp; &rArr; &nbsp; sum of both dice
@@ Line 107: / Line 107: @@
 \frac{2}{36} \cdot  \sum_{i=1}^5 \big [ i \cdot {\rm log}_2 \hspace{0.1cm} (i) \big ]= 1.896\ {\rm bit} \hspace{0.05cm}.$$
-On the other hand, for the conditional probabilities of the 2D random variable&nbsp; $RB$&nbsp; according to&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Verbundwahrscheinlichkeit_und_Verbundentropie|$\text{example 5}$]]&nbsp;, one obtains because of&nbsp; $P_{RB}(⋅) = P_R(⋅) · P_B(⋅)$:
+On the other hand, for the conditional probabilities of the 2D random variable&nbsp; $RB$&nbsp; according to&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|$\text{example 5}$]]&nbsp;, one obtains because of&nbsp; $P_{RB}(⋅) = P_R(⋅) · P_B(⋅)$:
 :$$\begin{align*}H(B \hspace{-0.1cm}\mid \hspace{-0.13cm} R)  \hspace{-0.15cm} & =  \hspace{-0.15cm} H(B) = {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit} \hspace{0.05cm},\\
@@ Line 114: / Line 114: @@
-==Transinformation zwischen zwei Zufallsgrößen==
+==Mutual information between two random variables==
 <br>
-Wir betrachten die Zufallsgröße&nbsp; $XY$&nbsp; mit der 2D–Wahrscheinlichkeitsfunktion&nbsp; $P_{XY}(X, Y)$.&nbsp; Bekannt seien auch die 1D–Funktionen&nbsp; $P_X(X)$&nbsp; und&nbsp; $P_Y(Y)$.
+We consider the random variable&nbsp; $XY$&nbsp; with the 2D probability function&nbsp; $P_{XY}(X, Y)$.&nbsp;Let the 1D functionsn&nbsp; $P_X(X)$&nbsp; and&nbsp; $P_Y(Y)$ also be known.
-Nun stellen sich folgende Fragen:
+Now the following questions arise:
-*Wie vermindert die Kenntnis der Zufallsgröße&nbsp; $Y$&nbsp; die Unsicherheit bezüglich&nbsp; $X$?
+*How does knowledge of the random variable&nbsp; $Y$&nbsp; reduce the uncertainty with respect to&nbsp; $X$?
-*Wie vermindert die Kenntnis der Zufallsgröße&nbsp; $X$&nbsp; die Unsicherheit bezüglich&nbsp; $Y$?
+*How does knowledge of the random variable&nbsp; $X$&nbsp; reduce the uncertainty with respect to&nbsp; $Y$?
-Zur Beantwortung benötigen wir eine für die Informationstheorie substantielle Definition:
+To answer this question, we need a definition that is substantial for information theory:
 {{BlaueBox|TEXT=
-$\text{Definition:}$&nbsp; Die&nbsp; '''Transinformation'''&nbsp; (englisch:&nbsp; ''Mutual Information'')&nbsp; zwischen den Zufallsgrößen&nbsp; $X$&nbsp; und&nbsp; $Y$ – beide über dem gleichen Alphabet – ist wie folgt gegeben:
+$\text{Definition:}$&nbsp; The&nbsp; '''mutual information''' between the random variables&nbsp; $X$&nbsp; and&nbsp; $Y$ –  both over the same alphabet – is given as follows:
 :$$I(X;\ Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.08cm} \frac{P_{XY}(X, Y)}
@@ Line 133: / Line 133: @@
 {P_{X}(x) \cdot P_{Y}(y) } \hspace{0.01cm}.$$
-Ein Vergleich mit dem&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Einf.C3.BChrungsbeispiel_zur_statistischen_Abh.C3.A4ngigkeit_von_Zufallsgr.C3.B6.C3.9Fen|letzten Kapitel]]&nbsp; zeigt, dass die Transinformation auch als&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Relative_Entropie_.E2.80.93_Kullback.E2.80.93Leibler.E2.80.93Distanz|Kullback–Leibler–Distanz]]&nbsp; zwischen der 2D–PMF&nbsp; $P_{XY}$&nbsp; und dem Produkt&nbsp; $P_X · P_Y$&nbsp; geschrieben werden kann:
+A comparison with the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Introductory_example_on_the_statistical_dependence_of_random_variables|last chapter]]&nbsp; shows that the transinformation can also be written as a&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Informational_Divergence_-_Kullback-Leibler_Distance|Kullback–Leibler distance]]&nbsp; between the 2D-PMF&nbsp; $P_{XY}$&nbsp; and the product&nbsp; $P_X · P_Y$&nbsp; :
 :$$I(X;Y) = D(P_{XY} \hspace{0.05cm}\vert \vert \hspace{0.05cm} P_X \cdot P_Y) \hspace{0.05cm}.$$
-Es ist somit offensichtlich, dass stets&nbsp; $I(X;\ Y) ≥ 0$&nbsp; gilt.&nbsp; Wegen der Symmetrie ist auch&nbsp; $I(Y;\ X)$ = $I(X;\ Y)$.}}
+It is thus obvious that&nbsp; $I(X;\ Y) ≥ 0$&nbsp; always holds.&nbsp; Because of the symmetry, &nbsp; $I(Y;\ X)$ = $I(X;\ Y)$ is also true.}}
-Sucht man in einem Wörterbuch die Übersetzung für „mutual”, so findet man unter Anderem die Begriffe „gemeinsam”, „gegenseitig”, „beidseitig” und „wechselseitig”.&nbsp; Und ebenso sind in Fachbüchern für&nbsp; $I(X; Y)$&nbsp; auch die Bezeichnungen&nbsp; ''gemeinsame Entropie''&nbsp; und&nbsp; ''gegenseitige Entropie''&nbsp; üblich.&nbsp; Wir sprechen aber im Folgenden durchgängig von der&nbsp; ''Transinformation''&nbsp; $I(X; Y)$&nbsp; und versuchen nun eine Interpretation dieser Größe:
+*By splitting the&nbsp; $\log_2$ argument according to
-*Durch Aufspalten des&nbsp; $\log_2$–Arguments entsprechend
 :$$I(X;Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm} \frac{1}
@@ Line 147: / Line 146: @@
 {P_{Y}(Y) }{P_{XY}(X, Y)} \right ] $$
-:erhält man unter Verwendung von&nbsp; $P_{X|Y}(\cdot) = P_{XY}(\cdot)/P_Y(Y)$:
+:is obtained using&nbsp; $P_{X|Y}(\cdot) = P_{XY}(\cdot)/P_Y(Y)$:
 :$$I(X;Y) = H(X) - H(X \hspace{-0.1cm}\mid \hspace{-0.1cm} Y) \hspace{0.05cm}.$$
-*Das heißt: &nbsp; Die Unsicherheit hinsichtlich der Zufallsgröße&nbsp; $X$  &nbsp; ⇒  &nbsp;  Entropie&nbsp; $H(X)$&nbsp; vermindert sich bei Kenntnis von&nbsp; $Y$&nbsp; um den Betrag&nbsp; $H(X|Y)$.&nbsp; Der Rest ist die Transinformation&nbsp; $I(X; Y)$.
+*This means: &nbsp; The uncertainty regarding the random quantity&nbsp; $X$  &nbsp; ⇒  &nbsp;  entropy&nbsp; $H(X)$&nbsp; decreases by the amount&nbsp; $H(X|Y)$&nbsp; when&nbsp; $Y$ is known.&nbsp; The remainder is the mutual information&nbsp; $I(X; Y)$.
-*Bei anderer Aufspaltung kommt man zum Ergebnis
+*With a different splitting, one arrives at the result
 :$$I(X;Y) = H(Y) - H(Y \hspace{-0.1cm}\mid \hspace{-0.1cm} X) \hspace{0.05cm}.$$
-*Ergo: &nbsp; Die Transinformation&nbsp; $I(X; Y)$&nbsp; ist symmetrisch  &nbsp; ⇒ &nbsp;  $X$&nbsp; sagt genau so viel über&nbsp; $Y$&nbsp; aus wie&nbsp; $Y$&nbsp; über&nbsp; $X$  &nbsp; ⇒ &nbsp; gegenseitige Information. Das Semikolon weist auf die Gleichberechtigung hin.
+*Ergo: &nbsp; The mutual information&nbsp; $I(X; Y)$&nbsp; is symmetrical  &nbsp; ⇒ &nbsp;  $X$&nbsp; says just as much about&nbsp; $Y$&nbsp; as&nbsp; $Y$&nbsp; says about&nbsp; $X$  &nbsp; ⇒ &nbsp; mutual information. The semicolon indicates equality.
 {{BlaueBox|TEXT=
-$\text{Fazit:}$&nbsp;
+$\text{Conclusion:}$&nbsp;
-Oft werden die hier genannten Gleichungen durch ein Schaubild verdeutlicht, so auch in den folgenden Beispielen.&nbsp; Daraus erkennt man, dass auch folgende Gleichungen zutreffen:
+Often the equations mentioned here are clarified by a diagram, as in the following examples.&nbsp; From this you can see that the following equations also apply:
 :$$I(X;\ Y) = H(X) + H(Y) - H(XY) \hspace{0.05cm},$$
@@ Line 168: / Line 166: @@
 {{GraueBox|TEXT=
-$\text{Beispiel 3:}$&nbsp; Wir kommen (letztmalig) auf das&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Einf.C3.BChrungsbeispiel_zur_statistischen_Abh.C3.A4ngigkeit_von_Zufallsgr.C3.B6.C3.9Fen|Würfel–Experiment]]&nbsp; mit dem roten&nbsp; $(R)$&nbsp; und dem blauen&nbsp; $(B)$&nbsp; Würfel zurück.&nbsp; Die Zufallsgröße&nbsp; $S$&nbsp; gibt die Summe der beiden Würfel an:&nbsp; $S = R + B$.&nbsp;
+$\text{Example 3:}$&nbsp; We return (for the last time) to the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Introductory_example_on_the_statistical_dependence_of_random_variables|dice experiment]]&nbsp; with the red&nbsp; $(R)$&nbsp; and blue&nbsp; $(B)$&nbsp; dice.&nbsp;  The random variable&nbsp; $S$&nbsp; gives the sum of the two dice:&nbsp; $S = R + B$.&nbsp;
-Wir betrachten hier die 2D–Zufallsgröße&nbsp; $RS$.&nbsp; In früheren Beispielen haben wir berechnet:
+Here we consider the 2D random variable&nbsp; $RS$.&nbsp; In earlier examples we calculated
-*die Entropien&nbsp; $H(R) = 2.585 \ \rm  bit$&nbsp; und&nbsp; $H(S) = 3.274 \ \rm bit$ &nbsp; ⇒  &nbsp;[[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Verbundwahrscheinlichkeit_und_Verbundentropie|Beispiel 6]]&nbsp; im letzten Kapitel,
+*the entropies&nbsp; $H(R) = 2.585 \ \rm  bit$&nbsp; and&nbsp; $H(S) = 3.274 \ \rm bit$ &nbsp; ⇒  &nbsp;[[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|example 6]]&nbsp; in the last chapter,
-*die Verbundentropie&nbsp; $H(RS) = 5.170 \ \rm bit$  &nbsp; ⇒  &nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Verbundwahrscheinlichkeit_und_Verbundentropie|Beispiel 6]]&nbsp; im letzten Kapitel,
+*the join entropies&nbsp; $H(RS) = 5.170 \ \rm bit$  &nbsp; ⇒  &nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|example 6]]&nbsp; in the last chapter,
-*die bedingten Entropien&nbsp; $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R) = 2.585 \ \rm bit$&nbsp; und&nbsp; $H(R \hspace{0.05cm} \vert \hspace{0.05cm}  S) = 1.896 \ \rm bit$  &nbsp; ⇒  &nbsp;  [[Information_Theory/Verschiedene_Entropien_zweidimensionaler_Zufallsgrößen#Bedingte_Wahrscheinlichkeit_und_bedingte_Entropie|Beispiel 2]]&nbsp; im vorherigen Abschnitt.
+*die conditional entropies&nbsp; $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R) = 2.585 \ \rm bit$&nbsp; and&nbsp; $H(R \hspace{0.05cm} \vert \hspace{0.05cm}  S) = 1.896 \ \rm bit$  &nbsp; ⇒  &nbsp;  [[Information_Theory/Verschiedene_Entropien_zweidimensionaler_Zufallsgrößen#Conditional_probability_and_conditional_entropy|example 2]]&nbsp; in the previous section.
-[[File:P_ID2765__Inf_T_3_2_S3_neu.png|frame|Schaubild aller Entropien des „Würfelexperiments” ]]
+[[File:P_ID2765__Inf_T_3_2_S3_neu.png|frame|Diagram of all entropies of the „dice experiment” ]]
 Diese Größen sind in der Grafik zusammengestellt, wobei die Zufallsgröße&nbsp; $R$&nbsp; durch die Grundfarbe „Rot” und die Summe&nbsp; $S$&nbsp; durch die Grundfarbe „Grün” markiert sind.&nbsp; Bedingte Entropien sind schraffiert.