Information Theory/Different Entropy Measures of Two-Dimensional Random Variables

{{Header
|Untermenü=Mutual Information Between Two Discrete Random Variables
|Vorherige Seite=Einige Vorbemerkungen zu zweidimensionalen Zufallsgrößen
|Nächste Seite=Anwendung auf die Digitalsignalübertragung
}}

==Definition of entropy using supp(<i>P<sub>XY</sub></i>)==
<br>
We briefly summarise the results of the last chapter again, assuming the two-dimensional random variable&nbsp; $XY$&nbsp; with the probability mass function&nbsp; $P_{XY}(X,\ Y)$.&nbsp; At the same time we use the notation

:$${\rm supp} (P_{XY}) = \big \{ \hspace{0.05cm}(x,\ y) \in XY \hspace{0.05cm},
\hspace{0.3cm} {\rm where} \hspace{0.15cm} P_{XY}(X,\ Y) \ne 0 \hspace{0.05cm} \big \} \hspace{0.05cm}.$$

{{BlaueBox|TEXT=
$\text{Summarising the last chapter:}$&nbsp; With this subset&nbsp; $\text{supp}(P_{XY}) ⊂ P_{XY}$,&nbsp; the following holds for
*the&nbsp; '''joint entropy''':

:$$H(XY) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{XY}(X, Y)}\right ] =\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.05cm})}
  \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{XY}(x, y)} \hspace{0.05cm}.$$

*the&nbsp; '''entropies of the one-dimensional random variables'''&nbsp; $X$&nbsp; and&nbsp; $Y$:

:$$H(X) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{X}(X)}\right ] =\hspace{-0.2cm} \sum_{x \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{X})}
  \hspace{-0.2cm} P_{X}(x) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{X}(x)} \hspace{0.05cm},$$

:$$H(Y) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{Y}(Y)}\right ] =\hspace{-0.2cm} \sum_{y \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{Y})}
  \hspace{-0.2cm} P_{Y}(y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{Y}(y)} \hspace{0.05cm}.$$
}}
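These definitions translate directly into a few lines of code.&nbsp; The following is a minimal numerical sketch (an illustration only:&nbsp; storing the PMFs as Python dictionaries and the helper names&nbsp; <code>entropy</code>&nbsp; and&nbsp; <code>marginal</code>&nbsp; are assumptions of this sketch, not part of the article):

<pre>
from math import log2

def entropy(pmf):
    """Entropy in bit; the sum runs only over supp(pmf), i.e. entries with p > 0."""
    return sum(p * log2(1.0 / p) for p in pmf.values() if p > 0)

def marginal(joint, axis):
    """Marginal PMF of component 'axis' (0 or 1) of a two-dimensional joint PMF."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

# Dice experiment: red die R and blue die B, independent and fair.
P_RB = {(r, b): 1.0 / 36 for r in range(1, 7) for b in range(1, 7)}

print(entropy(P_RB))               # H(RB) = log2(36) ≈ 5.170 bit
print(entropy(marginal(P_RB, 0)))  # H(R)  = log2(6)  ≈ 2.585 bit
print(entropy(marginal(P_RB, 1)))  # H(B)  = log2(6)  ≈ 2.585 bit
</pre>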
  
  
{{GraueBox|TEXT=
$\text{Example 1:}$&nbsp; We refer again to the examples on the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|joint probability and joint entropy]]&nbsp; in the last chapter.

For the two-dimensional probability mass function&nbsp; $P_{RB}(R, B)$&nbsp; in&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|$\text{Example 5}$]]&nbsp; with the parameters
*$R$ &nbsp; &rArr; &nbsp;  points of the red die,
*$B$ &nbsp; &rArr; &nbsp;  points of the blue die,

the sets&nbsp; $P_{RB}$&nbsp; and&nbsp; $\text{supp}(P_{RB})$&nbsp; are identical.&nbsp; Here, all&nbsp; $6^2 = 36$&nbsp; squares are occupied by non-zero values.

For the two-dimensional probability mass function&nbsp; $P_{RS}(R, S)$&nbsp; in&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|$\text{Example 6}$]]&nbsp; with the parameters
*$R$ &nbsp; &rArr; &nbsp;  points of the red die,
*$S = R + B$ &nbsp; &rArr; &nbsp; sum of both dice,

there are&nbsp; $6 · 11 = 66$&nbsp; squares, many of which, however, are empty, i.e. stand for the probability&nbsp; "0".
*The subset&nbsp; $\text{supp}(P_{RS})$,&nbsp; on the other hand, contains only the&nbsp; $36$&nbsp; shaded squares with non-zero probabilities.
*The entropy remains the same no matter whether one averages over all elements of&nbsp; $P_{RS}$&nbsp; or only over the elements of&nbsp; $\text{supp}(P_{RS})$,&nbsp; since for&nbsp; $x → 0$&nbsp; the limit is&nbsp; $x · \log_2 ({1}/{x}) = 0$.}}
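A short sketch of this point (the construction of&nbsp; $P_{RS}$&nbsp; as a Python dictionary is our own choice):&nbsp; of the&nbsp; $66$&nbsp; cells only&nbsp; $36$&nbsp; are non-zero, and the entropy is the same whether the zero cells are skipped or treated via the limit&nbsp; $x · \log_2(1/x) → 0$.

<pre>
from math import log2

# Build P_RS on the full 6x11 grid, with zeros stored explicitly.
P_RS = {(r, s): 0.0 for r in range(1, 7) for s in range(2, 13)}
for r in range(1, 7):
    for b in range(1, 7):
        P_RS[(r, r + b)] += 1.0 / 36

supp = {key: p for key, p in P_RS.items() if p > 0}
print(len(P_RS), len(supp))        # 66 cells in total, 36 of them in supp(P_RS)

def plogp(p):
    """Contribution p * log2(1/p), with the limit value 0 for p = 0."""
    return p * log2(1.0 / p) if p > 0 else 0.0

print(sum(plogp(p) for p in P_RS.values()))   # averaging over all 66 cells
print(sum(plogp(p) for p in supp.values()))   # averaging over supp only: same value, log2(36)
</pre>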
  
==Conditional probability and conditional entropy ==
<br>
In the book&nbsp; "Theory of Stochastic Signals"&nbsp; the following&nbsp; [[Theory_of_Stochastic_Signals/Statistical_Dependence_and_Independence#Conditional_Probability|conditional probabilities]]&nbsp; were given for the case of two events&nbsp; $X$&nbsp; and&nbsp; $Y$ &nbsp; ⇒ &nbsp; '''Bayes' theorem''':

:$${\rm Pr} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y)  = \frac{{\rm Pr} (X \cap  Y)}{{\rm Pr} (Y)} \hspace{0.05cm}, \hspace{0.5cm}
{\rm Pr} (Y \hspace{-0.05cm}\mid \hspace{-0.05cm} X)  = \frac{{\rm Pr} (X \cap  Y)}{{\rm Pr} (X)} \hspace{0.05cm}.$$

Applied to probability mass functions, one thus obtains:

:$$P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y)  = \frac{P_{XY}(X, Y)}{P_{Y}(Y)} \hspace{0.05cm}, \hspace{0.5cm}
P_{\hspace{0.03cm}Y \mid \hspace{0.03cm} X} (Y \hspace{-0.05cm}\mid \hspace{-0.05cm} X)  =  \frac{P_{XY}(X, Y)}{P_{X}(X)} \hspace{0.05cm}.$$

Analogous to the&nbsp; [[Information_Theory/Verschiedene_Entropien_zweidimensionaler_Zufallsgrößen#Definition_of_entropy_using_supp.28PXY.29|joint entropy]]&nbsp; $H(XY)$,&nbsp; the following entropy functions can be derived here:

{{BlaueBox|TEXT=
$\text{Definitions:}$&nbsp;
*The&nbsp; '''conditional entropy'''&nbsp; of the random variable&nbsp; $X$&nbsp; under condition&nbsp; $Y$&nbsp; is:

:$$H(X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm}\frac{1}{P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y)}\right ] = \hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})}
  \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (x \hspace{-0.05cm}\mid \hspace{-0.05cm} y)}=\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})}
  \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{P_{Y}(y)}{P_{XY}(x, y)}
  \hspace{0.05cm}.$$

*Similarly, for the&nbsp; '''second conditional entropy'''&nbsp; we obtain:

:$$H(Y \hspace{-0.1cm}\mid \hspace{-0.05cm} X) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm}\frac{1}{P_{\hspace{0.03cm}Y\hspace{0.03cm} \mid \hspace{0.01cm} X} (Y \hspace{-0.08cm}\mid \hspace{-0.05cm}X)}\right ] =\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})}
  \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{\hspace{0.03cm}Y\hspace{-0.03cm} \mid \hspace{-0.01cm} X} (y \hspace{-0.05cm}\mid \hspace{-0.05cm} x)}=\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})}
  \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{P_{X}(x)}{P_{XY}(x, y)}
  \hspace{0.05cm}.$$}}

In the argument of the logarithm function there is always a conditional probability mass function &nbsp; ⇒ &nbsp; $P_{X\hspace{0.03cm}\vert \hspace{0.03cm}Y}(·)$&nbsp; or&nbsp; $P_{Y\hspace{0.03cm}\vert\hspace{0.03cm}X}(·)$,&nbsp; respectively, while the joint probability &nbsp; ⇒ &nbsp; $P_{XY}(·)$&nbsp; is needed to form the expected value.

For the conditional entropies, there are the following bounds:
*Both&nbsp; $H(X \vert Y)$&nbsp; and&nbsp; $H(Y \vert X)$&nbsp; are always greater than or equal to zero.&nbsp; From&nbsp; $H(X \vert Y) = 0$&nbsp; it follows directly that&nbsp; $H(Y \vert X) = 0$.&nbsp; <br>Both are only possible for&nbsp; [[Theory_of_Stochastic_Signals/Mengentheoretische_Grundlagen#Disjunkte_Mengen|"disjoint sets"]]&nbsp; $X$&nbsp; and&nbsp; $Y$.
*$H(X \vert Y) ≤ H(X)$&nbsp; and&nbsp; $H(Y \vert X) ≤ H(Y)$&nbsp; always apply.&nbsp; These statements are plausible if one realises that&nbsp; "uncertainty"&nbsp; can be used synonymously with&nbsp; "entropy":&nbsp; the uncertainty with respect to the quantity&nbsp; $X$&nbsp; cannot be increased by knowing&nbsp; $Y$.
*Except in the case of statistical independence &nbsp; ⇒ &nbsp; $H(X \vert Y) = H(X)$, &nbsp; $H(X \vert Y) < H(X)$&nbsp; always holds.&nbsp; Because of&nbsp; $H(X) ≤ H(XY)$&nbsp; and&nbsp; $H(Y) ≤ H(XY)$,&nbsp; also&nbsp; $H(X \vert Y) ≤ H(XY)$&nbsp; and&nbsp; $H(Y \vert X) ≤ H(XY)$&nbsp; hold.&nbsp; Thus, '''a conditional entropy can never become larger than the joint entropy'''.
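The two conditional entropies can be evaluated with the last form of the definitions, i.e. via&nbsp; $P_Y(y)/P_{XY}(x,y)$&nbsp; and&nbsp; $P_X(x)/P_{XY}(x,y)$.&nbsp; A minimal sketch (the helper names and the small test PMF are our own), which also checks the bounds&nbsp; $H(X \vert Y) ≤ H(X)$&nbsp; and&nbsp; $H(Y \vert X) ≤ H(Y)$&nbsp; numerically:

<pre>
from math import log2

def marginal(joint, axis):
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def entropy(pmf):
    return sum(p * log2(1.0 / p) for p in pmf.values() if p > 0)

def cond_entropy(joint, given_axis):
    """H(X|Y) for given_axis=1, H(Y|X) for given_axis=0, via P_XY*log2(P_given/P_XY) over supp."""
    marg = marginal(joint, given_axis)
    return sum(p * log2(marg[key[given_axis]] / p) for key, p in joint.items() if p > 0)

# A small dependent example: P_XY(x, y) with X, Y in {0, 1}.
P_XY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

H_X, H_Y = entropy(marginal(P_XY, 0)), entropy(marginal(P_XY, 1))
print(cond_entropy(P_XY, 1), "<=", H_X)   # H(X|Y) <= H(X)
print(cond_entropy(P_XY, 0), "<=", H_Y)   # H(Y|X) <= H(Y)
</pre>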
  
{{GraueBox|TEXT=
$\text{Example 2:}$&nbsp; We consider the joint probabilities&nbsp; $P_{RS}(·)$&nbsp; of our dice experiment, which were determined in the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Conditional_probability_and_conditional_entropy|last chapter]]&nbsp; as&nbsp; $\text{Example 6}$.&nbsp; The corresponding&nbsp; $P_{RS}(·)$&nbsp; is given again in the middle of the following graph.

[[File:P_ID2764__Inf_T_3_2_S3.png|right|frame|Joint probabilities&nbsp; $P_{RS}$&nbsp; and conditional probabilities&nbsp;  $P_{S \vert R}$&nbsp; and&nbsp; $P_{R \vert S}$]]

The two conditional probability functions are drawn on the outside:

$\text{On the left}$&nbsp; you see the conditional probability mass function
:$$P_{S \vert R}(⋅) = P_{SR}(⋅)/P_R(⋅).$$
*Because of&nbsp; $P_R(R) = \big [1/6, \ 1/6, \ 1/6, \ 1/6, \ 1/6, \ 1/6 \big ]$,&nbsp; the probability&nbsp; $1/6$&nbsp; is in all shaded fields.
*That means: &nbsp; $\text{supp}(P_{S\vert R}) = \text{supp}(P_{R\vert S})$.
*From this follows for the conditional entropy:

:$$H(S \hspace{-0.1cm}\mid \hspace{-0.13cm} R) = \hspace{-0.2cm} \sum_{(r, s) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{RS})}
  \hspace{-0.6cm} P_{RS}(r, s) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{\hspace{0.03cm}S \hspace{0.03cm} \mid \hspace{0.03cm} R} (s \hspace{-0.05cm}\mid \hspace{-0.05cm} r)} $$
:$$\Rightarrow \hspace{0.3cm}H(S \hspace{-0.1cm}\mid \hspace{-0.13cm} R) =
36 \cdot \frac{1}{36} \cdot {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit}
\hspace{0.05cm}.$$

$\text{On the right}$,&nbsp; $P_{R\vert S}(⋅) = P_{RS}(⋅)/P_S(⋅)$&nbsp; is given with&nbsp; $P_S(⋅)$&nbsp; according to&nbsp; $\text{Example 6}$.
*$\text{supp}(P_{R\vert S}) = \text{supp}(P_{S\vert R})$ &nbsp; ⇒ &nbsp; the same non-zero fields result.
*However, the probability values now increase continuously from the centre&nbsp; $(1/6)$&nbsp; towards the edges, up to&nbsp; $1$&nbsp; in the corners.
*It follows that:

:$$H(R \hspace{-0.1cm}\mid \hspace{-0.13cm} S)  = \frac{6}{36} \cdot {\rm log}_2 \hspace{0.1cm} (6) +
\frac{2}{36} \cdot  \sum_{i=1}^5 \big [ i \cdot {\rm log}_2 \hspace{0.1cm} (i) \big ]= 1.896\ {\rm bit} \hspace{0.05cm}.$$

On the other hand, for the conditional probabilities of the two-dimensional random variable&nbsp; $RB$&nbsp; according to&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|$\text{Example 5}$]],&nbsp; one obtains, because of&nbsp; $P_{RB}(⋅) = P_R(⋅) · P_B(⋅)$:

:$$\begin{align*}H(B \hspace{-0.1cm}\mid \hspace{-0.13cm} R)  \hspace{-0.15cm} & =  \hspace{-0.15cm} H(B) = {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit} \hspace{0.05cm},\\
H(R \hspace{-0.1cm}\mid \hspace{-0.13cm} B)  \hspace{-0.15cm} & = \hspace{-0.15cm} H(R) = {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit} \hspace{0.05cm}.\end{align*}$$}}
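The two numerical values of this example can be reproduced with the same kind of sketch as above (again a dictionary-based illustration; nothing here is prescribed by the article):

<pre>
from math import log2

def marginal(joint, axis):
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def cond_entropy(joint, given_axis):
    """Conditional entropy (in bit) of the other component given component 'given_axis'."""
    marg = marginal(joint, given_axis)
    return sum(p * log2(marg[key[given_axis]] / p) for key, p in joint.items() if p > 0)

# Joint PMF P_RS of the red die R and the sum S = R + B.
P_RS = {}
for r in range(1, 7):
    for b in range(1, 7):
        P_RS[(r, r + b)] = P_RS.get((r, r + b), 0.0) + 1.0 / 36

print(cond_entropy(P_RS, 0))   # H(S|R) = log2(6) ≈ 2.585 bit
print(cond_entropy(P_RS, 1))   # H(R|S) ≈ 1.896 bit
</pre>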
  
 
 
 
 
  
==Mutual information between two random variables==
<br>
We consider the two-dimensional random variable&nbsp; $XY$&nbsp; with PMF&nbsp; $P_{XY}(X, Y)$.&nbsp; Let the one-dimensional functions&nbsp; $P_X(X)$&nbsp; and&nbsp; $P_Y(Y)$&nbsp; also be known.

Now the following questions arise:
*How does the knowledge of the random variable&nbsp; $Y$&nbsp; reduce the uncertainty with respect to&nbsp; $X$?
*How does the knowledge of the random variable&nbsp; $X$&nbsp; reduce the uncertainty with respect to&nbsp; $Y$?

To answer these questions, we need a definition that is substantial for information theory:

{{BlaueBox|TEXT=
$\text{Definition:}$&nbsp; The&nbsp; '''mutual information'''&nbsp; between the random variables&nbsp; $X$&nbsp; and&nbsp; $Y$ – both over the same alphabet – is given as follows:

:$$I(X;\ Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.08cm} \frac{P_{XY}(X, Y)}
{P_{X}(X) \cdot P_{Y}(Y) }\right ] =\hspace{-0.25cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY})}
  \hspace{-0.8cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.08cm} \frac{P_{XY}(x, y)}
{P_{X}(x) \cdot P_{Y}(y) } \hspace{0.01cm}.$$

A comparison with the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Introductory_example_on_the_statistical_dependence_of_random_variables|last chapter]]&nbsp; shows that the mutual information can also be written as a&nbsp; [[Information_Theory/Some_Preliminary_Remarks_on_Two-Dimensional_Random_Variables#Informational_divergence_-_Kullback-Leibler_distance|Kullback–Leibler distance]]&nbsp; between the two-dimensional probability mass function&nbsp; $P_{XY}$&nbsp; and the product&nbsp; $P_X · P_Y$:

:$$I(X;Y) = D(P_{XY} \hspace{0.05cm}\vert \vert \hspace{0.05cm} P_X \cdot P_Y) \hspace{0.05cm}.$$

It is thus obvious that&nbsp; $I(X;\ Y) ≥ 0$&nbsp; always holds.&nbsp; Because of the symmetry, &nbsp; $I(Y;\ X) = I(X;\ Y)$&nbsp; is also true.}}
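The Kullback–Leibler form can be coded directly:&nbsp; sum&nbsp; $P_{XY} · \log_2\big(P_{XY}/(P_X · P_Y)\big)$&nbsp; over&nbsp; $\text{supp}(P_{XY})$.&nbsp; A small sketch with two extreme test cases of our own choosing&nbsp; $($fully dependent&nbsp; $Y = X$,&nbsp; and independent&nbsp; $X$, $Y)$:

<pre>
from math import log2

def marginal(joint, axis):
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def mutual_information(joint):
    """I(X;Y) in bit as the Kullback-Leibler distance D(P_XY || P_X * P_Y)."""
    P_X, P_Y = marginal(joint, 0), marginal(joint, 1)
    return sum(p * log2(p / (P_X[x] * P_Y[y]))
               for (x, y), p in joint.items() if p > 0)

# Fully dependent case Y = X: I(X;Y) = H(X) = 1 bit.
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))                       # 1.0

# Independent case: I(X;Y) = 0 bit.
print(mutual_information({(x, y): 0.25 for x in (0, 1) for y in (0, 1)}))   # 0.0
</pre>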
  
  
By splitting the&nbsp; $\log_2$&nbsp; argument according to

:$$I(X;Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm} \frac{1}
{P_{X}(X)  }\right ] - {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm} \frac
{P_{Y}(Y) }{P_{XY}(X, Y)} \right ] $$

one obtains, using&nbsp; $P_{X \vert Y}(\cdot) = P_{XY}(\cdot)/P_Y(Y)$:

:$$I(X;Y) = H(X) - H(X \hspace{-0.1cm}\mid \hspace{-0.1cm} Y) \hspace{0.05cm}.$$

*This means: &nbsp; The uncertainty regarding the random variable&nbsp; $X$ &nbsp; ⇒ &nbsp; entropy&nbsp; $H(X)$&nbsp; decreases by the amount&nbsp; $H(X \vert Y)$&nbsp; when&nbsp; $Y$&nbsp; is known.&nbsp; The remainder is the mutual information&nbsp; $I(X; Y)$.
*With a different splitting, one arrives at the result
:$$I(X;Y) = H(Y) - H(Y \hspace{-0.1cm}\mid \hspace{-0.1cm} X) \hspace{0.05cm}.$$
*Ergo: &nbsp; The mutual information&nbsp; $I(X; Y)$&nbsp; is symmetrical &nbsp; ⇒ &nbsp; $X$&nbsp; says just as much about&nbsp; $Y$&nbsp; as&nbsp; $Y$&nbsp; says about&nbsp; $X$ &nbsp; ⇒ &nbsp; "mutual".&nbsp; The semicolon indicates the equal standing of the two random variables.

{{BlaueBox|TEXT=
$\text{Conclusion:}$&nbsp;
Often the equations mentioned here are clarified by a diagram, as in the following examples.&nbsp; <br>From this you can see that the following equations also apply:

:$$I(X;\ Y) = H(X) + H(Y) - H(XY) \hspace{0.05cm},$$
:$$I(X;\ Y) = H(XY) -
H(X \hspace{-0.1cm}\mid \hspace{-0.1cm} Y) - H(Y \hspace{-0.1cm}\mid \hspace{-0.1cm} X)
\hspace{0.05cm}.$$}}
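These identities are easy to check numerically.&nbsp; The following sketch (the joint PMF is an arbitrary test case of our own) evaluates&nbsp; $I(X;Y)$&nbsp; in four different ways and confirms that all results coincide:

<pre>
from math import log2

P_XY = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}   # any joint PMF

def marginal(joint, axis):
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def entropy(pmf):
    return sum(p * log2(1.0 / p) for p in pmf.values() if p > 0)

P_X, P_Y = marginal(P_XY, 0), marginal(P_XY, 1)
H_X, H_Y, H_XY = entropy(P_X), entropy(P_Y), entropy(P_XY)
H_X_given_Y = sum(p * log2(P_Y[y] / p) for (x, y), p in P_XY.items() if p > 0)
H_Y_given_X = sum(p * log2(P_X[x] / p) for (x, y), p in P_XY.items() if p > 0)

print(H_X + H_Y - H_XY)                    # I(X;Y) from H(X) + H(Y) - H(XY)
print(H_X - H_X_given_Y)                   # I(X;Y) = H(X) - H(X|Y)
print(H_Y - H_Y_given_X)                   # I(X;Y) = H(Y) - H(Y|X)
print(H_XY - H_X_given_Y - H_Y_given_X)    # I(X;Y) from the last equation
</pre>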
  
{{GraueBox|TEXT=
$\text{Example 3:}$&nbsp; We return&nbsp; (for the last time)&nbsp; to the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Introductory_example_on_the_statistical_dependence_of_random_variables|dice experiment]]&nbsp; with the red&nbsp; $(R)$&nbsp; and the blue&nbsp; $(B)$&nbsp; die.&nbsp; The random variable&nbsp; $S$&nbsp; gives the sum of the two dice: &nbsp; $S = R + B$.&nbsp; Here we consider the two-dimensional random variable&nbsp; $RS$.

In earlier examples we calculated
*the entropies&nbsp; $H(R) = 2.585 \ \rm  bit$&nbsp; and&nbsp; $H(S) = 3.274 \ \rm bit$ &nbsp; ⇒  &nbsp;[[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|Example 6]]&nbsp; in the last chapter,
*the joint entropy&nbsp; $H(RS) = 5.170 \ \rm bit$ &nbsp; ⇒  &nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|Example 6]]&nbsp; in the last chapter,
*the conditional entropies&nbsp; $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R) = 2.585 \ \rm bit$&nbsp; and&nbsp; $H(R \hspace{0.05cm} \vert \hspace{0.05cm} S) = 1.896 \ \rm bit$ &nbsp; ⇒  &nbsp;  [[Information_Theory/Verschiedene_Entropien_zweidimensionaler_Zufallsgrößen#Conditional_probability_and_conditional_entropy|Example 2]]&nbsp; in the previous section.

[[File:P_ID2765__Inf_T_3_2_S3_neu.png|frame|Diagram of all entropies of the "dice experiment"]]

<br>These quantities are compiled in the graph, where the random variable&nbsp; $R$&nbsp; is marked by the basic colour&nbsp; "red"&nbsp; and the sum&nbsp; $S$&nbsp; by the basic colour&nbsp; "green".&nbsp; Conditional entropies are shaded.
One can see from this representation:
*The entropy&nbsp; $H(R) = \log_2 (6) = 2.585\ \rm bit$&nbsp; is exactly half as large as the joint entropy&nbsp; $H(RS)$.&nbsp; Because:&nbsp; If one knows&nbsp; $R$,&nbsp; then&nbsp; $S$&nbsp; provides exactly the same information as the random variable&nbsp; $B$,&nbsp; namely&nbsp; $H(S \hspace{0.05cm} \vert \hspace{0.05cm}  R) = H(B) = \log_2 (6) = 2.585\ \rm bit$.
*'''Note''': &nbsp; $H(R) = H(S \hspace{0.05cm} \vert \hspace{0.05cm}  R)$&nbsp; '''only applies in this example, not in general'''.
*As expected, here the entropy&nbsp; $H(S) = 3.274 \ \rm bit$&nbsp; is greater than&nbsp; $H(R)= 2.585\ \rm bit$.&nbsp; Because of&nbsp; $H(S) + H(R \hspace{0.05cm} \vert \hspace{0.05cm}  S) = H(R) + H(S \hspace{0.05cm} \vert \hspace{0.05cm}  R)$,&nbsp; $H(R \hspace{0.05cm} \vert \hspace{0.05cm}  S)$&nbsp; must be smaller than&nbsp; $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R)$&nbsp; by the same amount&nbsp; $I(R;\ S) = 0.689 \ \rm bit$&nbsp; by which&nbsp; $H(R)$&nbsp; is smaller than&nbsp; $H(S)$.
*The mutual information between the random variables&nbsp; $R$&nbsp; and&nbsp; $S$&nbsp; also results from the equation
:$$I(R;\ S) = H(R) + H(S) - H(RS) =  2.585\ {\rm bit} + 3.274\ {\rm bit} - 5.170\ {\rm bit} = 0.689\ {\rm bit} \hspace{0.05cm}. $$}}
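All quantities in the diagram can be reproduced with a few lines (a sketch under the same dictionary-PMF assumption as above):

<pre>
from math import log2

def entropy(pmf):
    return sum(p * log2(1.0 / p) for p in pmf.values() if p > 0)

def marginal(joint, axis):
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

# Joint PMF of R (red die) and S = R + B.
P_RS = {}
for r in range(1, 7):
    for b in range(1, 7):
        P_RS[(r, r + b)] = P_RS.get((r, r + b), 0.0) + 1.0 / 36

H_R  = entropy(marginal(P_RS, 0))   # 2.585 bit
H_S  = entropy(marginal(P_RS, 1))   # 3.274 bit
H_RS = entropy(P_RS)                # 5.170 bit
I_RS = H_R + H_S - H_RS             # 0.689 bit
print(H_R, H_S, H_RS, I_RS)
print(H_RS - H_R, H_RS - H_S)       # H(S|R) = 2.585 bit, H(R|S) = 1.896 bit
</pre>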
  
==Conditional mutual information  ==
<br>
We now consider three random variables&nbsp; $X$,&nbsp; $Y$&nbsp; and&nbsp; $Z$&nbsp; that can be related to each other.

{{BlaueBox|TEXT=
$\text{Definition:}$&nbsp; The&nbsp; '''conditional mutual information'''&nbsp; between the random variables&nbsp; $X$&nbsp; and&nbsp; $Y$&nbsp; '''for a given'''&nbsp; $Z = z$&nbsp; is as follows:

:$$I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z = z) =  H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z = z) - H(X\vert\hspace{0.05cm}Y ,\hspace{0.05cm} Z = z) \hspace{0.05cm}.$$

In contrast, the&nbsp; '''conditional mutual information'''&nbsp; between the random variables&nbsp; $X$&nbsp; and&nbsp; $Y$&nbsp; '''for the random variable'''&nbsp; $Z$&nbsp; is obtained <br>after averaging over all&nbsp; $z \in Z$:

:$$I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z ) =  H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z ) - H(X\vert\hspace{0.05cm}Y  Z )= \hspace{-0.3cm}
\sum_{z \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{Z})} \hspace{-0.25cm} P_{Z}(z) \cdot
I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z = z)
\hspace{0.05cm}.$$

$P_Z(Z)$&nbsp; is the probability mass function&nbsp; $\rm  (PMF)$&nbsp; of the random variable&nbsp; $Z$&nbsp; and&nbsp; $P_Z(z)$&nbsp; is the&nbsp; '''probability'''&nbsp; of the realisation&nbsp; $Z = z$.}}


{{BlaueBox|TEXT=
$\text{Please note:}$&nbsp;
*For the conditional entropy, as is well known, the relation &nbsp; $H(X\hspace{0.05cm}\vert\hspace{0.05cm}Z) ≤ H(X)$&nbsp; holds.
*For the mutual information, this relation does not necessarily hold: <br> &nbsp; &nbsp; $I(X; Y\hspace{0.05cm}\vert\hspace{0.05cm}Z)$&nbsp; can be&nbsp; '''smaller than, equal to, or even larger than'''&nbsp; $I(X; Y)$.}}
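The averaging definition can be implemented directly from a three-dimensional PMF&nbsp; $P_{XYZ}$:&nbsp; for every&nbsp; $z$&nbsp; in&nbsp; $\text{supp}(P_Z)$,&nbsp; compute&nbsp; $I(X;Y \vert Z=z)$&nbsp; from the conditional joint PMF and weight it with&nbsp; $P_Z(z)$.&nbsp; A minimal sketch (all names and the check case are our own):

<pre>
from math import log2

def marginal(joint, axes):
    """Marginal PMF over the given tuple of axes of a multi-dimensional PMF."""
    out = {}
    for key, p in joint.items():
        k = tuple(key[a] for a in axes)
        out[k] = out.get(k, 0.0) + p
    return out

def mutual_information(joint):
    """I between the two components of a 2D PMF {(x, y): p}."""
    P_X, P_Y = marginal(joint, (0,)), marginal(joint, (1,))
    return sum(p * log2(p / (P_X[(x,)] * P_Y[(y,)]))
               for (x, y), p in joint.items() if p > 0)

def cond_mutual_information(P_XYZ):
    """I(X;Y|Z) = sum over supp(P_Z) of P_Z(z) * I(X;Y|Z=z) for a PMF {(x, y, z): p}."""
    P_Z = marginal(P_XYZ, (2,))
    result = 0.0
    for (z,), pz in P_Z.items():
        if pz == 0:
            continue
        # Conditional joint PMF of (X, Y) given Z = z.
        P_XY_given_z = {(x, y): p / pz for (x, y, zz), p in P_XYZ.items() if zz == z}
        result += pz * mutual_information(P_XY_given_z)
    return result

# Check: with Y = X and Z an independent fair coin, I(X;Y|Z) = H(X) = 1 bit.
P_XYZ = {(x, x, z): 0.25 for x in (0, 1) for z in (0, 1)}
print(cond_mutual_information(P_XYZ))   # 1.0
</pre>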
 
  
[[File:P_ID2824__Inf_T_3_2_S4a.png|right|frame|2D PMF&nbsp; $P_{XZ}$ ]]
{{GraueBox|TEXT=
$\text{Example 4:}$&nbsp;
We consider the binary random variables&nbsp; $X$,&nbsp; $Y$&nbsp; and&nbsp; $Z$&nbsp; with the following properties:
* Let&nbsp; $X$&nbsp; and&nbsp; $Y$&nbsp; be statistically independent, with the probability mass functions
:$$P_X(X) = \big [1/2, \ 1/2 \big],  \hspace{0.2cm} P_Y(Y) = \big[1-p, \ p \big] \  ⇒  \  H(X) = 1\ {\rm bit},  \hspace{0.2cm}  H(Y) = H_{\rm bin}(p).$$
* $Z$&nbsp; is the modulo-2 sum of&nbsp; $X$&nbsp; and&nbsp; $Y$: &nbsp;  $Z = X ⊕ Y$.

From the joint probability mass function&nbsp; $P_{XZ}$&nbsp; according to the upper graph, it follows:
*Summing the column probabilities gives&nbsp; <br> &nbsp; &nbsp; $P_Z(Z) = \big [1/2, \  1/2 \big ]$ &nbsp;  ⇒ &nbsp; $H(Z) = 1\ {\rm bit}.$
* $X$&nbsp; and&nbsp; $Z$&nbsp; are also statistically independent, since for the 2D PMF&nbsp; <br> &nbsp; &nbsp; $P_{XZ}(X, Z) = P_X(X) · P_Z(Z)$&nbsp; holds.
[[File:P_ID2826__Inf_T_3_2_S4b.png|right|frame|Conditional  2D PMF $P_{X\hspace{0.05cm}\vert\hspace{0.05cm}YZ}$]]
*It follows that: <br> &nbsp; &nbsp; $H(Z\hspace{0.05cm}\vert\hspace{0.05cm}  X) = H(Z),\hspace{0.5cm}H(X \hspace{0.05cm}\vert\hspace{0.05cm}  Z) = H(X),\hspace{0.5cm} I(X; Z) = 0.$
<br>From the conditional probability mass function&nbsp; $P_{X\vert YZ}$&nbsp; according to the graph below, we can calculate:
* $H(X\hspace{0.05cm}\vert\hspace{0.05cm} YZ) = 0$,&nbsp; since all&nbsp; $P_{X\hspace{0.05cm}\vert\hspace{0.05cm} YZ}$&nbsp; entries are&nbsp; $0$&nbsp; or&nbsp; $1$  &nbsp;  ⇒ &nbsp;  "conditional entropy",
* $I(X; YZ) = H(X) - H(X\hspace{0.05cm}\vert\hspace{0.05cm} YZ) = H(X)= 1 \ {\rm bit}$ &nbsp;  ⇒ &nbsp;   "mutual information",
* $I(X; Y\vert Z) = H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z) =H(X)=1 \ {\rm bit} $ &nbsp;  ⇒ &nbsp;  "conditional mutual information".

In the present example:
'''The conditional mutual information'''&nbsp; $I(X; Y\hspace{0.05cm}\vert\hspace{0.05cm} Z) = 1\ {\rm bit}$&nbsp; '''is greater than the conventional mutual information''' &nbsp;$I(X; Y) = 0$. }}
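The statements of this example can be verified numerically;&nbsp; the following sketch uses the concrete choice&nbsp; $p = 1/2$&nbsp; (our assumption for this illustration) and computes every mutual information as&nbsp; $H(A) + H(B) - H(AB)$:

<pre>
from math import log2

# Example 4 with the concrete choice p = 1/2 (assumption of this sketch).
p = 0.5
P_XYZ = {}
for x, px in ((0, 0.5), (1, 0.5)):          # X uniform
    for y, py in ((0, 1 - p), (1, p)):      # Y independent of X
        P_XYZ[(x, y, x ^ y)] = px * py      # Z = X xor Y

def marginal(joint, axes):
    out = {}
    for key, prob in joint.items():
        k = tuple(key[a] for a in axes)
        out[k] = out.get(k, 0.0) + prob
    return out

def entropy(pmf):
    return sum(q * log2(1.0 / q) for q in pmf.values() if q > 0)

def mi(joint, axes_a, axes_b):
    """I(A;B) between the component groups axes_a and axes_b of a multi-dimensional PMF."""
    H = lambda axes: entropy(marginal(joint, axes))
    return H(axes_a) + H(axes_b) - H(axes_a + axes_b)

print(mi(P_XYZ, (0,), (1,)))                              # I(X;Y)  = 0
print(mi(P_XYZ, (0,), (2,)))                              # I(X;Z)  = 0
print(mi(P_XYZ, (0,), (1, 2)))                            # I(X;YZ) = 1 bit
print(mi(P_XYZ, (0,), (1, 2)) - mi(P_XYZ, (0,), (2,)))    # I(X;Y|Z) = 1 bit > I(X;Y)
</pre>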
 
  
 
 
 
 
 
 
 
 
==Chain rule of the mutual information ==
<br>
So far we have only considered the mutual information between two one-dimensional random variables.&nbsp; Now we extend the definition to a total of&nbsp; $n + 1$&nbsp; random variables, which, purely for reasons of notation, we denote by&nbsp; $X_1$,&nbsp; ... ,&nbsp; $X_n$&nbsp; and&nbsp; $Z$.&nbsp; Then the following applies:

{{BlaueBox|TEXT=
$\text{Chain rule of mutual information:}$&nbsp;

The mutual information between the&nbsp; $n$–dimensional random variable&nbsp; $X_1 X_2  \hspace{0.05cm}\text{...} \hspace{0.05cm}  X_n$&nbsp; and the random variable&nbsp; $Z$&nbsp; can be represented and calculated as follows:

:$$I(X_1\hspace{0.05cm}X_2\hspace{0.05cm}\text{...} \hspace{0.1cm}X_n;Z) =
I(X_1;Z) + I(X_2;Z \vert X_1) + \hspace{0.05cm}\text{...} \hspace{0.1cm}+
I(X_n;Z\vert X_1\hspace{0.05cm}X_2\hspace{0.05cm}\text{...} \hspace{0.1cm}X_{n-1}) = \sum_{i = 1}^{n}
I(X_i;Z \vert X_1\hspace{0.05cm}X_2\hspace{0.05cm}\text{...} \hspace{0.1cm}X_{i-1})
\hspace{0.05cm}.$$

$\text{Proof:}$&nbsp;
We restrict ourselves here to the case&nbsp; $n = 2$, i.e. to a total of three random variables, and replace&nbsp; $X_1$&nbsp; by&nbsp; $X$&nbsp; and&nbsp; $X_2$&nbsp; by&nbsp; $Y$.&nbsp; Then we obtain:

:$$\begin{align*}I(X\hspace{0.05cm}Y;Z)  & = H(XY) - H(XY\hspace{0.05cm} \vert \hspace{0.05cm}Z) = \\
& =  \big [  H(X)+ H(Y\hspace{0.05cm} \vert \hspace{0.05cm} X)\big ]  - \big [ H(X\hspace{0.05cm} \vert \hspace{0.05cm} Z) + H(Y\hspace{0.05cm} \vert \hspace{0.05cm} XZ)\big ]  =\\
& =   \big [ H(X)- H(X\hspace{0.05cm} \vert \hspace{0.05cm} Z)\big ]  + \big [  H(Y\hspace{0.05cm} \vert \hspace{0.05cm} X) - H(Y\hspace{0.05cm} \vert \hspace{0.05cm}XZ)\big ]=\\
& =  I(X;Z) + I(Y;Z \hspace{0.05cm} \vert \hspace{0.05cm} X) \hspace{0.05cm}.\end{align*}$$}}


*From this equation one can see that the relation &nbsp;$I(X Y; Z) ≥ I(X; Z)$&nbsp; always holds.
*Equality results for the conditional mutual information&nbsp; $I(Y; Z \hspace{0.05cm} \vert  \hspace{0.05cm} X) = 0$,&nbsp; i.e. when the random variables&nbsp; $Y$&nbsp; and&nbsp; $Z$&nbsp; are statistically independent for a given&nbsp; $X$.
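The chain rule for&nbsp; $n = 2$&nbsp; can be checked numerically on any three-dimensional PMF.&nbsp; A sketch (the random test PMF and all helper names are our own):

<pre>
from math import log2
import random

def marginal(joint, axes):
    out = {}
    for key, prob in joint.items():
        k = tuple(key[a] for a in axes)
        out[k] = out.get(k, 0.0) + prob
    return out

def entropy(pmf):
    return sum(q * log2(1.0 / q) for q in pmf.values() if q > 0)

def mi(joint, a, b):
    """I(A;B) = H(A) + H(B) - H(AB) for component groups a and b."""
    return entropy(marginal(joint, a)) + entropy(marginal(joint, b)) - entropy(marginal(joint, a + b))

def cmi(joint, a, b, c):
    """I(A;B|C) = H(AC) + H(BC) - H(ABC) - H(C)."""
    return (entropy(marginal(joint, a + c)) + entropy(marginal(joint, b + c))
            - entropy(marginal(joint, a + b + c)) - entropy(marginal(joint, c)))

# Random joint PMF of (X1, X2, Z), each component binary.
random.seed(1)
keys = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
weights = [random.random() for _ in keys]
total = sum(weights)
P = {key: w / total for key, w in zip(keys, weights)}

lhs = mi(P, (0, 1), (2,))                            # I(X1 X2; Z)
rhs = mi(P, (0,), (2,)) + cmi(P, (1,), (2,), (0,))   # I(X1;Z) + I(X2;Z|X1)
print(lhs, rhs)                                      # identical up to rounding
</pre>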
  
  
{{GraueBox|TEXT=
$\text{Example 5:}$&nbsp;  We consider the&nbsp; [[Theory_of_Stochastic_Signals/Markovketten|Markov chain]] &nbsp; $X → Y → Z$.&nbsp; For such a constellation, the&nbsp; "Data Processing Theorem"&nbsp; always holds, with the following consequence, which can be derived from the chain rule of mutual information:

:$$I(X;Z) \hspace{-0.05cm} \le  \hspace{-0.05cm}I(X;Y ) \hspace{0.05cm},$$
:$$I(X;Z) \hspace{-0.05cm}  \le \hspace{-0.05cm} I(Y;Z ) \hspace{0.05cm}.$$

The theorem thus states:
*One cannot gain any additional information about the input&nbsp; $X$&nbsp; by processing the data&nbsp; $Y$&nbsp; into&nbsp; $Z$ &nbsp; ⇒ &nbsp; $Y → Z$.
*Data processing&nbsp; $Y → Z$&nbsp; $($by a second processor$)$&nbsp; only serves the purpose of making the information about&nbsp; $X$&nbsp; more visible.

For more information on the&nbsp; "Data Processing Theorem"&nbsp; see&nbsp; [[Aufgaben:Aufgabe_3.15:_Data_Processing_Theorem|Exercise 3.15]].}}
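The Data Processing Theorem can also be checked numerically for a simple Markov chain&nbsp; $X → Y → Z$,&nbsp; e.g. with&nbsp; $Y$&nbsp; a noisy copy of&nbsp; $X$&nbsp; and&nbsp; $Z$&nbsp; a noisy copy of&nbsp; $Y$;&nbsp; the two error probabilities below are our own test values:

<pre>
from math import log2

def marginal(joint, axes):
    out = {}
    for key, prob in joint.items():
        k = tuple(key[a] for a in axes)
        out[k] = out.get(k, 0.0) + prob
    return out

def entropy(pmf):
    return sum(q * log2(1.0 / q) for q in pmf.values() if q > 0)

def mi(joint, a, b):
    return entropy(marginal(joint, a)) + entropy(marginal(joint, b)) - entropy(marginal(joint, a + b))

# Markov chain X -> Y -> Z: Y flips X with probability e1, Z flips Y with probability e2.
e1, e2 = 0.1, 0.2
P = {}
for x in (0, 1):
    for y in (0, 1):
        for z in (0, 1):
            p_y_given_x = 1 - e1 if y == x else e1
            p_z_given_y = 1 - e2 if z == y else e2
            P[(x, y, z)] = 0.5 * p_y_given_x * p_z_given_y

print(mi(P, (0,), (2,)), "<=", mi(P, (0,), (1,)))   # I(X;Z) <= I(X;Y)
print(mi(P, (0,), (2,)), "<=", mi(P, (1,), (2,)))   # I(X;Z) <= I(Y;Z)
</pre>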
  
==Exercises for the chapter==
<br>
[[Aufgaben:Exercise_3.7:_Some_Entropy_Calculations|Exercise 3.7: Some Entropy Calculations]]

[[Aufgaben:Exercise_3.8:_Once_more_Mutual_Information|Exercise 3.8: Once more Mutual Information]]

[[Aufgaben:Exercise_3.8Z:_Tuples_from_Ternary_Random_Variables|Exercise 3.8Z: Tuples from Ternary Random Variables]]

[[Aufgaben:Exercise_3.9:_Conditional_Mutual_Information|Exercise 3.9: Conditional Mutual Information]]

{{Display}}
Revision as of 14:04, 21 July 2021


Definition of entropy using supp(PXY)


We briefly summarise the results of the last chapter again, assuming the two-dimensional random variable  $XY$  with the probability mass function  $P_{XY}(X,\ Y)$ .  At the same time we use the notation

$${\rm supp} (P_{XY}) = \big \{ \hspace{0.05cm}(x,\ y) \in XY \hspace{0.05cm}, \hspace{0.3cm} {\rm where} \hspace{0.15cm} P_{XY}(X,\ Y) \ne 0 \hspace{0.05cm} \big \} \hspace{0.05cm};$$

$\text{Summarising the last chapter:}$  With this subset  $\text{supp}(P_{XY}) ⊂ P_{XY}$,  the following holds for

  • the  joint entropy :
$$H(XY) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{XY}(X, Y)}\right ] =\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.05cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{XY}(x, y)} \hspace{0.05cm}.$$
  • the  entropies of the one-dimensional random variables  $X$  and  $Y$:
$$H(X) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{X}(X)}\right ] =\hspace{-0.2cm} \sum_{x \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{X})} \hspace{-0.2cm} P_{X}(x) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{X}(x)} \hspace{0.05cm},$$
$$H(Y) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{Y}(Y)}\right ] =\hspace{-0.2cm} \sum_{y \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{Y})} \hspace{-0.2cm} P_{Y}(y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{Y}(y)} \hspace{0.05cm}.$$


$\text{Example 1:}$  We refer again to the examples on the  joint probability and joint entropy  in the last chapter. 

For the two-dimensional probability mass function  $P_{RB}(R, B)$  in  $\text{Example 5}$  with the parameters

  • $R$   ⇒   points of the red cube,
  • $B$   ⇒   points of the blue cube,


the sets  $P_{RB}$  and  $\text{supp}(P_{RB})$  are identical.  Here, all  $6^2 = 36$  squares are occupied by non-zero values.

For the two-dimensional probability mass function  $P_{RS}(R, S)$  in  $\text{Example 6}$  mit den Parametern

  • $R$   ⇒   points of the red cube,
  • $S = R + B$   ⇒   sum of both cubes,


there are  $6 · 11 = 66$ squares, many of which, however, are empty, i.e. stand for the probability  "0" .

  • The subset  $\text{supp}(P_{RS})$ , on the other hand, contains only the  $36$  shaded squares with non-zero probabilities.
  • The entropy remains the same no matter whether one averages over all elements of  $P_{RS}$  or only over the elements of   $\text{supp}(P_{RS})$  since for  $x → 0$  the limit is  $x · \log_2 ({1}/{x}) = 0$.


Conditional probability and conditional entropy


In the book  "Theory of Stochastic Signals"  the following   conditional probabilities  were given for the case of two events  $X$  and  $Y$  ⇒   Bayes' theorem:

$${\rm Pr} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y) = \frac{{\rm Pr} (X \cap Y)}{{\rm Pr} (Y)} \hspace{0.05cm}, \hspace{0.5cm} {\rm Pr} (Y \hspace{-0.05cm}\mid \hspace{-0.05cm} X) = \frac{{\rm Pr} (X \cap Y)}{{\rm Pr} (X)} \hspace{0.05cm}.$$

Applied to probability mass functions, one thus obtains:

$$P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y) = \frac{P_{XY}(X, Y)}{P_{Y}(Y)} \hspace{0.05cm}, \hspace{0.5cm} P_{\hspace{0.03cm}Y \mid \hspace{0.03cm} X} (Y \hspace{-0.05cm}\mid \hspace{-0.05cm} X) = \frac{P_{XY}(X, Y)}{P_{X}(X)} \hspace{0.05cm}.$$

Analogous to the  joint entropy  $H(XY)$ , the following entropy functions can be derived here:

$\text{Definitions:}$ 

  • The  conditional entropy of the random variable  $X$  under condition  $Y$  is:
$$H(X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm}\frac{1}{P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y)}\right ] = \hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (x \hspace{-0.05cm}\mid \hspace{-0.05cm} y)}=\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{P_{Y}(y)}{P_{XY}(x, y)} \hspace{0.05cm}.$$
  • Similarly, for the  second conditional entropy we obtain:
$$H(Y \hspace{-0.1cm}\mid \hspace{-0.05cm} X) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm}\frac{1}{P_{\hspace{0.03cm}Y\hspace{0.03cm} \mid \hspace{0.01cm} X} (Y \hspace{-0.08cm}\mid \hspace{-0.05cm}X)}\right ] =\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{\hspace{0.03cm}Y\hspace{-0.03cm} \mid \hspace{-0.01cm} X} (y \hspace{-0.05cm}\mid \hspace{-0.05cm} x)}=\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{P_{X}(x)}{P_{XY}(x, y)} \hspace{0.05cm}.$$


In the argument of the logarithm function there is always a conditional probability function   ⇒   $P_{X\hspace{0.03cm}| \hspace{0.03cm}Y}(·)$  or  $P_{Y\hspace{0.03cm}|\hspace{0.03cm}X}(·)$  resp.,  while the joint probability   ⇒   $P_{XY}(·)$ is needed for the expectation value formation.

For the conditional entropies, there are the following limitations:

  • Both  $H(X|Y)$  and  $H(Y|X)$  are always greater than or equal to zero.  From  $H(X|Y) = 0$  it follows directly  $H(Y|X) = 0$. 
    Both are only possible for   "disjoint sets"  $X$  and  $Y$.
  • $H(X|Y) ≤ H(X)$  and  $H(Y|X) ≤ H(Y)$ always apply.  These statements are plausible if one realises that one can also use  "uncertainty"  synonymously for  "entropy".  For:   The uncertainty with respect to the quantity  $X$  cannot be increased by knowing  $Y$. 
  • Except in the case of statistical independence   ⇒   $H(X|Y) = H(X)$ ,   $H(X|Y) < H(X)$ always holds.  Because of  $H(X) ≤ H(XY)$  and  $H(Y) ≤ H(XY)$ ,  therefore also  $H(X|Y) ≤ H(XY)$  and  $H(Y|X) ≤ H(XY)$  hold.  Thus, a conditional entropy can never become larger than the joint entropy.


$\text{Example 2:}$  We consider the joint probabilities  $P_{RS}(·)$  of our dice experiment, which were determined in the  last chapter  as  $\text{Example 6}$.  The corresponding  $P_{RS}(·)$  is given again in the middle of the following graph.

Joint probabilities  $P_{RS}$  and conditional probabilities  $P_{S \vert R}$  and  $P_{R \vert S}$

The two conditional probability functions are drawn on the outside:

$\text{On the left}$  you see the conditional probability mass function 

$$P_{S \vert R}(⋅) = P_{SR}(⋅)/P_R(⋅).$$
  • Because of  $P_R(R) = \big [1/6, \ 1/6, \ 1/6, \ 1/6, \ 1/6, \ 1/6 \big ]$  the probability  $1/6$  is in all shaded fields  
  • That means:   $\text{supp}(P_{S\vert R}) = \text{supp}(P_{R\vert S})$  . 
  • From this follows for the conditional entropy:
$$H(S \hspace{-0.1cm}\mid \hspace{-0.13cm} R) = \hspace{-0.2cm} \sum_{(r, s) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{RS})} \hspace{-0.6cm} P_{RS}(r, s) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{\hspace{0.03cm}S \hspace{0.03cm} \mid \hspace{0.03cm} R} (s \hspace{-0.05cm}\mid \hspace{-0.05cm} r)} $$
$$\Rightarrow \hspace{0.3cm}H(S \hspace{-0.1cm}\mid \hspace{-0.13cm} R) = 36 \cdot \frac{1}{36} \cdot {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit} \hspace{0.05cm}.$$

$\text{On the right}$,  $P_{R\vert S}(⋅) = P_{RS}(⋅)/P_S(⋅)$  is given with  $P_S(⋅)$  according to  $\text{Example 6}$. 

  • $\text{supp}(P_{R\vert S}) = \text{supp}(P_{S\vert R})$   ⇒  same non-zero fields result.
  • However, the probability values now increase continuously from the centre  $(1/6)$  towards the edges up to  $1$  in the corners.
  • It follows that:
$$H(R \hspace{-0.1cm}\mid \hspace{-0.13cm} S) = \frac{1}{36} \cdot {\rm log}_2 \hspace{0.1cm} (6) + \frac{2}{36} \cdot \sum_{i=1}^5 \big [ i \cdot {\rm log}_2 \hspace{0.1cm} (i) \big ]= 1.896\ {\rm bit} \hspace{0.05cm}.$$

On the other hand, for the conditional probabilities of the 2D random variable  $RB$  according to  $\text{Example 5}$,  one obtains because of  $P_{RB}(⋅) = P_R(⋅) · P_B(⋅)$:

$$\begin{align*}H(B \hspace{-0.1cm}\mid \hspace{-0.13cm} R) \hspace{-0.15cm} & = \hspace{-0.15cm} H(B) = {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit} \hspace{0.05cm},\\ H(R \hspace{-0.1cm}\mid \hspace{-0.13cm} B) \hspace{-0.15cm} & = \hspace{-0.15cm} H(R) = {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit} \hspace{0.05cm}.\end{align*}$$


Mutual information between two random variables


We consider the two-dimensional random variable  $XY$  with PMF  $P_{XY}(X, Y)$. Let the one-dimensional functions  $P_X(X)$  and  $P_Y(Y)$ also be known.

Now the following questions arise:

  • How does the knowledge of the random variable  $Y$  reduce the uncertainty with respect to  $X$?
  • How does the knowledge of the random variable  $X$  reduce the uncertainty with respect to  $Y$?


To answer this question, we need a definition that is substantial for information theory:

$\text{Definition:}$  The  mutual information between the random variables  $X$  and  $Y$ – both over the same alphabet – is given as follows:

$$I(X;\ Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.08cm} \frac{P_{XY}(X, Y)} {P_{X}(X) \cdot P_{Y}(Y) }\right ] =\hspace{-0.25cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY})} \hspace{-0.8cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.08cm} \frac{P_{XY}(x, y)} {P_{X}(x) \cdot P_{Y}(y) } \hspace{0.01cm}.$$

A comparison with the  last chapter  shows that the mutual information can also be written as a  Kullback–Leibler distance  between the two-dimensional probability mass function  $P_{XY}$  and the product  $P_X · P_Y$  :

$$I(X;Y) = D(P_{XY} \hspace{0.05cm}\vert \vert \hspace{0.05cm} P_X \cdot P_Y) \hspace{0.05cm}.$$

It is thus obvious that  $I(X;\ Y) ≥ 0$  always holds.  Because of the symmetry,   $I(Y;\ X)$ = $I(X;\ Y)$ is also true.


By splitting the  $\log_2$ argument according to

$$I(X;Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm} \frac{1} {P_{X}(X) }\right ] - {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm} \frac {P_{Y}(Y) }{P_{XY}(X, Y)} \right ] $$

is obtained using  $P_{X|Y}(\cdot) = P_{XY}(\cdot)/P_Y(Y)$:

$$I(X;Y) = H(X) - H(X \hspace{-0.1cm}\mid \hspace{-0.1cm} Y) \hspace{0.05cm}.$$
  • This means:   The uncertainty regarding the random quantity  $X$   ⇒   entropy  $H(X)$  decreases by the amount  $H(X|Y)$  when  $Y$ is known.  The remainder is the mutual information  $I(X; Y)$.
  • With a different splitting, one arrives at the result
$$I(X;Y) = H(Y) - H(Y \hspace{-0.1cm}\mid \hspace{-0.1cm} X) \hspace{0.05cm}.$$
  • Ergo:   The mutual information  $I(X; Y)$  is symmetrical   ⇒   $X$  says just as much about  $Y$  as  $Y$  says about  $X$   ⇒   "mutual".  The semicolon indicates equality.


$\text{Conclusion:}$  Often the equations mentioned here are clarified by a diagram, as in the following examples. 
From this you can see that the following equations also apply:

$$I(X;\ Y) = H(X) + H(Y) - H(XY) \hspace{0.05cm},$$
$$I(X;\ Y) = H(XY) - H(X \hspace{-0.1cm}\mid \hspace{-0.1cm} Y) - H(Y \hspace{-0.1cm}\mid \hspace{-0.1cm} X) \hspace{0.05cm}.$$


$\text{Example 3:}$  We return  (for the last time)  to the  dice experiment  with the red  $(R)$  and blue  $(B)$  cube.  The random variable  $S$  gives the sum of the two dice:  $S = R + B$.  Here we consider the 2D random variable  $RS$. 

In earlier examples we calculated

  • the entropies  $H(R) = 2.585 \ \rm bit$  and  $H(S) = 3.274 \ \rm bit$   ⇒  Example 6  in the last chapter,
  • the join entropies  $H(RS) = 5.170 \ \rm bit$   ⇒   Example 6  in the last chapter,
  • the conditional entropies  $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R) = 2.585 \ \rm bit$  and  $H(R \hspace{0.05cm} \vert \hspace{0.05cm} S) = 1.896 \ \rm bit$   ⇒   Example 2  in the previous section.
Diagram of all entropies of the „dice experiment”


These quantities are compiled in the graph, with the random quantity  $R$  marked by the basic colour „red” and the sum  $S$  marked by the basic colour „green” .  Conditional entropies are shaded.  One can see from this representation:

  • The entropy  $H(R) = \log_2 (6) = 2.585\ \rm bit$  is exactly half as large as the joint entropy  $H(RS)$.  Because:  If one knows  $R$,  then  $S$  provides exactly the same information as the random quantity  $B$, namely  $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R) = H(B) = \log_2 (6) = 2.585\ \rm bit$. 
  • Note:   $H(R)$ = $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R)$  only applies in this example, not in general.
  • As expected, the entropy  $H(S) = 3.274 \ \rm bit$  is greater than  $H(R)= 2.585\ \rm bit$  here.  Because of  $H(S) + H(R \hspace{0.05cm} \vert \hspace{0.05cm} S) = H(R) + H(S \hspace{0.05cm} \vert \hspace{0.05cm} R)$,  $H(R \hspace{0.05cm} \vert \hspace{0.05cm} S)$  must be smaller than  $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R)$  by exactly the amount by which  $H(R)$  is smaller than  $H(S)$.  Since  $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R) = H(R)$  holds here, this difference equals the mutual information  $I(R;\ S) = 0.689 \ \rm bit$.
  • The mutual information between the random variables  $R$  and  $S$  also results from the equation
$$I(R;\ S) = H(R) + H(S) - H(RS) = 2.585\ {\rm bit} + 3.274\ {\rm bit} - 5.170\ {\rm bit} = 0.689\ {\rm bit} \hspace{0.05cm}. $$
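
All numerical values of this example can be reproduced with the following sketch  (Python/NumPy assumed;  the code is our own illustration, not part of the original text):

<pre>
import numpy as np
from collections import Counter

# joint PMF P_RS of the red die R and the sum S = R + B  (two fair dice)
P_rs = Counter()
for r in range(1, 7):
    for b in range(1, 7):
        P_rs[(r, r + b)] += 1 / 36

def H(probs):
    """Entropy in bit of a collection of probabilities."""
    return -sum(p * np.log2(p) for p in probs if p > 0)

P_r, P_s = Counter(), Counter()
for (r, s), p in P_rs.items():               # marginal PMFs P_R and P_S
    P_r[r] += p
    P_s[s] += p

H_R, H_S, H_RS = H(P_r.values()), H(P_s.values()), H(P_rs.values())
print(H_R, H_S, H_RS)                        # 2.585 bit,  3.274 bit,  5.170 bit
print(H_RS - H_R)                            # H(S|R) = 2.585 bit
print(H_RS - H_S)                            # H(R|S) = 1.896 bit
print(H_R + H_S - H_RS)                      # I(R;S) = 0.689 bit
</pre>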


Conditional mutual information


We now consider three random variables  $X$,  $Y$  and  $Z$, which may be mutually dependent.

$\text{Definition:}$  The   conditional mutual information   between the random variables  $X$  and  $Y$  for a given  $Z = z$  is defined as follows:

$$I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z = z) = H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z = z) - H(X\hspace{0.05cm}\vert\hspace{0.05cm}Y,\hspace{0.05cm} Z = z) \hspace{0.05cm}.$$

The  conditional mutual information  between the random variables  $X$  and  $Y$  for the random variable  $Z$  is obtained in general by averaging over all  $z \in Z$:

$$I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z ) = H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z ) - H(X\hspace{0.05cm}\vert\hspace{0.05cm}YZ )= \hspace{-0.3cm} \sum_{z \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{Z})} \hspace{-0.25cm} P_{Z}(z) \cdot I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z = z) \hspace{0.05cm}.$$

Here,  $P_Z(Z)$  denotes the probability mass function  $\rm (PMF)$  of the random variable  $Z$,  and  $P_Z(z)$  denotes the probability of the realisation  $Z = z$.
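
A possible implementation of this averaging  (a sketch only;  the dictionary representation of  $P_{XYZ}$  and the function name are our own assumptions)  uses the identity  $P_{XY\hspace{0.03cm}\vert\hspace{0.03cm}Z}/(P_{X\hspace{0.03cm}\vert\hspace{0.03cm}Z} \cdot P_{Y\hspace{0.03cm}\vert\hspace{0.03cm}Z}) = P_{XYZ} \cdot P_Z/(P_{XZ} \cdot P_{YZ})$:

<pre>
import numpy as np

def cond_mutual_information(P_xyz):
    """I(X;Y|Z) in bit;  P_xyz is a dict mapping (x, y, z) to its probability."""
    P_z, P_xz, P_yz = {}, {}, {}
    for (x, y, z), p in P_xyz.items():       # accumulate the required marginals
        P_z[z]       = P_z.get(z, 0)       + p
        P_xz[(x, z)] = P_xz.get((x, z), 0) + p
        P_yz[(y, z)] = P_yz.get((y, z), 0) + p
    I = 0.0
    for (x, y, z), p in P_xyz.items():
        if p > 0:                            # sum only over supp(P_XYZ)
            I += p * np.log2(p * P_z[z] / (P_xz[(x, z)] * P_yz[(y, z)]))
    return I
</pre>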


$\text{Please note:}$ 

  • For the conditional entropy, as is well known, the relation   $H(X\hspace{0.05cm}\vert\hspace{0.05cm}Z) ≤ H(X)$  holds.
  • For the mutual information, this relation does not necessarily hold:
        $I(X;\ Y\hspace{0.05cm}\vert\hspace{0.05cm}Z)$  can be  smaller than,  equal to,  or larger than  $I(X;\ Y)$.


[Figure:  2D PMF  $P_{XZ}$]

$\text{Example 4:}$  We consider the binary random variables  $X$,  $Y$  and  $Z$  with the following properties:

  • Let  $X$  and  $Y$  be statistically independent, with the probability mass functions
$$P_X(X) = \big [1/2, \ 1/2 \big], \hspace{0.2cm} P_Y(Y) = \big[1- p, \ p \big] \ ⇒ \ H(X) = 1\ {\rm bit}, \hspace{0.2cm} H(Y) = H_{\rm bin}(p).$$
  • $Z$  is the modulo-2 sum of  $X$  and  $Y$:   $Z = X ⊕ Y$.


From the joint probability mass function  $P_{XZ}$  according to the upper graph, it follows:

  • Summing the column probabilities gives 
        $P_Z(Z) = \big [1/2, \ 1/2 \big ]$   ⇒   $H(Z) = 1\ {\rm bit}.$
  • $X$  and  $Z$  are also statistically independent, since the 2D PMF satisfies 
        $P_{XZ}(X,\ Z) = P_X(X) · P_Z(Z)$.
  • It follows that:
        $H(Z\hspace{0.05cm}\vert\hspace{0.05cm} X) = H(Z),\hspace{0.5cm}H(X \hspace{0.05cm}\vert\hspace{0.05cm} Z) = H(X),\hspace{0.5cm} I(X;\ Z) = 0.$


[Figure:  Conditional 2D PMF  $P_{X\hspace{0.05cm}\vert\hspace{0.05cm}YZ}$]


From the conditional probability mass function  $P_{X\vert YZ}$  according to the graph below, we can calculate:

  • $H(X\hspace{0.05cm}\vert\hspace{0.05cm} YZ) = 0$,  since all  $P_{X\hspace{0.05cm}\vert\hspace{0.05cm} YZ}$ entries are  $0$  or  $1$   ⇒   "conditional entropy",
  • $I(X; YZ) = H(X) - H(X\hspace{0.05cm}\vert\hspace{0.05cm} YZ) = H(X)= 1 \ {\rm bit}$   ⇒   "mutual information",
  • $I(X; Y\vert Z) = H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z) =H(X)=1 \ {\rm bit} $   ⇒   "conditional mutual information".


In the present example, the conditional mutual information  $I(X;\ Y\hspace{0.05cm}\vert\hspace{0.05cm} Z) = 1\ {\rm bit}$  is thus greater than the conventional mutual information  $I(X;\ Y) = 0$.
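
The values of this example can be reproduced with the following sketch  (an illustration only;  Python and the helper functions are our own constructs, and the symmetric case  $p = 1/2$  is assumed so that the factorisation  $P_{XZ} = P_X · P_Z$  used above holds exactly):

<pre>
import numpy as np

p = 0.5                                        # assumed value:  uniform P_Y = [1-p, p]
P_xyz = {}                                     # joint PMF of (X, Y, Z) with Z = X xor Y
for x in (0, 1):
    for y in (0, 1):
        P_xyz[(x, y, x ^ y)] = 0.5 * ((1 - p) if y == 0 else p)

def marg(P, keep):
    """Marginal PMF, keeping the tuple positions listed in 'keep'."""
    out = {}
    for k, pr in P.items():
        kk = tuple(k[i] for i in keep)
        out[kk] = out.get(kk, 0) + pr
    return out

def H(P):
    return -sum(pr * np.log2(pr) for pr in P.values() if pr > 0)

def I(P, a, b):
    """Mutual information between the tuple positions a and b."""
    return H(marg(P, a)) + H(marg(P, b)) - H(marg(P, a + b))

print(I(P_xyz, [0], [1]))                      # I(X;Y)  = 0
print(I(P_xyz, [0], [2]))                      # I(X;Z)  = 0
print(I(P_xyz, [0], [1, 2]))                   # I(X;YZ) = 1 bit
print(I(P_xyz, [0], [1, 2]) - I(P_xyz, [0], [2]))   # I(X;Y|Z) = I(X;YZ) - I(X;Z) = 1 bit
</pre>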


Chain rule of the mutual information


So far we have only considered the mutual information between two one-dimensional random variables.  Now we extend the definition to a total of  $n + 1$  random variables, which, purely for reasons of notation, we denote by  $X_1$,  ... ,  $X_n$  and  $Z$.  Then the following holds:

$\text{Chain rule of mutual information:}$ 

The mutual information between the  $n$–dimensional random variable  $X_1 X_2 \hspace{0.05cm}\text{...} \hspace{0.05cm} X_n$  and the random variable  $Z$  can be represented and calculated as follows:

$$I(X_1\hspace{0.05cm}X_2\hspace{0.05cm}\text{...} \hspace{0.1cm}X_n;Z) = I(X_1;Z) + I(X_2;Z \vert X_1) + \hspace{0.05cm}\text{...} \hspace{0.1cm}+ I(X_n;Z\vert X_1\hspace{0.05cm}X_2\hspace{0.05cm}\text{...} \hspace{0.1cm}X_{n-1}) = \sum_{i = 1}^{n} I(X_i;Z \vert X_1\hspace{0.05cm}X_2\hspace{0.05cm}\text{...} \hspace{0.1cm}X_{i-1}) \hspace{0.05cm}.$$

$\text{Proof:}$  We restrict ourselves here to the case  $n = 2$, i.e. to a total of three random variables, and replace  $X_1$  by $X$ and  $X_2$  by  $Y$.  Then we obtain:

$$\begin{align*}I(X\hspace{0.05cm}Y;Z) & = H(XY) - H(XY\hspace{0.05cm} \vert \hspace{0.05cm}Z) = \\ & = \big [ H(X)+ H(Y\hspace{0.05cm} \vert \hspace{0.05cm} X)\big ] - \big [ H(X\hspace{0.05cm} \vert \hspace{0.05cm} Z) + H(Y\hspace{0.05cm} \vert \hspace{0.05cm} XZ)\big ] =\\ & = \big [ H(X)- H(X\hspace{0.05cm} \vert \hspace{0.05cm} Z)\big ] + \big [ H(Y\hspace{0.05cm} \vert \hspace{0.05cm} X) - H(Y\hspace{0.05cm} \vert \hspace{0.05cm}XZ)\big ]=\\ & = I(X;Z) + I(Y;Z \hspace{0.05cm} \vert \hspace{0.05cm} X) \hspace{0.05cm}.\end{align*}$$


  • From this equation one can see that the relation  $I(X Y;\ Z) ≥ I(X;\ Z)$  always holds.
  • Equality holds if the conditional mutual information  $I(Y;\ Z \hspace{0.05cm} \vert \hspace{0.05cm} X) = 0$,  i.e. if the random variables  $Y$  and  $Z$  are statistically independent for a given  $X$.
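
The chain rule can also be checked numerically.  The following sketch  (our own illustration;  binary alphabets and a randomly drawn joint PMF  $P_{XYZ}$  are arbitrary assumptions)  verifies  $I(XY;\ Z) = I(X;\ Z) + I(Y;\ Z \hspace{0.05cm}\vert\hspace{0.05cm} X)$:

<pre>
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((2, 2, 2)); P /= P.sum()        # joint PMF P_XYZ;  axes: x, y, z

def H(P):
    P = P[P > 0]
    return -np.sum(P * np.log2(P))

H_xyz = H(P)
H_xy, H_xz = H(P.sum(2)), H(P.sum(1))          # H(XY), H(XZ)
H_x,  H_z  = H(P.sum((1, 2))), H(P.sum((0, 1)))   # H(X), H(Z)

I_xy_z = H_xy + H_z - H_xyz                    # I(XY;Z)
I_x_z  = H_x + H_z - H_xz                      # I(X;Z)
# I(Y;Z|X) = H(Y|X) - H(Y|XZ) = [H(XY) - H(X)] - [H(XYZ) - H(XZ)]
I_y_z_given_x = (H_xy - H_x) - (H_xyz - H_xz)

print(I_xy_z, I_x_z + I_y_z_given_x)           # both values agree
</pre>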


$\text{Example 5:}$  We consider the  Markov chain   $X → Y → Z$.  For such a constellation, the  "Data Processing Theorem"  always holds with the following consequence, which can be derived from the chain rule of mutual information:

$$I(X;Z) \hspace{-0.05cm} \le \hspace{-0.05cm}I(X;Y ) \hspace{0.05cm},$$
$$I(X;Z) \hspace{-0.05cm} \le \hspace{-0.05cm} I(Y;Z ) \hspace{0.05cm}.$$

The theorem thus states:

  • One cannot gain any additional information about the input  $X$  by processing the data  $Y$   ⇒   processing  $Y → Z$.
  • Data processing  $Y → Z$  $($by a second processor$)$  can at best serve to make the information about  $X$  more accessible.


For more information on the  "Data Processing Theorem"  see  Exercise 3.15.
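
As a numerical illustration of the Data Processing Theorem  (a sketch under our own assumptions:  two cascaded binary symmetric channels with arbitrarily chosen error probabilities form the Markov chain  $X → Y → Z$):

<pre>
import numpy as np

def H(P):
    P = P[P > 0]
    return -np.sum(P * np.log2(P))

def mutual_information(P_xy):
    P_x, P_y = P_xy.sum(1), P_xy.sum(0)
    return H(P_x) + H(P_y) - H(P_xy)

p1, p2 = 0.1, 0.2                               # assumed error probabilities
BSC1 = np.array([[1 - p1, p1], [p1, 1 - p1]])   # transition matrix P_{Y|X}
BSC2 = np.array([[1 - p2, p2], [p2, 1 - p2]])   # transition matrix P_{Z|Y}

P_x  = np.array([0.5, 0.5])                     # uniform input
P_xy = P_x[:, None] * BSC1                      # joint PMF of (X, Y)
P_xz = P_x[:, None] * (BSC1 @ BSC2)             # joint PMF of (X, Z)

print(mutual_information(P_xy), mutual_information(P_xz))   # I(X;Y) >= I(X;Z)
</pre>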


Exercises for the chapter


Exercise 3.7: Some Entropy Calculations

Exercise 3.8: Once more Mutual Information

Exercise 3.8Z: Tuples from Ternary Random Variables

Exercise 3.9: Conditional Mutual Information