Difference between revisions of "Information Theory/Different Entropy Measures of Two-Dimensional Random Variables"

From LNTwww
 
(8 intermediate revisions by 2 users not shown)
Line 9: Line 9:
 
==Definition of entropy using supp(<i>P<sub>XY</sub></i>)==  
 
==Definition of entropy using supp(<i>P<sub>XY</sub></i>)==  
 
<br>  
 
<br>  
We briefly summarise the results of the last chapter again, assuming the two-dimensional random variable&nbsp; $XY$&nbsp; with the probability mass function&nbsp; $P_{XY}(X,\ Y)$&nbsp;.&nbsp; At the same time we use the notation
+
We briefly summarize the results of the last chapter again, assuming the two-dimensional random variable&nbsp; $XY$&nbsp; with the probability mass function&nbsp; $P_{XY}(X,\ Y)$&nbsp;.&nbsp; At the same time we use the notation
 
   
 
   
 
:$${\rm supp} (P_{XY}) = \big \{ \hspace{0.05cm}(x,\ y) \in XY \hspace{0.05cm},
 
:$${\rm supp} (P_{XY}) = \big \{ \hspace{0.05cm}(x,\ y) \in XY \hspace{0.05cm},
Line 15: Line 15:
  
 
{{BlaueBox|TEXT=
 
{{BlaueBox|TEXT=
$\text{Summarising the last chapter:}$&nbsp; With this subset&nbsp; $\text{supp}(P_{XY}) ⊂ P_{XY}$,&nbsp; the following holds for
+
$\text{Summarizing the last chapter:}$&nbsp; With this subset&nbsp; $\text{supp}(P_{XY}) ⊂ P_{XY}$,&nbsp; the following holds for
*the&nbsp; '''joint entropy'''&nbsp;:
+
*the&nbsp; &raquo;'''joint entropy'''&laquo;&nbsp;:
 
   
 
   
 
:$$H(XY) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{XY}(X, Y)}\right ] =\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.05cm})}  
 
:$$H(XY) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{XY}(X, Y)}\right ] =\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.05cm})}  
 
  \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{XY}(x, y)} \hspace{0.05cm}.$$
 
  \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{XY}(x, y)} \hspace{0.05cm}.$$
  
*the&nbsp; '''entropies of the one-dimensional random variables'''&nbsp; $X$&nbsp; and&nbsp; $Y$:
+
*the&nbsp; &raquo;'''entropies of the one-dimensional random variables'''&laquo;&nbsp; $X$&nbsp; and&nbsp; $Y$:
 
    
 
    
 
:$$H(X) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{X}(X)}\right ] =\hspace{-0.2cm} \sum_{x \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{X})}  
 
:$$H(X) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{X}(X)}\right ] =\hspace{-0.2cm} \sum_{x \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{X})}  
Line 30: Line 30:
  
 
{{GraueBox|TEXT=
 
{{GraueBox|TEXT=
$\text{Example 1:}$&nbsp; We refer again to the examples on the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|joint probability and joint entropy]]&nbsp; in the last chapter.&nbsp;  
+
$\text{Example 1:}$&nbsp; We refer again to the examples on the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|$\text{joint probability and joint entropy}$]]&nbsp; in the last chapter.&nbsp;  
  
 
For the two-dimensional probability mass function&nbsp; $P_{RB}(R, B)$&nbsp; in&nbsp;  [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|$\text{Example 5}$]]&nbsp; with the parameters
 
For the two-dimensional probability mass function&nbsp; $P_{RB}(R, B)$&nbsp; in&nbsp;  [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|$\text{Example 5}$]]&nbsp; with the parameters
Line 39: Line 39:
 
the sets&nbsp; $P_{RB}$&nbsp; and&nbsp; $\text{supp}(P_{RB})$&nbsp; are identical.&nbsp; Here, all&nbsp; $6^2 = 36$&nbsp; squares are occupied by non-zero values.
 
the sets&nbsp; $P_{RB}$&nbsp; and&nbsp; $\text{supp}(P_{RB})$&nbsp; are identical.&nbsp; Here, all&nbsp; $6^2 = 36$&nbsp; squares are occupied by non-zero values.
  
For the two-dimensional probability mass function&nbsp; $P_{RS}(R, S)$&nbsp;  in&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|$\text{Example 6}$]]&nbsp; mit den Parametern  
+
For the two-dimensional probability mass function&nbsp; $P_{RS}(R, S)$&nbsp;  in&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|$\text{Example 6}$]]&nbsp; with the parameters  
 
*$R$ &nbsp; &rArr; &nbsp;  points of the red cube,
 
*$R$ &nbsp; &rArr; &nbsp;  points of the red cube,
 
*$S = R + B$ &nbsp; &rArr; &nbsp; sum of both cubes,
 
*$S = R + B$ &nbsp; &rArr; &nbsp; sum of both cubes,
Line 51: Line 51:
 
==Conditional probability and conditional entropy ==  
 
==Conditional probability and conditional entropy ==  
 
<br>  
 
<br>  
In the book&nbsp; "Theory of Stochastic Signals"&nbsp; the following &nbsp; [[Theory_of_Stochastic_Signals/Statistical_Dependence_and_Independence#Conditional_Probability|conditional probabilities]]&nbsp;  were given for the case of two events&nbsp; $X$&nbsp; and&nbsp; $Y$&nbsp;  ⇒  &nbsp; '''Bayes' theorem''':
+
In the book&nbsp; "Theory of Stochastic Signals"&nbsp; the following &nbsp; [[Theory_of_Stochastic_Signals/Statistical_Dependence_and_Independence#Conditional_probability|$\text{conditional probabilities}$]]&nbsp;  were given for the case of two events&nbsp; $X$&nbsp; and&nbsp; $Y$&nbsp;  ⇒  &nbsp; &raquo;'''Bayes' theorem'''&laquo;:
 
   
 
   
 
:$${\rm Pr} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y)  = \frac{{\rm Pr} (X \cap  Y)}{{\rm Pr} (Y)} \hspace{0.05cm}, \hspace{0.5cm}
 
:$${\rm Pr} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y)  = \frac{{\rm Pr} (X \cap  Y)}{{\rm Pr} (Y)} \hspace{0.05cm}, \hspace{0.5cm}
Line 61: Line 61:
 
P_{\hspace{0.03cm}Y \mid \hspace{0.03cm} X} (Y \hspace{-0.05cm}\mid \hspace{-0.05cm} X)  =  \frac{P_{XY}(X, Y)}{P_{X}(X)} \hspace{0.05cm}.$$
 
P_{\hspace{0.03cm}Y \mid \hspace{0.03cm} X} (Y \hspace{-0.05cm}\mid \hspace{-0.05cm} X)  =  \frac{P_{XY}(X, Y)}{P_{X}(X)} \hspace{0.05cm}.$$
  
Analogous to the&nbsp; [[Information_Theory/Verschiedene_Entropien_zweidimensionaler_Zufallsgrößen#Definition_of_entropy_using_supp.28PXY.29|joint entropy]]&nbsp; $H(XY)$&nbsp;, the following entropy functions can be derived here:
+
Analogous to the&nbsp; [[Information_Theory/Verschiedene_Entropien_zweidimensionaler_Zufallsgrößen#Definition_of_entropy_using_supp.28PXY.29|$\text{joint entropy}$]]&nbsp; $H(XY)$&nbsp;, the following entropy functions can be derived here:
  
 
{{BlaueBox|TEXT=
 
{{BlaueBox|TEXT=
 
$\text{Definitions:}$&nbsp;
 
$\text{Definitions:}$&nbsp;
*The&nbsp; '''conditional entropy''' of the random variable&nbsp; $X$&nbsp; under condition&nbsp; $Y$&nbsp; is:
+
*The&nbsp; &raquo;'''conditional entropy'''&laquo; of the random variable&nbsp; $X$&nbsp; under condition&nbsp; $Y$&nbsp; is:
 
   
 
   
 
:$$H(X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm}\frac{1}{P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y)}\right ] = \hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})}  
 
:$$H(X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm}\frac{1}{P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y)}\right ] = \hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})}  
Line 72: Line 72:
 
  \hspace{0.05cm}.$$
 
  \hspace{0.05cm}.$$
  
*Similarly, for the&nbsp; '''second conditional entropy''' we obtain:
+
*Similarly, for the&nbsp; &raquo;'''second conditional entropy'''&laquo; we obtain:
 
   
 
   
 
:$$H(Y \hspace{-0.1cm}\mid \hspace{-0.05cm} X) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm}\frac{1}{P_{\hspace{0.03cm}Y\hspace{0.03cm} \mid \hspace{0.01cm} X} (Y \hspace{-0.08cm}\mid \hspace{-0.05cm}X)}\right ] =\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})}  
 
:$$H(Y \hspace{-0.1cm}\mid \hspace{-0.05cm} X) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm}\frac{1}{P_{\hspace{0.03cm}Y\hspace{0.03cm} \mid \hspace{0.01cm} X} (Y \hspace{-0.08cm}\mid \hspace{-0.05cm}X)}\right ] =\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})}  
Line 83: Line 83:
  
 
For the conditional entropies, there are the following limitations:
 
For the conditional entropies, there are the following limitations:
*Both&nbsp; $H(X|Y)$&nbsp; and&nbsp; $H(Y|X)$&nbsp; are always greater than or equal to zero.&nbsp; From&nbsp; $H(X|Y) = 0$&nbsp; it follows directly&nbsp; $H(Y|X) = 0$.&nbsp; <br>Both are only possible for &nbsp; [[Theory_of_Stochastic_Signals/Mengentheoretische_Grundlagen#Disjunkte_Mengen|"disjoint sets"]]&nbsp; $X$&nbsp; and&nbsp; $Y$.
+
*Both&nbsp; $H(X|Y)$&nbsp; and&nbsp; $H(Y|X)$&nbsp; are always greater than or equal to zero.&nbsp; From&nbsp; $H(X|Y) = 0$&nbsp; it follows directly&nbsp; $H(Y|X) = 0$.&nbsp; <br>Both are only possible for &nbsp; [[Theory_of_Stochastic_Signals/Set_Theory_Basics#Disjoint_sets|$\text{disjoint sets}$]]&nbsp; $X$&nbsp; and&nbsp; $Y$.
*$H(X|Y) ≤ H(X)$&nbsp; and&nbsp; $H(Y|X) ≤ H(Y)$ always apply.&nbsp; These statements are plausible if one realises that one can also use&nbsp; "uncertainty"&nbsp; synonymously for&nbsp; "entropy".&nbsp; For: &nbsp; The uncertainty with respect to the quantity&nbsp;  $X$&nbsp; cannot be increased by knowing&nbsp; $Y$.&nbsp;  
+
*$H(X|Y) ≤ H(X)$&nbsp; and&nbsp; $H(Y|X) ≤ H(Y)$ always apply.&nbsp; These statements are plausible if one realizes that one can also use&nbsp; "uncertainty"&nbsp; synonymously for&nbsp; "entropy".&nbsp; For: &nbsp; The uncertainty with respect to the quantity&nbsp;  $X$&nbsp; cannot be increased by knowing&nbsp; $Y$.&nbsp;  
 
*Except in the case of statistical independence  &nbsp; ⇒ &nbsp;  $H(X|Y) = H(X)$&nbsp;, &nbsp; $H(X|Y) < H(X)$ always holds.&nbsp; Because of&nbsp; $H(X) ≤ H(XY)$&nbsp; and&nbsp; $H(Y) ≤ H(XY)$&nbsp;,&nbsp; therefore also&nbsp; $H(X|Y) ≤ H(XY)$&nbsp; and&nbsp; $H(Y|X) ≤ H(XY)$&nbsp;  hold.&nbsp; Thus, '''a conditional entropy can never become larger than the joint entropy'''.
 
*Except in the case of statistical independence  &nbsp; ⇒ &nbsp;  $H(X|Y) = H(X)$&nbsp;, &nbsp; $H(X|Y) < H(X)$ always holds.&nbsp; Because of&nbsp; $H(X) ≤ H(XY)$&nbsp; and&nbsp; $H(Y) ≤ H(XY)$&nbsp;,&nbsp; therefore also&nbsp; $H(X|Y) ≤ H(XY)$&nbsp; and&nbsp; $H(Y|X) ≤ H(XY)$&nbsp;  hold.&nbsp; Thus, '''a conditional entropy can never become larger than the joint entropy'''.
  
  
 
{{GraueBox|TEXT=
 
{{GraueBox|TEXT=
$\text{Example 2:}$&nbsp; We consider the joint probabilities&nbsp; $P_{RS}(·)$&nbsp; of our dice experiment, which were determined in the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Conditional_probability_and_conditional_entropy|last chapter]]&nbsp; as&nbsp; $\text{Example 6}$.&nbsp; The corresponding &nbsp;$P_{RS}(·)$&nbsp; is given again in the middle of the following graph.
+
$\text{Example 2:}$&nbsp; We consider the joint probabilities&nbsp; $P_{RS}(·)$&nbsp; of our dice experiment, which were determined in the&nbsp; [[Information_Theory/Some_Preliminary_Remarks_on_Two-Dimensional_Random_Variables#Joint_probability_and_joint_entropy|"last chapter"]]&nbsp; as&nbsp; $\text{Example 6}$.&nbsp; The corresponding &nbsp;$P_{RS}(·)$&nbsp; is given again in the middle of the following graph.
  
 
[[File:P_ID2764__Inf_T_3_2_S3.png|right|frame|Joint probabilities&nbsp; $P_{RS}$&nbsp; and conditional probabilities&nbsp;  $P_{S \vert R}$&nbsp; and&nbsp; $P_{R \vert S}$]]
 
[[File:P_ID2764__Inf_T_3_2_S3.png|right|frame|Joint probabilities&nbsp; $P_{RS}$&nbsp; and conditional probabilities&nbsp;  $P_{S \vert R}$&nbsp; and&nbsp; $P_{R \vert S}$]]
Line 134: Line 134:
  
 
{{BlaueBox|TEXT=
 
{{BlaueBox|TEXT=
$\text{Definition:}$&nbsp; The&nbsp; '''mutual information''' between the random variables&nbsp; $X$&nbsp; and&nbsp; $Y$ –  both over the same alphabet – is given as follows:
+
$\text{Definition:}$&nbsp; The&nbsp; &raquo;'''mutual information'''&laquo; between the random variables&nbsp; $X$&nbsp; and&nbsp; $Y$ –  both over the same alphabet – is given as follows:
 
   
 
   
 
:$$I(X;\ Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.08cm} \frac{P_{XY}(X, Y)}
 
:$$I(X;\ Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.08cm} \frac{P_{XY}(X, Y)}
Line 141: Line 141:
 
{P_{X}(x) \cdot P_{Y}(y) } \hspace{0.01cm}.$$
 
{P_{X}(x) \cdot P_{Y}(y) } \hspace{0.01cm}.$$
  
A comparison with the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Introductory_example_on_the_statistical_dependence_of_random_variables|last chapter]]&nbsp; shows that the mutual information can also be written as a&nbsp; [[Information_Theory/Some_Preliminary_Remarks_on_Two-Dimensional_Random_Variables#Informational_divergence_-_Kullback-Leibler_distance|Kullback–Leibler distance]]&nbsp; between the two-dimensional probability mass function&nbsp; $P_{XY}$&nbsp; and the product&nbsp; $P_X · P_Y$&nbsp; :
+
A comparison with the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Introductory_example_on_the_statistical_dependence_of_random_variables|"last chapter"]]&nbsp; shows that the mutual information can also be written as a&nbsp; [[Information_Theory/Some_Preliminary_Remarks_on_Two-Dimensional_Random_Variables#Informational_divergence_-_Kullback-Leibler_distance|$\text{Kullback–Leibler distance}$]]&nbsp; between the two-dimensional probability mass function&nbsp; $P_{XY}$&nbsp; and the product&nbsp; $P_X · P_Y$&nbsp; :
 
   
 
   
 
:$$I(X;Y) = D(P_{XY} \hspace{0.05cm}\vert \vert \hspace{0.05cm} P_X \cdot P_Y) \hspace{0.05cm}.$$
 
:$$I(X;Y) = D(P_{XY} \hspace{0.05cm}\vert \vert \hspace{0.05cm} P_X \cdot P_Y) \hspace{0.05cm}.$$
Line 158: Line 158:
 
:$$I(X;Y) = H(X) - H(X \hspace{-0.1cm}\mid \hspace{-0.1cm} Y) \hspace{0.05cm}.$$
 
:$$I(X;Y) = H(X) - H(X \hspace{-0.1cm}\mid \hspace{-0.1cm} Y) \hspace{0.05cm}.$$
  
*This means: &nbsp; The uncertainty regarding the random quantity&nbsp; $X$  &nbsp; ⇒  &nbsp;  entropy&nbsp; $H(X)$&nbsp; decreases by the amount&nbsp; $H(X|Y)$&nbsp; when&nbsp; $Y$ is known.&nbsp; The remainder is the mutual information&nbsp; $I(X; Y)$.
+
*This means: &nbsp; The uncertainty regarding the random quantity&nbsp; $X$  &nbsp; ⇒  &nbsp;  entropy&nbsp; $H(X)$&nbsp; decreases by the magnitude&nbsp; $H(X|Y)$&nbsp; when&nbsp; $Y$ is known.&nbsp; The remainder is the mutual information&nbsp; $I(X; Y)$.
 
*With a different splitting, one arrives at the result
 
*With a different splitting, one arrives at the result
 
:$$I(X;Y) = H(Y) - H(Y \hspace{-0.1cm}\mid \hspace{-0.1cm} X) \hspace{0.05cm}.$$
 
:$$I(X;Y) = H(Y) - H(Y \hspace{-0.1cm}\mid \hspace{-0.1cm} X) \hspace{0.05cm}.$$
Line 175: Line 175:
  
 
{{GraueBox|TEXT=
 
{{GraueBox|TEXT=
$\text{Example 3:}$&nbsp; We return&nbsp; (for the last time)&nbsp; to the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Introductory_example_on_the_statistical_dependence_of_random_variables|dice experiment]]&nbsp; with the red&nbsp; $(R)$&nbsp; and blue&nbsp; $(B)$&nbsp; cube.&nbsp;  The random variable&nbsp; $S$&nbsp; gives the sum of the two dice:&nbsp; $S = R + B$.&nbsp;
+
$\text{Example 3:}$&nbsp; We return&nbsp; (for the last time)&nbsp; to the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Introductory_example_on_the_statistical_dependence_of_random_variables|$\text{dice experiment}$]]&nbsp; with the red&nbsp; $(R)$&nbsp; and blue&nbsp; $(B)$&nbsp; cube.&nbsp;  The random variable&nbsp; $S$&nbsp; gives the sum of the two dice:&nbsp; $S = R + B$.&nbsp;
 
Here we consider the 2D random variable&nbsp; $RS$.&nbsp;  
 
Here we consider the 2D random variable&nbsp; $RS$.&nbsp;  
  
 
In earlier examples we calculated
 
In earlier examples we calculated
*the entropies&nbsp; $H(R) = 2.585 \ \rm  bit$&nbsp; and&nbsp; $H(S) = 3.274 \ \rm bit$ &nbsp; ⇒  &nbsp;[[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|Example 6]]&nbsp; in the last chapter,
+
*the entropies&nbsp; $H(R) = 2.585 \ \rm  bit$&nbsp; and&nbsp; $H(S) = 3.274 \ \rm bit$ &nbsp; ⇒  &nbsp;[[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|$\text{Example 6}$]]&nbsp; in the last chapter,
*the join entropies&nbsp; $H(RS) = 5.170 \ \rm bit$  &nbsp; ⇒  &nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|Example 6]]&nbsp; in the last chapter,
+
*the join entropies&nbsp; $H(RS) = 5.170 \ \rm bit$  &nbsp; ⇒  &nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Joint_probability_and_joint_entropy|$\text{Example 6}$]]&nbsp; in the last chapter,
*the conditional entropies&nbsp; $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R) = 2.585 \ \rm bit$&nbsp; and&nbsp; $H(R \hspace{0.05cm} \vert \hspace{0.05cm}  S) = 1.896 \ \rm bit$  &nbsp; ⇒  &nbsp;  [[Information_Theory/Verschiedene_Entropien_zweidimensionaler_Zufallsgrößen#Conditional_probability_and_conditional_entropy|Example 2]]&nbsp; in the previous section.
+
*the conditional entropies&nbsp; $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R) = 2.585 \ \rm bit$&nbsp; and&nbsp; $H(R \hspace{0.05cm} \vert \hspace{0.05cm}  S) = 1.896 \ \rm bit$  &nbsp; ⇒  &nbsp;  [[Information_Theory/Verschiedene_Entropien_zweidimensionaler_Zufallsgrößen#Conditional_probability_and_conditional_entropy|$\text{Example 2}$]]&nbsp; in the previous section.
  
[[File:P_ID2765__Inf_T_3_2_S3_neu.png|frame|Diagram of all entropies of the „dice experiment” ]]
+
[[File:P_ID2765__Inf_T_3_2_S3_neu.png|frame|Diagram of all entropies of the "dice experiment" ]]
  
<br>These quantities are compiled in the graph, with the random quantity&nbsp; $R$&nbsp; marked by the basic colour „red” and the sum&nbsp; $S$&nbsp; marked by the basic colour „green” .&nbsp; Conditional entropies are shaded.&nbsp;
+
<br>These quantities are compiled in the graph, with the random quantity&nbsp; $R$&nbsp; marked by the basic colour "red" and the sum&nbsp; $S$&nbsp; marked by the basic colour "green" .&nbsp; Conditional entropies are shaded.&nbsp;
 
One can see from this representation:
 
One can see from this representation:
 
*The entropy&nbsp; $H(R) = \log_2 (6) = 2.585\ \rm bit$&nbsp; is exactly half as large as the joint entropy&nbsp; $H(RS)$.&nbsp; Because:&nbsp; If one knows&nbsp; $R$,&nbsp; then&nbsp; $S$&nbsp; provides exactly the same information as the random quantity&nbsp; $B$, namely&nbsp; $H(S \hspace{0.05cm} \vert \hspace{0.05cm}  R) = H(B) = \log_2 (6) = 2.585\ \rm bit$.&nbsp;  
 
*The entropy&nbsp; $H(R) = \log_2 (6) = 2.585\ \rm bit$&nbsp; is exactly half as large as the joint entropy&nbsp; $H(RS)$.&nbsp; Because:&nbsp; If one knows&nbsp; $R$,&nbsp; then&nbsp; $S$&nbsp; provides exactly the same information as the random quantity&nbsp; $B$, namely&nbsp; $H(S \hspace{0.05cm} \vert \hspace{0.05cm}  R) = H(B) = \log_2 (6) = 2.585\ \rm bit$.&nbsp;  
Line 198: Line 198:
 
We now consider three random variables&nbsp; $X$,&nbsp; $Y$&nbsp; and&nbsp; $Z$, that can be related to each other.
 
We now consider three random variables&nbsp; $X$,&nbsp; $Y$&nbsp; and&nbsp; $Z$, that can be related to each other.
 
{{BlaueBox|TEXT=
 
{{BlaueBox|TEXT=
$\text{Definition:}$&nbsp; The &nbsp; '''conditional mutual information''' &nbsp;  between the random variables&nbsp; $X$&nbsp; and&nbsp; $Y$&nbsp; '''for a given'''&nbsp; $Z = z$&nbsp; is as follows:
+
$\text{Definition:}$&nbsp; The &nbsp; &raquo;'''conditional mutual information'''&laquo; &nbsp;  between the random variables&nbsp; $X$&nbsp; and&nbsp; $Y$&nbsp; '''for a given'''&nbsp; $Z = z$&nbsp; is as follows:
 
   
 
   
 
:$$I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z = z) =  H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z = z) - H(X\vert\hspace{0.05cm}Y ,\hspace{0.05cm} Z = z) \hspace{0.05cm}.$$
 
:$$I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z = z) =  H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z = z) - H(X\vert\hspace{0.05cm}Y ,\hspace{0.05cm} Z = z) \hspace{0.05cm}.$$
  
One denotes as the conditional&nbsp; '''conditional mutual information'''&nbsp; between the random variables&nbsp; $X$&nbsp; and&nbsp; $Y$&nbsp; for the random variable&nbsp; $Z$&nbsp; '''in general'''&nbsp; <br>after averaging over all&nbsp; $z \in Z$:
+
One denotes as the conditional&nbsp; &raquo;'''conditional mutual information'''&laquo;&nbsp; between the random variables&nbsp; $X$&nbsp; and&nbsp; $Y$&nbsp; for the random variable&nbsp; $Z$&nbsp; '''in general'''&nbsp; <br>after averaging over all&nbsp; $z \in Z$:
 
   
 
   
 
:$$I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z ) =  H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z ) - H(X\vert\hspace{0.05cm}Y  Z )= \hspace{-0.3cm}
 
:$$I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z ) =  H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z ) - H(X\vert\hspace{0.05cm}Y  Z )= \hspace{-0.3cm}
Line 209: Line 209:
 
\hspace{0.05cm}.$$
 
\hspace{0.05cm}.$$
  
$P_Z(Z)$&nbsp; is the probability mass function&nbsp; $\rm  (PMF)$&nbsp; of the random variable&nbsp; $Z$&nbsp; and&nbsp; $P_Z(z)$&nbsp; is the&nbsp; '''probability'''&nbsp; for the realisation&nbsp; $Z = z$.}}
+
$P_Z(Z)$&nbsp; is the probability mass function&nbsp; $\rm  (PMF)$&nbsp; of the random variable&nbsp; $Z$&nbsp; and&nbsp; $P_Z(z)$&nbsp; is the&nbsp; &raquo;'''probability'''&laquo;&nbsp; for the realization&nbsp; $Z = z$.}}
  
  
Line 215: Line 215:
 
$\text{Please note:}$&nbsp;  
 
$\text{Please note:}$&nbsp;  
 
*For the conditional entropy, as is well known, the relation &nbsp; $H(X\hspace{0.05cm}\vert\hspace{0.05cm}Z) ≤ H(X)$&nbsp; holds.  
 
*For the conditional entropy, as is well known, the relation &nbsp; $H(X\hspace{0.05cm}\vert\hspace{0.05cm}Z) ≤ H(X)$&nbsp; holds.  
*For the mutual information, this relation does not necessarily hold: <br> &nbsp; &nbsp; $I(X; Y\hspace{0.05cm}\vert\hspace{0.05cm}Z)$&nbsp; can be&nbsp; '''smaller, equal, but also larger than'''&nbsp; als&nbsp; $I(X; Y)$.}}
+
*For the mutual information, this relation does not necessarily hold: <br> &nbsp; &nbsp; $I(X; Y\hspace{0.05cm}\vert\hspace{0.05cm}Z)$&nbsp; can be&nbsp; '''smaller, equal, but also larger than'''&nbsp; $I(X; Y)$.}}
  
  
Line 274: Line 274:
  
 
{{GraueBox|TEXT=
 
{{GraueBox|TEXT=
$\text{Example 5:}$&nbsp;  We consider the&nbsp; [[Theory_of_Stochastic_Signals/Markovketten|Markov chain]] &nbsp; $X → Y → Z$.&nbsp; For such a constellation, the&nbsp; "Data Processing Theorem"&nbsp; always holds with the following consequence, which can be derived from the chain rule of mutual information:
+
$\text{Example 5:}$&nbsp;  We consider the&nbsp; [[Theory_of_Stochastic_Signals/Markovketten|$\text{Markov chain}$]] &nbsp; $X → Y → Z$.&nbsp; For such a constellation, the&nbsp; "Data Processing Theorem"&nbsp; always holds with the following consequence, which can be derived from the chain rule of mutual information:
 
   
 
   
 
:$$I(X;Z) \hspace{-0.05cm}  \le  \hspace{-0.05cm}I(X;Y ) \hspace{0.05cm},$$
 
:$$I(X;Z) \hspace{-0.05cm}  \le  \hspace{-0.05cm}I(X;Y ) \hspace{0.05cm},$$
Line 284: Line 284:
  
  
For more information on the&nbsp; "Data Processing Theorem"&nbsp; see&nbsp; [[Aufgaben:Aufgabe_3.15:_Data_Processing_Theorem|Exercise 3.15]].}}  
+
For more information on the&nbsp; "Data Processing Theorem"&nbsp; see&nbsp; [[Aufgaben:Aufgabe_3.15:_Data_Processing_Theorem|"Exercise 3.15"]].}}  
  
  
 
==Exercises for the chapter==
 
==Exercises for the chapter==
 
<br>  
 
<br>  
[[Aufgaben:3.7 Einige Entropieberechnungen|Aufgabe 3.7: Einige Entropieberechnungen]]
+
[[Aufgaben:Exercise_3.7:_Some_Entropy_Calculations|Exercise 3.7: Some Entropy Calculations]]  
  
[[Aufgaben:3.8 Nochmals Transinformation|Aufgabe 3.8: Nochmals Transinformation]]
+
[[Aufgaben:Exercise_3.8:_Once_more_Mutual_Information|Exercise 3.8: Once more Mutual Information]]
  
[[Aufgaben:3.8Z Tupel aus ternären Zufallsgrößen|Aufgabe 3.8Z: Tupel aus ternären Zufallsgrößen]]
+
[[Aufgaben:Exercise_3.8Z:_Tuples_from_Ternary_Random_Variables|Exercise 3.8Z: Tuples from Ternary Random Variables]]
  
[[Aufgaben:3.9 Bedingte Transinformation|Aufgabe 3.9: Bedingte Transinformation]]
+
[[Aufgaben:Exercise_3.9:_Conditional_Mutual_Information|Exercise 3.9: Conditional Mutual Information]]
  
  
  
 
{{Display}}
 
{{Display}}

Latest revision as of 16:16, 16 February 2023


Definition of entropy using supp(PXY)


We briefly summarize the results of the last chapter again, assuming the two-dimensional random variable  $XY$  with the probability mass function  $P_{XY}(X,\ Y)$ .  At the same time we use the notation

$${\rm supp} (P_{XY}) = \big \{ \hspace{0.05cm}(x,\ y) \in XY \hspace{0.05cm}, \hspace{0.3cm} {\rm where} \hspace{0.15cm} P_{XY}(x,\ y) \ne 0 \hspace{0.05cm} \big \} \hspace{0.05cm}.$$

$\text{Summarizing the last chapter:}$  With this subset  $\text{supp}(P_{XY}) ⊂ P_{XY}$,  the following holds for

  • the  »joint entropy« :
$$H(XY) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{XY}(X, Y)}\right ] =\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.05cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{XY}(x, y)} \hspace{0.05cm}.$$
  • the  »entropies of the one-dimensional random variables«  $X$  and  $Y$:
$$H(X) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{X}(X)}\right ] =\hspace{-0.2cm} \sum_{x \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{X})} \hspace{-0.2cm} P_{X}(x) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{X}(x)} \hspace{0.05cm},$$
$$H(Y) = {\rm E}\hspace{-0.1cm} \left [ {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{Y}(Y)}\right ] =\hspace{-0.2cm} \sum_{y \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{Y})} \hspace{-0.2cm} P_{Y}(y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{Y}(y)} \hspace{0.05cm}.$$
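The three sums above translate almost literally into code. The following is a minimal sketch, assuming the joint PMF is stored as a Python dictionary that maps pairs $(x, y)$ to probabilities; the helper names `entropy` and `marginals` and the small example PMF are hypothetical and chosen only for illustration.

```python
from math import log2

def entropy(pmf):
    """Entropy in bit of a PMF given as {outcome: probability}.

    Only outcomes in the support (probability > 0) contribute to the sum.
    """
    return sum(p * log2(1.0 / p) for p in pmf.values() if p > 0)

def marginals(p_xy):
    """Marginal PMFs P_X and P_Y of a joint PMF given as {(x, y): probability}."""
    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return p_x, p_y

# Hypothetical joint PMF (not taken from the text); (1, 0) lies outside the support
p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.0, (1, 1): 0.25}

p_x, p_y = marginals(p_xy)
h_xy = entropy(p_xy)   # joint entropy H(XY)
h_x = entropy(p_x)     # H(X)
h_y = entropy(p_y)     # H(Y)
```

For this example PMF the joint entropy is $1.5$ bit and $H(Y) = 1$ bit; the zero entry does not change the result because it is skipped.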


$\text{Example 1:}$  We refer again to the examples on the  $\text{joint probability and joint entropy}$  in the last chapter. 

For the two-dimensional probability mass function  $P_{RB}(R, B)$  in  $\text{Example 5}$  with the parameters

  • $R$   ⇒   number of points shown by the red die,
  • $B$   ⇒   number of points shown by the blue die,


the sets  $P_{RB}$  and  $\text{supp}(P_{RB})$  are identical.  Here, all  $6^2 = 36$  squares are occupied by non-zero values.

For the two-dimensional probability mass function  $P_{RS}(R, S)$  in  $\text{Example 6}$  with the parameters

  • $R$   ⇒   number of points shown by the red die,
  • $S = R + B$   ⇒   sum of both dice,


there are  $6 · 11 = 66$ squares, many of which, however, are empty, i.e. stand for the probability  "0" .

  • The subset  $\text{supp}(P_{RS})$ , on the other hand, contains only the  $36$  shaded squares with non-zero probabilities.
  • The entropy remains the same whether one sums over all elements of  $P_{RS}$  or only over the elements of  $\text{supp}(P_{RS})$,  since  $x · \log_2 ({1}/{x}) → 0$  in the limit  $x → 0$.
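The counts in this example can be reproduced with a few lines of code. This is a sketch under the assumption of two fair dice; the dictionary layout and the helper `h` are illustrative choices, not part of the original text.

```python
from fractions import Fraction
from math import log2

faces = range(1, 7)

# P_RS on the full 6 x 11 grid: a pair (r, s) is possible only if
# the blue die b = s - r lies between 1 and 6; all other squares are empty.
p_rs = {(r, s): (Fraction(1, 36) if 1 <= s - r <= 6 else Fraction(0))
        for r in faces for s in range(2, 13)}

# supp(P_RS): only the squares with non-zero probability
support = {rs: p for rs, p in p_rs.items() if p != 0}

def h(pmf):
    """Entropy in bit; the term for p = 0 is replaced by its limit 0."""
    return sum(float(p) * log2(1 / float(p)) for p in pmf.values() if p != 0)

grid_size = len(p_rs)         # all squares of the grid
support_size = len(support)   # occupied squares
h_full = h(p_rs)              # sum over the whole grid
h_supp = h(support)           # sum over the support only
```

Both sums give $\log_2 (36) ≈ 5.170$ bit, because the empty squares contribute nothing.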


Conditional probability and conditional entropy


In the book  "Theory of Stochastic Signals"  the following   $\text{conditional probabilities}$  were given for the case of two events  $X$  and  $Y$  ⇒   »Bayes' theorem«:

$${\rm Pr} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y) = \frac{{\rm Pr} (X \cap Y)}{{\rm Pr} (Y)} \hspace{0.05cm}, \hspace{0.5cm} {\rm Pr} (Y \hspace{-0.05cm}\mid \hspace{-0.05cm} X) = \frac{{\rm Pr} (X \cap Y)}{{\rm Pr} (X)} \hspace{0.05cm}.$$

Applied to probability mass functions, one thus obtains:

$$P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y) = \frac{P_{XY}(X, Y)}{P_{Y}(Y)} \hspace{0.05cm}, \hspace{0.5cm} P_{\hspace{0.03cm}Y \mid \hspace{0.03cm} X} (Y \hspace{-0.05cm}\mid \hspace{-0.05cm} X) = \frac{P_{XY}(X, Y)}{P_{X}(X)} \hspace{0.05cm}.$$

Analogous to the  $\text{joint entropy}$  $H(XY)$ , the following entropy functions can be derived here:

$\text{Definitions:}$ 

  • The  »conditional entropy« of the random variable  $X$  under condition  $Y$  is:
$$H(X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm}\frac{1}{P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y)}\right ] = \hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{\hspace{0.03cm}X \mid \hspace{0.03cm} Y} (x \hspace{-0.05cm}\mid \hspace{-0.05cm} y)}=\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{P_{Y}(y)}{P_{XY}(x, y)} \hspace{0.05cm}.$$
  • Similarly, for the  »second conditional entropy« we obtain:
$$H(Y \hspace{-0.1cm}\mid \hspace{-0.05cm} X) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm}\frac{1}{P_{\hspace{0.03cm}Y\hspace{0.03cm} \mid \hspace{0.01cm} X} (Y \hspace{-0.08cm}\mid \hspace{-0.05cm}X)}\right ] =\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{\hspace{0.03cm}Y\hspace{-0.03cm} \mid \hspace{-0.01cm} X} (y \hspace{-0.05cm}\mid \hspace{-0.05cm} x)}=\hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY}\hspace{-0.08cm})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{P_{X}(x)}{P_{XY}(x, y)} \hspace{0.05cm}.$$


The argument of the logarithm is always a conditional probability function   ⇒   $P_{X\hspace{0.03cm}| \hspace{0.03cm}Y}(·)$  or  $P_{Y\hspace{0.03cm}|\hspace{0.03cm}X}(·)$,  whereas the expected value is formed with the joint probability   ⇒   $P_{XY}(·)$.
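The split between the two roles (conditional probability inside the logarithm, joint probability as the weight) is easy to see in code. The following sketch assumes the same dictionary representation of a joint PMF as before; `conditional_entropy`, `swap` and the example PMF are hypothetical names and values used only for illustration.

```python
from math import log2

def conditional_entropy(p_xy):
    """H(X|Y) in bit for a joint PMF given as {(x, y): probability}."""
    # Marginal P_Y, needed for P_{X|Y}(x|y) = P_XY(x, y) / P_Y(y)
    p_y = {}
    for (x, y), p in p_xy.items():
        p_y[y] = p_y.get(y, 0.0) + p
    # Weight: joint probability p; logarithm: 1 / P_{X|Y} = P_Y(y) / P_XY(x, y)
    return sum(p * log2(p_y[y] / p) for (x, y), p in p_xy.items() if p > 0)

def swap(p_xy):
    """Exchange the roles of X and Y, so that H(Y|X) can be computed the same way."""
    return {(y, x): p for (x, y), p in p_xy.items()}

# Hypothetical joint PMF (not taken from the text)
p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}

h_x_given_y = conditional_entropy(p_xy)        # H(X|Y)
h_y_given_x = conditional_entropy(swap(p_xy))  # H(Y|X)
```

For this example, $H(X|Y) = 0.5$ bit and $H(Y|X) ≈ 0.689$ bit, so the two conditional entropies generally differ.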

For the conditional entropies, the following bounds hold:

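  • Both  $H(X|Y)$  and  $H(Y|X)$  are always greater than or equal to zero.  $H(X|Y) = 0$  holds exactly when  $X$  is completely determined by  $Y$;  this alone does not imply  $H(Y|X) = 0$. 
    Both conditional entropies are zero only if  $X$  and  $Y$  determine each other uniquely, i.e. if there is a one-to-one mapping between them.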
  • $H(X|Y) ≤ H(X)$  and  $H(Y|X) ≤ H(Y)$  always hold.  These statements are plausible if one realizes that  "uncertainty"  can be used synonymously for  "entropy":  the uncertainty with respect to the quantity  $X$  cannot be increased by knowing  $Y$. 
  • Equality  $H(X|Y) = H(X)$  holds only in the case of statistical independence;  otherwise  $H(X|Y) < H(X)$.  Because of  $H(X) ≤ H(XY)$  and  $H(Y) ≤ H(XY)$,  it also follows that  $H(X|Y) ≤ H(XY)$  and  $H(Y|X) ≤ H(XY)$.  Thus, a conditional entropy can never become larger than the joint entropy.
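The last bound can also be read off directly from the definition. As a short check (a standard rearrangement that uses only the sums given above), splitting the logarithm in the last form of  $H(X|Y)$  gives

$$H(X \hspace{-0.05cm}\mid \hspace{-0.05cm} Y) = \hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{XY}(x, y)} \hspace{0.2cm} - \hspace{-0.2cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY})} \hspace{-0.6cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{Y}(y)} = H(XY) - H(Y) \hspace{0.05cm}.$$

The second sum equals  $H(Y)$  because summing  $P_{XY}(x, y)$  over  $x$  yields  $P_Y(y)$.  Since  $H(Y) ≥ 0$,  this gives  $H(X|Y) ≤ H(XY)$;  the same argument with the roles of  $X$  and  $Y$  exchanged gives  $H(Y|X) = H(XY) - H(X) ≤ H(XY)$.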


$\text{Example 2:}$  We consider the joint probabilities  $P_{RS}(·)$  of our dice experiment, which were determined in the  "last chapter"  as  $\text{Example 6}$.  The corresponding  $P_{RS}(·)$  is given again in the middle of the following graph.

Joint probabilities  $P_{RS}$  and conditional probabilities  $P_{S \vert R}$  and  $P_{R \vert S}$

The two conditional probability functions are drawn on the outside:

$\text{On the left}$  you see the conditional probability mass function 

$$P_{S \vert R}(⋅) = P_{RS}(⋅)/P_R(⋅).$$
  • Because of  $P_R(R) = \big [1/6, \ 1/6, \ 1/6, \ 1/6, \ 1/6, \ 1/6 \big ]$,  every shaded field contains the probability  $1/6$.
  • That means:   $\text{supp}(P_{S\vert R}) = \text{supp}(P_{R\vert S})$. 
  • From this follows for the conditional entropy:
$$H(S \hspace{-0.1cm}\mid \hspace{-0.13cm} R) = \hspace{-0.2cm} \sum_{(r, s) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{RS})} \hspace{-0.6cm} P_{RS}(r, s) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{P_{\hspace{0.03cm}S \hspace{0.03cm} \mid \hspace{0.03cm} R} (s \hspace{-0.05cm}\mid \hspace{-0.05cm} r)} $$
$$\Rightarrow \hspace{0.3cm}H(S \hspace{-0.1cm}\mid \hspace{-0.13cm} R) = 36 \cdot \frac{1}{36} \cdot {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit} \hspace{0.05cm}.$$

$\text{On the right}$,  $P_{R\vert S}(⋅) = P_{RS}(⋅)/P_S(⋅)$  is given with  $P_S(⋅)$  according to  $\text{Example 6}$. 

  • $\text{supp}(P_{R\vert S}) = \text{supp}(P_{S\vert R})$   ⇒  same non-zero fields result.
  • However, the probability values now increase steadily from the center  $(1/6)$  towards the edges, up to  $1$  in the corners.
  • It follows that:
$$H(R \hspace{-0.1cm}\mid \hspace{-0.13cm} S) = \frac{6}{36} \cdot {\rm log}_2 \hspace{0.1cm} (6) + \frac{2}{36} \cdot \sum_{i=1}^5 \big [ i \cdot {\rm log}_2 \hspace{0.1cm} (i) \big ]= 1.896\ {\rm bit} \hspace{0.05cm}.$$

On the other hand, for the conditional entropies of the 2D random variable  $RB$  according to  $\text{Example 5}$,  one obtains because of  $P_{RB}(⋅) = P_R(⋅) · P_B(⋅)$:

$$\begin{align*}H(B \hspace{-0.1cm}\mid \hspace{-0.13cm} R) \hspace{-0.15cm} & = \hspace{-0.15cm} H(B) = {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit} \hspace{0.05cm},\\ H(R \hspace{-0.1cm}\mid \hspace{-0.13cm} B) \hspace{-0.15cm} & = \hspace{-0.15cm} H(R) = {\rm log}_2 \hspace{0.1cm} (6) = 2.585\ {\rm bit} \hspace{0.05cm}.\end{align*}$$
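The two conditional entropies of the dice experiment can also be checked numerically. The following is only a minimal sketch: the joint PMF $P_{RS}$ is rebuilt from the $36$ equally likely pairs $(R, B)$, and all variable names are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product
from math import log2

# Joint PMF P_RS of the dice experiment: R = red die, S = R + B (sum of both dice).
p_rs = {}
for r, b in product(range(1, 7), repeat=2):
    key = (r, r + b)
    p_rs[key] = p_rs.get(key, Fraction(0)) + Fraction(1, 36)

# Marginal PMFs P_R and P_S, obtained by summing the joint PMF.
p_r, p_s = {}, {}
for (r, s), p in p_rs.items():
    p_r[r] = p_r.get(r, Fraction(0)) + p
    p_s[s] = p_s.get(s, Fraction(0)) + p

# The joint probability weights the logarithm of the inverse conditional probability:
# 1 / P_{S|R}(s|r) = P_R(r) / P_RS(r, s)  and  1 / P_{R|S}(r|s) = P_S(s) / P_RS(r, s).
h_s_given_r = sum(float(p) * log2(float(p_r[r] / p)) for (r, s), p in p_rs.items())
h_r_given_s = sum(float(p) * log2(float(p_s[s] / p)) for (r, s), p in p_rs.items())

print(f"H(S|R) = {h_s_given_r:.3f} bit")
print(f"H(R|S) = {h_r_given_s:.3f} bit")
```

The sum runs only over the support of $P_{RS}$, because the dictionary contains no entries with probability zero.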


Mutual information between two random variables


We consider the two-dimensional random variable  $XY$  with PMF  $P_{XY}(X, Y)$. Let the one-dimensional functions  $P_X(X)$  and  $P_Y(Y)$ also be known.

Now the following questions arise:

  • How does the knowledge of the random variable  $Y$  reduce the uncertainty with respect to  $X$?
  • How does the knowledge of the random variable  $X$  reduce the uncertainty with respect to  $Y$?


To answer these questions, we need a definition that is fundamental to information theory:

$\text{Definition:}$  The  »mutual information« between the random variables  $X$  and  $Y$ – both over the same alphabet – is given as follows:

$$I(X;\ Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.08cm} \frac{P_{XY}(X, Y)} {P_{X}(X) \cdot P_{Y}(Y) }\right ] =\hspace{-0.25cm} \sum_{(x, y) \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{XY})} \hspace{-0.8cm} P_{XY}(x, y) \cdot {\rm log}_2 \hspace{0.08cm} \frac{P_{XY}(x, y)} {P_{X}(x) \cdot P_{Y}(y) } \hspace{0.01cm}.$$

A comparison with the  "last chapter"  shows that the mutual information can also be written as a  $\text{Kullback–Leibler distance}$  between the two-dimensional probability mass function  $P_{XY}$  and the product  $P_X · P_Y$  :

$$I(X;Y) = D(P_{XY} \hspace{0.05cm}\vert \vert \hspace{0.05cm} P_X \cdot P_Y) \hspace{0.05cm}.$$

Since a Kullback–Leibler distance is never negative,  $I(X;\ Y) ≥ 0$  always holds.  Because of the symmetry,   $I(Y;\ X)$ = $I(X;\ Y)$ is also true.


By splitting the  $\log_2$ argument according to

$$I(X;Y) = {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm} \frac{1} {P_{X}(X) }\right ] - {\rm E} \hspace{-0.1cm}\left [ {\rm log}_2 \hspace{0.1cm} \frac {P_{Y}(Y) }{P_{XY}(X, Y)} \right ] $$

one obtains, using  $P_{X|Y}(\cdot) = P_{XY}(\cdot)/P_Y(\cdot)$:

$$I(X;Y) = H(X) - H(X \hspace{-0.1cm}\mid \hspace{-0.1cm} Y) \hspace{0.05cm}.$$
  • This means:   The uncertainty regarding the random quantity  $X$   ⇒   entropy  $H(X)$  decreases to the remaining uncertainty  $H(X|Y)$  when  $Y$ is known.  The amount of this reduction is the mutual information  $I(X; Y)$.
  • With a different splitting, one arrives at the result
$$I(X;Y) = H(Y) - H(Y \hspace{-0.1cm}\mid \hspace{-0.1cm} X) \hspace{0.05cm}.$$
  • Ergo:   The mutual information  $I(X; Y)$  is symmetrical   ⇒   $X$  says just as much about  $Y$  as  $Y$  says about  $X$   ⇒   "mutual".  The semicolon indicates that both random variables are on an equal footing.


$\text{Conclusion:}$  Often the equations mentioned here are clarified by a diagram, as in the following examples. 
From this you can see that the following equations also apply:

$$I(X;\ Y) = H(X) + H(Y) - H(XY) \hspace{0.05cm},$$
$$I(X;\ Y) = H(XY) - H(X \hspace{-0.1cm}\mid \hspace{-0.1cm} Y) - H(Y \hspace{-0.1cm}\mid \hspace{-0.1cm} X) \hspace{0.05cm}.$$
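The four representations of the mutual information can be compared numerically. The following sketch uses a hypothetical $2 × 3$ joint PMF whose values are chosen only for illustration; the conditional entropies are computed from their definitions, not from the identities to be checked.

```python
from math import log2

# Hypothetical joint PMF P_XY (rows: x, columns: y); values are for illustration only.
p_xy = [[0.30, 0.10, 0.10],
        [0.05, 0.25, 0.20]]

p_x = [sum(row) for row in p_xy]           # marginal P_X
p_y = [sum(col) for col in zip(*p_xy)]     # marginal P_Y
cells = [(i, j, p) for i, row in enumerate(p_xy) for j, p in enumerate(row) if p > 0]

def entropy(pmf):
    """Entropy in bit; only elements of the support (p > 0) contribute."""
    return sum(p * log2(1 / p) for p in pmf if p > 0)

h_x, h_y = entropy(p_x), entropy(p_y)
h_xy = entropy([p for _, _, p in cells])

# Conditional entropies from their definitions: E[log2 1/P_{X|Y}] and E[log2 1/P_{Y|X}].
h_x_given_y = sum(p * log2(p_y[j] / p) for i, j, p in cells)
h_y_given_x = sum(p * log2(p_x[i] / p) for i, j, p in cells)

# Mutual information directly from the definition.
i_def = sum(p * log2(p / (p_x[i] * p_y[j])) for i, j, p in cells)

variants = {
    "definition":            i_def,
    "H(X) - H(X|Y)":         h_x - h_x_given_y,
    "H(Y) - H(Y|X)":         h_y - h_y_given_x,
    "H(X) + H(Y) - H(XY)":   h_x + h_y - h_xy,
    "H(XY) - H(X|Y) - H(Y|X)": h_xy - h_x_given_y - h_y_given_x,
}
for name, value in variants.items():
    print(f"{name:26s} {value:.6f} bit")
```

All five printed values should agree up to rounding errors of the floating-point arithmetic.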


$\text{Example 3:}$  We return  (for the last time)  to the  $\text{dice experiment}$  with the red  $(R)$  and blue  $(B)$  cube.  The random variable  $S$  gives the sum of the two dice:  $S = R + B$.  Here we consider the 2D random variable  $RS$. 

In earlier examples we calculated

  • the entropies  $H(R) = 2.585 \ \rm bit$  and  $H(S) = 3.274 \ \rm bit$   ⇒  $\text{Example 6}$  in the last chapter,
  • the joint entropy  $H(RS) = 5.170 \ \rm bit$   ⇒   $\text{Example 6}$  in the last chapter,
  • the conditional entropies  $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R) = 2.585 \ \rm bit$  and  $H(R \hspace{0.05cm} \vert \hspace{0.05cm} S) = 1.896 \ \rm bit$   ⇒   $\text{Example 2}$  in the previous section.
Diagram of all entropies of the "dice experiment"


These quantities are compiled in the graph, with the random quantity  $R$  marked by the basic color "red" and the sum  $S$  marked by the basic color "green".  Conditional entropies are shaded.  One can see from this representation:

  • The entropy  $H(R) = \log_2 (6) = 2.585\ \rm bit$  is exactly half as large as the joint entropy  $H(RS)$.  Because:  If one knows  $R$,  then  $S$  provides exactly the same information as the random quantity  $B$, namely  $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R) = H(B) = \log_2 (6) = 2.585\ \rm bit$. 
  • Note:   $H(R)$ = $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R)$  only applies in this example, not in general.
  • As expected, here the entropy  $H(S) = 3.274 \ \rm bit$  is greater than  $H(R)= 2.585\ \rm bit$.  Because of  $H(S) + H(R \hspace{0.05cm} \vert \hspace{0.05cm} S) = H(R) + H(S \hspace{0.05cm} \vert \hspace{0.05cm} R)$ ,  $H(R \hspace{0.05cm} \vert \hspace{0.05cm} S)$  must therefore be smaller than  $H(S \hspace{0.05cm} \vert \hspace{0.05cm} R)$  by   $I(R;\ S) = 0.689 \ \rm bit$ .   $H(R)$  is also smaller than  $H(S)$ by   $I(R;\ S) = 0.689 \ \rm bit$ .
  • The mutual information between the random variables  $R$  and  $S$  also results from the equation
$$I(R;\ S) = H(R) + H(S) - H(RS) = 2.585\ {\rm bit} + 3.274\ {\rm bit} - 5.170\ {\rm bit} = 0.689\ {\rm bit} \hspace{0.05cm}. $$


Conditional mutual information


We now consider three random variables  $X$,  $Y$  and  $Z$,  which may be related to each other.

$\text{Definition:}$  The   »conditional mutual information«   between the random variables  $X$  and  $Y$  for a given  $Z = z$  is as follows:

$$I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z = z) = H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z = z) - H(X\vert\hspace{0.05cm}Y ,\hspace{0.05cm} Z = z) \hspace{0.05cm}.$$

The  »conditional mutual information«  between the random variables  $X$  and  $Y$  given the random variable  $Z$  is obtained in general 
by averaging over all  $z \in Z$:

$$I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z ) = H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z ) - H(X\vert\hspace{0.05cm}Y Z )= \hspace{-0.3cm} \sum_{z \hspace{0.1cm}\in \hspace{0.1cm}{\rm supp} (P_{Z})} \hspace{-0.25cm} P_{Z}(z) \cdot I(X;Y \hspace{0.05cm}\vert\hspace{0.05cm} Z = z) \hspace{0.05cm}.$$

$P_Z(Z)$  is the probability mass function  $\rm (PMF)$  of the random variable  $Z$  and  $P_Z(z)$  is the  »probability«  for the realization  $Z = z$.


$\text{Please note:}$ 

  • For the conditional entropy, as is well known, the relation   $H(X\hspace{0.05cm}\vert\hspace{0.05cm}Z) ≤ H(X)$  holds.
  • For the mutual information, this relation does not necessarily hold:
        $I(X; Y\hspace{0.05cm}\vert\hspace{0.05cm}Z)$  can be  smaller, equal, but also larger than  $I(X; Y)$.


2D PMF  $P_{XZ}$

$\text{Example 4:}$  We consider the binary random variables  $X$,  $Y$  and  $Z$  with the following properties:

  • Let  $X$  and  $Y$  be statistically independent.  Let the following be true for their probability mass functions:
$$P_X(X) = \big [1/2, \ 1/2 \big], \hspace{0.2cm} P_Y(Y) = \big[1 - p, \ p \big] \ ⇒ \ H(X) = 1\ {\rm bit}, \hspace{0.2cm} H(Y) = H_{\rm bin}(p).$$
  • $Z$  is the modulo-2 sum of  $X$  and  $Y$:   $Z = X ⊕ Y$.


From the joint probability mass function  $P_{XZ}$  according to the upper graph, it follows:

  • Summing the column probabilities gives 
        $P_Z(Z) = \big [1/2, \ 1/2 \big ]$   ⇒   $H(Z) = 1\ {\rm bit}.$
  • For  $p = 1/2$,  $X$  and  $Z$  are also statistically independent, since the 2D PMF then satisfies 
        $P_{XZ}(X, Z) = P_X(X) · P_Z(Z)$. 
Conditional 2D PMF $P_{X\hspace{0.05cm}\vert\hspace{0.05cm}YZ}$
  • It follows that:
        $H(Z\hspace{0.05cm}\vert\hspace{0.05cm} X) = H(Z),\hspace{0.5cm}H(X \hspace{0.05cm}\vert\hspace{0.05cm} Z) = H(X),\hspace{0.5cm} I(X; Z) = 0.$


From the conditional probability mass function  $P_{X\vert YZ}$  according to the graph below, we can calculate:

  • $H(X\hspace{0.05cm}\vert\hspace{0.05cm} YZ) = 0$,  since all  $P_{X\hspace{0.05cm}\vert\hspace{0.05cm} YZ}$ entries are  $0$  or  $1$   ⇒   "conditional entropy",
  • $I(X; YZ) = H(X) - H(X\hspace{0.05cm}\vert\hspace{0.05cm} YZ) = H(X)= 1 \ {\rm bit}$   ⇒   "mutual information",
  • $I(X; Y\vert Z) = H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z) - H(X\hspace{0.05cm}\vert\hspace{0.05cm} YZ) = H(X\hspace{0.05cm}\vert\hspace{0.05cm} Z) =H(X)=1 \ {\rm bit} $   ⇒   "conditional mutual information".


In the present example:

The conditional mutual information  $I(X; Y\hspace{0.05cm}\vert\hspace{0.05cm} Z) = 1$  is greater than the conventional mutual information  $I(X; Y) = 0$.
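The result of this example can be reproduced with a few lines of code. The sketch below assumes  $p = 1/2$  (an assumption made here, for which  $X$  and  $Z$  are statistically independent); the helper `H` and the index constants are introduced only for this illustration.

```python
from itertools import product
from math import log2

p = 0.5  # assumed value of Pr(Y = 1); chosen here so that X and Z are independent

# Joint PMF P_XYZ with Z = X xor Y, X uniform, X and Y statistically independent.
p_xyz = {}
for x, y in product((0, 1), repeat=2):
    p_xyz[(x, y, x ^ y)] = 0.5 * (p if y == 1 else 1 - p)

def H(*keep):
    """Joint entropy in bit of the variables at the index positions `keep`."""
    marginal = {}
    for outcome, prob in p_xyz.items():
        sub = tuple(outcome[i] for i in keep)
        marginal[sub] = marginal.get(sub, 0.0) + prob
    return sum(q * log2(1 / q) for q in marginal.values() if q > 0)

X, Y, Z = 0, 1, 2  # index positions inside each outcome tuple

i_xy = H(X) + H(Y) - H(X, Y)   # I(X;Y)
i_xz = H(X) + H(Z) - H(X, Z)   # I(X;Z)
# I(X;Y|Z) = H(X|Z) - H(X|YZ) = [H(XZ) - H(Z)] - [H(XYZ) - H(YZ)]
i_xy_given_z = (H(X, Z) - H(Z)) - (H(X, Y, Z) - H(Y, Z))

print(f"I(X;Y)   = {i_xy:.3f} bit")
print(f"I(X;Z)   = {i_xz:.3f} bit")
print(f"I(X;Y|Z) = {i_xy_given_z:.3f} bit")
```

Changing `p` lets one explore how the three quantities behave for other PMFs of  $Y$.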


Chain rule of the mutual information


So far we have only considered the mutual information between two one-dimensional random variables.  Now we extend the definition to a total of  $n + 1$  random variables, which, only for reasons of representation, we denote with  $X_1$,  ... ,  $X_n$  and  $Z$.  Then the following applies:

$\text{Chain rule of mutual information:}$ 

The mutual information between the  $n$–dimensional random variable  $X_1 X_2 \hspace{0.05cm}\text{...} \hspace{0.05cm} X_n$  and the random variable  $Z$  can be represented and calculated as follows:

$$I(X_1\hspace{0.05cm}X_2\hspace{0.05cm}\text{...} \hspace{0.1cm}X_n;Z) = I(X_1;Z) + I(X_2;Z \vert X_1) + \hspace{0.05cm}\text{...} \hspace{0.1cm}+ I(X_n;Z\vert X_1\hspace{0.05cm}X_2\hspace{0.05cm}\text{...} \hspace{0.1cm}X_{n-1}) = \sum_{i = 1}^{n} I(X_i;Z \vert X_1\hspace{0.05cm}X_2\hspace{0.05cm}\text{...} \hspace{0.1cm}X_{i-1}) \hspace{0.05cm}.$$

$\text{Proof:}$  We restrict ourselves here to the case  $n = 2$, i.e. to a total of three random variables, and replace  $X_1$  by $X$ and  $X_2$  by  $Y$.  Then we obtain:

$$\begin{align*}I(X\hspace{0.05cm}Y;Z) & = H(XY) - H(XY\hspace{0.05cm} \vert \hspace{0.05cm}Z) = \\ & = \big [ H(X)+ H(Y\hspace{0.05cm} \vert \hspace{0.05cm} X)\big ] - \big [ H(X\hspace{0.05cm} \vert \hspace{0.05cm} Z) + H(Y\hspace{0.05cm} \vert \hspace{0.05cm} XZ)\big ] =\\ & = \big [ H(X)- H(X\hspace{0.05cm} \vert \hspace{0.05cm} Z)\big ] + \big [ H(Y\hspace{0.05cm} \vert \hspace{0.05cm} X) - H(Y\hspace{0.05cm} \vert \hspace{0.05cm}XZ)\big ]=\\ & = I(X;Z) + I(Y;Z \hspace{0.05cm} \vert \hspace{0.05cm} X) \hspace{0.05cm}.\end{align*}$$


  • From this equation one can see that the relation  $I(X Y; Z) ≥ I(X; Z)$  is always given.
  • Equality results for the conditional mutual information  $I(Y; Z \hspace{0.05cm} \vert \hspace{0.05cm} X) = 0$,  i.e. when the random variables  $Y$  and  $Z$  for a given  $X$  are statistically independent.
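The chain rule for  $n = 2$  can be checked numerically. The following sketch uses a hypothetical joint PMF of three binary random variables (weights  $1$, ... , $8$,  normalized); neither the PMF nor the helper names come from the chapter.

```python
from itertools import product
from math import log2

# Hypothetical joint PMF P_XYZ over three binary variables (weights 1..8, normalized).
outcomes = list(product((0, 1), repeat=3))
weights = [1, 2, 3, 4, 5, 6, 7, 8]
p_xyz = {o: w / sum(weights) for o, w in zip(outcomes, weights)}

def H(*keep):
    """Joint entropy in bit of the variables at the index positions `keep`."""
    marginal = {}
    for outcome, prob in p_xyz.items():
        sub = tuple(outcome[i] for i in keep)
        marginal[sub] = marginal.get(sub, 0.0) + prob
    return sum(q * log2(1 / q) for q in marginal.values() if q > 0)

X, Y, Z = 0, 1, 2  # index positions inside each outcome tuple

i_xy_z = H(X, Y) + H(Z) - H(X, Y, Z)                        # I(XY;Z)
i_x_z = H(X) + H(Z) - H(X, Z)                               # I(X;Z)
# I(Y;Z|X) = H(Y|X) - H(Y|XZ) = [H(XY) - H(X)] - [H(XYZ) - H(XZ)]
i_y_z_given_x = (H(X, Y) - H(X)) - (H(X, Y, Z) - H(X, Z))

print(f"I(XY;Z)            = {i_xy_z:.6f} bit")
print(f"I(X;Z) + I(Y;Z|X)  = {i_x_z + i_y_z_given_x:.6f} bit")
```

Both printed values should coincide, and  $I(XY; Z)$  should not fall below  $I(X; Z)$.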


$\text{Example 5:}$  We consider the  $\text{Markov chain}$   $X → Y → Z$.  For such a constellation, the  "Data Processing Theorem"  always holds with the following consequence, which can be derived from the chain rule of mutual information:

$$I(X;Z) \hspace{-0.05cm} \le \hspace{-0.05cm}I(X;Y ) \hspace{0.05cm},$$
$$I(X;Z) \hspace{-0.05cm} \le \hspace{-0.05cm} I(Y;Z ) \hspace{0.05cm}.$$

The theorem thus states:

  • One cannot gain any additional information about the input  $X$  by further processing the data  $Y$  into  $Z$   $(Y → Z)$.
  • Data processing  $Y → Z$  $($by a second processor$)$ only serves the purpose of making the information about  $X$  more visible.


For more information on the  "Data Processing Theorem"  see  "Exercise 3.15".
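The two inequalities can be illustrated with a simple Markov chain. The sketch below assumes two cascaded binary symmetric channels with crossover probabilities  $0.1$  $(X → Y)$  and  $0.2$  $(Y → Z)$  and a uniform input; these values and all function names are assumptions made for this illustration.

```python
from math import log2

def mutual_information(p_in, channel):
    """I(input; output) in bit; channel[i][j] = Pr(output = j | input = i)."""
    joint = [[p_in[i] * channel[i][j] for j in range(len(channel[0]))]
             for i in range(len(p_in))]
    p_out = [sum(col) for col in zip(*joint)]
    return sum(p * log2(p / (p_in[i] * p_out[j]))
               for i, row in enumerate(joint) for j, p in enumerate(row) if p > 0)

def bsc(eps):
    """Transition matrix of a binary symmetric channel with crossover probability eps."""
    return [[1 - eps, eps], [eps, 1 - eps]]

def cascade(a, b):
    """Overall transition matrix of a followed by b (valid for a Markov chain)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

p_x = [0.5, 0.5]                       # assumed uniform input PMF
ch_xy, ch_yz = bsc(0.1), bsc(0.2)      # assumed channels X -> Y and Y -> Z
p_y = [sum(p_x[i] * ch_xy[i][j] for i in range(2)) for j in range(2)]

i_xy = mutual_information(p_x, ch_xy)
i_yz = mutual_information(p_y, ch_yz)
i_xz = mutual_information(p_x, cascade(ch_xy, ch_yz))

print(f"I(X;Y) = {i_xy:.3f} bit,  I(Y;Z) = {i_yz:.3f} bit,  I(X;Z) = {i_xz:.3f} bit")
```

The matrix product in `cascade` relies on the Markov property: given  $Y$,  the output  $Z$  does not depend on  $X$.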


Exercises for the chapter


Exercise 3.7: Some Entropy Calculations

Exercise 3.8: Once more Mutual Information

Exercise 3.8Z: Tuples from Ternary Random Variables

Exercise 3.9: Conditional Mutual Information