Exercise 3.15: Data Processing Theorem

From LNTwww
 
Latest revision as of 13:22, 24 September 2021

"Data Processing Theorem"

We consider the following data processing chain:

  • Binary input data  $X$  is processed by processor  $1$  (top half of the graph),  which is described by the conditional probabilities  $P_{Y\hspace{0.03cm}|\hspace{0.03cm}X}(\cdot)$.  Its output variable is  $Y$.
  • A second processor with random variable  $Y$  at its input and random variable  $Z$  at its output is given by  $P_{Z\hspace{0.03cm}|\hspace{0.03cm}Y}(\cdot)$  (lower half of the graph).  $Z$  depends on  $Y$  alone  (deterministically or stochastically)  and, given  $Y$,  is independent of  $X$:
$$P_{Z\hspace{0.05cm}|\hspace{0.03cm} XY\hspace{-0.03cm}}(z\hspace{0.03cm}|\hspace{0.03cm} x, y) =P_{Z\hspace{0.05cm}|\hspace{0.03cm} Y\hspace{-0.03cm}}(z\hspace{0.03cm}|\hspace{0.03cm} y) \hspace{0.05cm}.$$

The following nomenclature was used for this description:

$$x \in X = \{0, 1\}\hspace{0.02cm},\hspace{0.3cm} y \in Y = \{0,1\}\hspace{0.02cm},\hspace{0.3cm} z \in Z = \{0, 1\}\hspace{0.02cm}.$$

The joint probability mass function is:

$$P_{XYZ}(x, y, z) = P_{X}(x) \cdot P_{Y\hspace{0.05cm}|\hspace{0.03cm} X\hspace{-0.03cm}}(y\hspace{0.03cm}|\hspace{0.03cm} x)\cdot P_{Z\hspace{0.05cm}|\hspace{0.03cm} Y\hspace{-0.03cm}}(z\hspace{0.03cm}|\hspace{0.03cm} y) \hspace{0.05cm}.$$

This also means:

$X → Y → Z$  form a  Markov chain.  For such a chain, the  "Data Processing Theorem"  holds with the following consequence:

$$I(X;Z) \le I(X;Y ) \hspace{0.05cm}, $$
$$I(X;Z) \le I(Y;Z ) \hspace{0.05cm}.$$

The theorem thus states:

  • One cannot gain additional information about input  $X$  by manipulating   ("processing")  data  $Y$ .
  • Data processing  (by processor   $1$)  serves only the purpose of making the information contained in  $X$  more visible.
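
The two inequalities can be checked numerically from the joint probability mass function given above. The following is a minimal sketch, not part of the original exercise: the helper names `joint_pmf`, `mutual_information` and `bsc` are hypothetical, and the input distribution $(0.3, \ 0.7)$ as well as the transition probabilities are assumed values chosen only for illustration.

```python
from itertools import product
from math import log2


def joint_pmf(p_x, p_y_given_x, p_z_given_y):
    """P_XYZ(x, y, z) = P_X(x) * P_Y|X(y|x) * P_Z|Y(z|y) for binary alphabets."""
    return {
        (x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
        for x, y, z in product((0, 1), repeat=3)
    }


def mutual_information(joint, i, j):
    """Mutual information (in bit) between coordinates i and j of the joint PMF."""
    p_ab, p_a, p_b = {}, {}, {}
    for outcome, prob in joint.items():
        a, b = outcome[i], outcome[j]
        p_ab[(a, b)] = p_ab.get((a, b), 0.0) + prob
        p_a[a] = p_a.get(a, 0.0) + prob
        p_b[b] = p_b.get(b, 0.0) + prob
    return sum(
        prob * log2(prob / (p_a[a] * p_b[b]))
        for (a, b), prob in p_ab.items()
        if prob > 0
    )


def bsc(e):
    """Transition matrix of a binary symmetric processor with error probability e."""
    return [[1 - e, e], [e, 1 - e]]


# Assumed example values: non-uniform input, two binary symmetric processors.
joint = joint_pmf([0.3, 0.7], bsc(0.1), bsc(0.2))

i_xy = mutual_information(joint, 0, 1)
i_yz = mutual_information(joint, 1, 2)
i_xz = mutual_information(joint, 0, 2)
print(f"I(X;Y) = {i_xy:.3f} bit, I(Y;Z) = {i_yz:.3f} bit, I(X;Z) = {i_xz:.3f} bit")
```

Because the joint PMF is built as the product $P_X \cdot P_{Y|X} \cdot P_{Z|Y}$, the Markov property holds by construction, so the computed  $I(X;Z)$  should not exceed either of the other two values.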





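Hints:

  • The exercise belongs to the chapter  Application to Digital Signal Transmission.
  • Reference is also made to the page  Chain rule of the mutual information  in the previous chapter.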



Questions

1

How can the result  $I(X; Y) = 1 - H_{\rm bin}(p)$  be interpreted?

The derivation is based on the properties of a strictly symmetric channel.
It is exploited that  $H_{\rm bin}(p)$  is a concave function.
The result is valid for any probability mass function  $P_X(X)$.

2

Which mutual information  $I(X; Y)$  results for the first processor with  $p = 0.1$?

$ I(X; Y) \ = \ $

$\ \rm bit$

3

Which mutual information  $I(Y; Z)$  results for the second processor with  $q = 0.2$?

$I(Y; Z) \ = \ $

$\ \rm bit$

4

Which mutual information  $I(X; Z)$  results for the whole system with  $p = 0.1$  and  $q = 0.2$?

$I(X; Z) \ = \ $

$\ \rm bit$

5

Does this example satisfy the  "Data Processing Theorem"?

Yes,
No.


Solution

(1)  Only proposed solution 1 is correct:

  • Both processors describe strictly symmetric channels   ⇒   both uniformly dispersive and uniformly focusing.
  • For such a binary channel, with  $Y = \{0, 1\} \ ⇒ \ |Y| = 2$:
$$I(X;Y) = 1 + \sum_{y \hspace{0.05cm}\in\hspace{0.05cm} Y} \hspace{0.1cm} P_{Y\hspace{0.03cm}|\hspace{0.03cm} X}(y\hspace{0.03cm}|\hspace{0.03cm}x) \cdot {\rm log}_2 \hspace{0.1cm}P_{\hspace{0.01cm}Y \hspace{0.03cm}|\hspace{0.03cm} X}(y\hspace{0.03cm}|\hspace{0.03cm}x) \hspace{0.05cm}.$$
  • Here it does not matter at all whether one starts from  $X = 0$  or from  $X = 1$ . 
  • For  $X = 0$  one obtains with  $P_{Y\hspace{0.03cm}|\hspace{0.03cm}X}(Y = 1\hspace{0.03cm}|\hspace{0.03cm}X = 0) = p$  and  $P_{Y\hspace{0.03cm}|\hspace{0.03cm}X}(Y = 0\hspace{0.03cm}|\hspace{0.03cm}X = 0) = 1 - p\hspace{0.05cm}$:
$$ I(X;Y) = 1 + p \cdot {\rm log}_2 \hspace{0.1cm} (p) + (1-p) \cdot {\rm log}_2 \hspace{0.1cm} (1-p) = 1 - H_{\rm bin}(p)\hspace{0.05cm}, \hspace{1.0cm}H_{\rm bin}(p)= p \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{p}+ (1-p) \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{1-p}\hspace{0.05cm}.$$
  • However, the result holds only for  $P_X(X) = (0.5, \ 0.5)$   ⇒   maximum mutual information   ⇒   channel capacity.
  • Otherwise,  $I(X; Y)$  is smaller.     For example, for $P_X(X) = (1, \ 0)$:   $H(X) = 0$   ⇒   $I(X; Y) = 0.$
  • The binary entropy function is concave, but this property was not used in the derivation   ⇒   answer 2 is incorrect.
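
The dependence on the input distribution can be illustrated with a short sketch. It assumes the standard decomposition  $I(X;Y) = H(Y) - H(Y|X)$, which for a binary symmetric channel gives  $H(Y|X) = H_{\rm bin}(p)$  for every input distribution. The function names and the skewed input  $(0.8, \ 0.2)$  are hypothetical choices for this illustration.

```python
from math import log2


def h_bin(p):
    """Binary entropy function in bit; H_bin(0) = H_bin(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))


def bsc_mutual_information(p0, p):
    """I(X;Y) = H(Y) - H(Y|X) for a BSC with error probability p and Pr(X = 0) = p0."""
    pr_y0 = p0 * (1 - p) + (1 - p0) * p  # Pr(Y = 0)
    return h_bin(pr_y0) - h_bin(p)


p = 0.1
uniform = bsc_mutual_information(0.5, p)     # P_X = (0.5, 0.5)
skewed = bsc_mutual_information(0.8, p)      # P_X = (0.8, 0.2), assumed example
degenerate = bsc_mutual_information(1.0, p)  # P_X = (1, 0)

print(f"uniform: {uniform:.3f} bit, skewed: {skewed:.3f} bit, degenerate: {degenerate:.3f} bit")
```

Only the uniform input reaches  $1 - H_{\rm bin}(p)$;  the degenerate input yields zero up to floating-point rounding.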


(2)  For processor  $1$ ,  $p = 0.1\hspace{0.05cm}$ gives:

$$ I(X;Y) = 1 - H_{\rm bin}(0.1) = 1 - 0.469 \hspace{0.15cm} \underline {=0.531 \,{\rm (bit)}} \hspace{0.05cm}.$$


(3)  Correspondingly, for the second processor with  $q = 0.2\hspace{0.05cm}$:

$$I(Y;Z) = 1 - H_{\rm bin}(0.2) = 1 - 0.722 \hspace{0.15cm} \underline {=0.278 \,{\rm (bit)}} \hspace{0.05cm}.$$
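
The numerical values of subtasks  (2)  and  (3)  can be reproduced with a few lines. This is only a sketch; the function name `h_bin` is our own choice.

```python
from math import log2


def h_bin(p):
    """Binary entropy function in bit; H_bin(0) = H_bin(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))


i_xy = 1 - h_bin(0.1)  # processor 1, p = 0.1
i_yz = 1 - h_bin(0.2)  # processor 2, q = 0.2

print(f"{i_xy:.3f} {i_yz:.3f}")  # prints: 0.531 0.278
```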


(4)  The probability of  $Z = 0$  given  $X = 0$  is the sum over the two possible paths  (via  $Y = 0$  and via  $Y = 1$):

$$P(\hspace{0.01cm}Z\hspace{-0.05cm} = 0\hspace{0.03cm} | \hspace{0.03cm} X\hspace{-0.05cm} = \hspace{-0.05cm}0) = (1-p) \cdot (1-q) + p \cdot q = 1 - p - q + 2pq \stackrel{!}{=} 1 - \varepsilon \hspace{0.05cm}.$$
  • The overall system then has exactly the same BSC structure as processors $1$ and $2$, but now with falsification probability  $\varepsilon = p + q - 2pq \hspace{0.05cm}.$
  • For  $p = 0.1$  and  $q = 0.2$ , we obtain:
$$ \varepsilon = 0.1 + 0.2 - 2\cdot 0.1 \cdot 0.2 = 0.26 \hspace{0.3cm}\Rightarrow \hspace{0.3cm} I(X;Z) = 1 - H_{\rm bin}(0.26) = 1 - 0.827 \hspace{0.15cm} \underline {=0.173 \,{\rm (bit)}} \hspace{0.05cm}.$$
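
The falsification probability of the overall system can also be obtained by multiplying the two transition matrices, since  $P_{Z|X}(z|x) = \sum_y P_{Y|X}(y|x) \cdot P_{Z|Y}(z|y)$  for a Markov chain. The following sketch uses hypothetical helper names and assumes equally probable input symbols for the final step.

```python
from math import log2


def h_bin(p):
    """Binary entropy function in bit; H_bin(0) = H_bin(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))


def bsc_matrix(e):
    """Row-stochastic transition matrix of a BSC with falsification probability e."""
    return [[1 - e, e], [e, 1 - e]]


def cascade(a, b):
    """Matrix product a * b, i.e. P_Z|X obtained from P_Y|X and P_Z|Y."""
    return [
        [sum(a[x][y] * b[y][z] for y in range(2)) for z in range(2)]
        for x in range(2)
    ]


p, q = 0.1, 0.2
p_z_given_x = cascade(bsc_matrix(p), bsc_matrix(q))
eps = p_z_given_x[0][1]  # Pr(Z = 1 | X = 0)
i_xz = 1 - h_bin(eps)    # valid for P_X = (0.5, 0.5)

print(f"eps = {eps:.2f}, I(X;Z) = {i_xz:.3f} bit")  # prints: eps = 0.26, I(X;Z) = 0.173 bit
```

The off-diagonal entry of the product matrix agrees with the closed form  $\varepsilon = p + q - 2pq$.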


(5)  The answer is  YES,  because the  "Data Processing Theorem" assumes exactly the conditions given here.

However, we want to evaluate some additional numerical results:

  • If  $0 < p < 0.5$  and  $0 < q < 0.5$  hold, we obtain  (for  $q = 0$  or  $p = 0$  the respective inequality becomes an equality):
$$\varepsilon = p + q \cdot (1-2p) > p \hspace{0.3cm}\Rightarrow \hspace{0.3cm} I(X;Z) < I(X;Y) \hspace{0.05cm},$$
$$\varepsilon = q + p \cdot (1-2q) > q \hspace{0.3cm}\Rightarrow \hspace{0.3cm} I(X;Z) < I(Y;Z) \hspace{0.05cm}.$$
  • For  $p = 0.5$,  the following holds independently of  $q$,  since  $I(X; Z)$  cannot be greater than  $I(X; Y)$:
$$\varepsilon = 0.5 + q \cdot (1-1) = 0.5 \hspace{0.3cm}\Rightarrow \hspace{0.3cm} I(X;Y) = 0 \hspace{0.3cm}\Rightarrow \hspace{0.3cm} I(X;Z) = 0 \hspace{0.05cm}.$$
  • Similarly, with  $q = 0.5$  independent of  $p$, we obtain:
$$\varepsilon = 0.5 + p \cdot (1-1) = 0.5 \hspace{0.3cm}\Rightarrow \hspace{0.3cm} I(Y;Z) = 0 \hspace{0.3cm}\Rightarrow \hspace{0.3cm} I(X;Z) = 0 \hspace{0.05cm}.$$
  • Also for  $p < 0.5$  and  $q > 0.5$  the Data Processing Theorem is not violated, which will be shown here only by an example:
With  $p = 0.1$  and  $q = 0.8$ , the same result is obtained as in subtask  (4):
$$\varepsilon = 0.1 + 0.8 - 2\cdot 0.1 \cdot 0.8 = 0.74 \hspace{0.3cm} \Rightarrow \hspace{0.3cm} I(X;Z) = 1 - H_{\rm bin}(0.74) = 1 - H_{\rm bin}(0.26) =0.173 \,{\rm (bit)} \hspace{0.05cm}.$$
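
Instead of a single example, the inequalities can be checked on a grid of  $(p, \ q)$  pairs. The sketch below assumes equally probable input symbols, so that all three mutual informations have the form  $1 - H_{\rm bin}(\cdot)$. The grid spacing, the tolerance and the function names are arbitrary choices for this illustration.

```python
from math import log2


def h_bin(p):
    """Binary entropy function in bit; H_bin(0) = H_bin(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))


def cascade_informations(p, q):
    """Return (I(X;Y), I(Y;Z), I(X;Z)) in bit for two cascaded BSCs, uniform input."""
    eps = p + q - 2 * p * q
    return 1 - h_bin(p), 1 - h_bin(q), 1 - h_bin(eps)


grid = [k / 20 for k in range(21)]  # p, q in {0, 0.05, ..., 1}
tolerance = 1e-12                   # absorbs floating-point rounding

violations = []
for p in grid:
    for q in grid:
        i_xy, i_yz, i_xz = cascade_informations(p, q)
        if i_xz > min(i_xy, i_yz) + tolerance:
            violations.append((p, q))

print("violations:", violations)

# Symmetry H_bin(0.74) = H_bin(0.26): q = 0.8 and q = 0.2 give the same I(X;Z).
same = abs(cascade_informations(0.1, 0.8)[2] - cascade_informations(0.1, 0.2)[2])
print(f"difference between q = 0.8 and q = 0.2: {same:.1e} bit")
```

An empty list of violations is what the Data Processing Theorem predicts, including the region  $q > 0.5$.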