Difference between revisions of "Information Theory/Discrete Memoryless Sources"

From LNTwww
 
(10 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
{{FirstPage}}
 
{{FirstPage}}
 
{{Header
 
{{Header
|Untermenü=Entropie wertdiskreter Nachrichtenquellen
+
|Untermenü=Entropy of Discrete Sources
 
|Vorherige Seite=
 
|Vorherige Seite=
|Nächste Seite=Nachrichtenquellen mit Gedächtnis
+
|Nächste Seite=Discrete Sources with Memory
 
}}
 
}}
  
Line 12: Line 12:
 
In detail are discussed:
 
In detail are discussed:
  
*The  »decision content«  and the  »entropy«  of a discrete memoryless source,
+
:*The  »information content«  of a symbol and the  »entropy«  of a discrete memoryless source,
*the  »binary entropy function«  and its application to non-binary sources,
+
:*the  »binary entropy function«  and its application to non-binary sources,
*the entropy calculation for  »sources with memory«  and suitable approximations,
+
:*the entropy calculation for  »sources with memory«  and suitable approximations,
*the special features of  »Markov sources«  regarding the entropy calculation,
+
:*the special features of  »Markov sources«  regarding the entropy calculation,
*the procedure for sources with a large number of symbols, for example  »natural texts«,
+
:*the procedure for sources with a large number of symbols, for example  »natural texts«,
*the  »entropy estimates«  according to Shannon and Küpfmüller.
+
:*the  »entropy estimates«  according to Shannon and Küpfmüller.
  
  
Line 72: Line 72:
 
 
 
 
  
==Decision content of a discrete source==
+
==Maximum entropy of a discrete source==
 
<br>
 
<br>
 
[https://de.wikipedia.org/wiki/Claude_Shannon Claude Elwood Shannon]&nbsp; defined in 1948 in the standard work of information theory&nbsp; [Sha48]<ref name='Sha48'>Shannon, C.E.: A Mathematical Theory of Communication. In: Bell Syst. Techn. J. 27 (1948), pp. 379-423 and pp. 623-656.</ref>&nbsp; the concept of information as&nbsp; "decrease of uncertainty about the occurrence of a statistical event".  
 
[https://de.wikipedia.org/wiki/Claude_Shannon Claude Elwood Shannon]&nbsp; defined in 1948 in the standard work of information theory&nbsp; [Sha48]<ref name='Sha48'>Shannon, C.E.: A Mathematical Theory of Communication. In: Bell Syst. Techn. J. 27 (1948), pp. 379-423 and pp. 623-656.</ref>&nbsp; the concept of information as&nbsp; "decrease of uncertainty about the occurrence of a statistical event".  
Line 88: Line 88:
  
 
{{BlaueBox|TEXT=   
 
{{BlaueBox|TEXT=   
$\text{Definition:}$&nbsp; The&nbsp; '''decision content''' &nbsp; of a message source depends only on the symbol range&nbsp; $M$&nbsp; and results in
+
$\text{Definition:}$&nbsp; The&nbsp; '''maximum average information content''' &nbsp; of a message source depends only on the symbol set size&nbsp; $M$&nbsp; and results in
 
   
 
   
 
:$$H_0 = {\rm log}\hspace{0.1cm}M = {\rm log}_2\hspace{0.1cm}M \hspace{0.15cm} {\rm (in \ &#8220;bit")}
 
:$$H_0 = {\rm log}\hspace{0.1cm}M = {\rm log}_2\hspace{0.1cm}M \hspace{0.15cm} {\rm (in \ &#8220;bit")}
Line 94: Line 94:
 
= {\rm lg}\hspace{0.1cm}M \hspace{0.15cm}\text {(in &#8220;Hartley")}\hspace{0.05cm}.$$
 
= {\rm lg}\hspace{0.1cm}M \hspace{0.15cm}\text {(in &#8220;Hartley")}\hspace{0.05cm}.$$
  
*Since&nbsp; $H_0$&nbsp; indicates the maximum value of the&nbsp; [[Information_Theory/Sources with Memory#Information_Content_and_Entropy|entropy]]&nbsp; $H$&nbsp;, $H_\text{max}$&nbsp; is also used in our tutorial as short notation&nbsp;. }}
+
*Since&nbsp; $H_0$&nbsp; indicates the maximum value of the&nbsp; [[Information_Theory/Sources with Memory#Information_Content_and_Entropy|entropy]]&nbsp; $H$,&nbsp; $H_\text{max}=H_0$&nbsp; is also used in our tutorial as short notation. }}
  
  
 
Please note our nomenclature:
 
Please note our nomenclature:
*The logarithm will be called "log" in the following, independent of the base.  
+
*The logarithm will be called&nbsp; &raquo;log&laquo;&nbsp; in the following, independent of the base.  
 
*The relations mentioned above are fulfilled due to the following properties:
 
*The relations mentioned above are fulfilled due to the following properties:
 
   
 
   
Line 105: Line 105:
 
{\rm log}\hspace{0.1cm}M^k = k \cdot {\rm log}\hspace{0.1cm}M \hspace{0.05cm}.$$
 
{\rm log}\hspace{0.1cm}M^k = k \cdot {\rm log}\hspace{0.1cm}M \hspace{0.05cm}.$$
  
* Usually we use the logarithm to the base&nbsp; $2$ &nbsp; ⇒ &nbsp; ''Logarithm dualis''&nbsp; $\rm (ld)$, where the pseudo unit "bit", more precisely:&nbsp; "bit/symbol", is then added:
+
* Usually we use the logarithm to the base&nbsp; $2$ &nbsp; ⇒ &nbsp; &raquo;logarithm dualis&laquo;&nbsp; &nbsp; $\rm (ld)$,&nbsp; where the pseudo unit&nbsp; "bit"&nbsp; $($more precisely:&nbsp; "bit/symbol"$)$&nbsp; is then added:
 
   
 
   
 
:$${\rm ld}\hspace{0.1cm}M = {\rm log_2}\hspace{0.1cm}M = \frac{{\rm lg}\hspace{0.1cm}M}{{\rm lg}\hspace{0.1cm}2}
 
:$${\rm ld}\hspace{0.1cm}M = {\rm log_2}\hspace{0.1cm}M = \frac{{\rm lg}\hspace{0.1cm}M}{{\rm lg}\hspace{0.1cm}2}
Line 111: Line 111:
 
  \hspace{0.05cm}.$$
 
  \hspace{0.05cm}.$$
  
*In addition, you can find in the literature some additional definitions, which are based on the natural logarithm&nbsp; $\rm (ln)$&nbsp; or the logarithm&nbsp; $\rm (lg)$&nbsp;.
+
*In addition, you can find in the literature some additional definitions, which are based on the natural logarithm&nbsp; $\rm (ln)$&nbsp; or the logarithm of the tens&nbsp; $\rm (lg)$.
 
   
 
   
 
==Information content and entropy ==
 
==Information content and entropy ==
Line 119: Line 119:
 
:$$p_1 > p_2 > \hspace{0.05cm} \text{ ...}\hspace{0.05cm} > p_\mu > \hspace{0.05cm} \text{ ...}\hspace{0.05cm} > p_{M-1} > p_M\hspace{0.05cm},\hspace{0.4cm}\sum_{\mu = 1}^M p_{\mu} = 1 \hspace{0.05cm}.$$
 
:$$p_1 > p_2 > \hspace{0.05cm} \text{ ...}\hspace{0.05cm} > p_\mu > \hspace{0.05cm} \text{ ...}\hspace{0.05cm} > p_{M-1} > p_M\hspace{0.05cm},\hspace{0.4cm}\sum_{\mu = 1}^M p_{\mu} = 1 \hspace{0.05cm}.$$
  
We now consider the ''information content''&nbsp; of the individual symbols, where we denote the "logarithm dualis" with $\log_2$:
+
We now consider the '''information content'''&nbsp; of the individual symbols, where we denote the&nbsp; "logarithm dualis"&nbsp; with&nbsp; $\log_2$:
 
   
 
   
 
:$$I_\mu = {\rm log_2}\hspace{0.1cm}\frac{1}{p_\mu}= -\hspace{0.05cm}{\rm log_2}\hspace{0.1cm}{p_\mu}
 
:$$I_\mu = {\rm log_2}\hspace{0.1cm}\frac{1}{p_\mu}= -\hspace{0.05cm}{\rm log_2}\hspace{0.1cm}{p_\mu}
Line 126: Line 126:
  
 
You can see:
 
You can see:
*because of&nbsp; $p_μ ≤ 1$&nbsp; the information content is never negative.&nbsp; In the borderline case&nbsp; $p_μ \to 1$&nbsp; goes&nbsp; $I_μ \to 0$.  
+
*Because of&nbsp; $p_μ ≤ 1$&nbsp; the information content is never negative.&nbsp; In the borderline case&nbsp; $p_μ \to 1$&nbsp; goes&nbsp; $I_μ \to 0$.  
*However for&nbsp; $I_μ = 0$ &nbsp; &rArr; &nbsp; $p_μ = 1$ &nbsp; &rArr; &nbsp; $M = 1$&nbsp; the decision content is also&nbsp; $H_0 = 0$.
+
*However, for&nbsp; $I_μ = 0$ &nbsp; &rArr; &nbsp; $p_μ = 1$ &nbsp; &rArr; &nbsp; $M = 1$&nbsp; the information content is also&nbsp; $H_0 = 0$.
 
*For decreasing probabilities&nbsp; $p_μ$&nbsp; the information content increases continuously:
 
*For decreasing probabilities&nbsp; $p_μ$&nbsp; the information content increases continuously:
 
   
 
   
Line 139: Line 139:
  
  
However, the information content of a single symbol (or event) is not very interesting.&nbsp; On the other hand  
+
However, the information content of a single symbol (or event) is not very interesting.&nbsp; On the other hand one of the central quantities of information theory is obtained,
 
*by ensemble averaging over all possible symbols&nbsp; $q_μ$ &nbsp;bzw.&nbsp;  
 
*by ensemble averaging over all possible symbols&nbsp; $q_μ$ &nbsp;bzw.&nbsp;  
*by time averaging over all elements of the sequence&nbsp; $\langle q_ν \rangle$
+
*by time averaging over all elements of the sequence&nbsp; $\langle q_ν \rangle$.
  
one of the central variables of information theory.
 
  
 
{{BlaueBox|TEXT=   
 
{{BlaueBox|TEXT=   
$\text{Definition:}$&nbsp; The&nbsp; '''Entropy'''&nbsp; $H$&nbsp; of a source indicates the ''mean information content of all symbols''&nbsp;:
+
$\text{Definition:}$&nbsp; The&nbsp; '''entropy'''&nbsp; $H$&nbsp; of a discrete-value source indicates the&nbsp; '''mean information content of all symbols''':
 
   
 
   
 
:$$H = \overline{I_\nu} = {\rm E}\hspace{0.01cm}[I_\mu] = \sum_{\mu = 1}^M p_{\mu} \cdot {\rm log_2}\hspace{0.1cm}\frac{1}{p_\mu}=
 
:$$H = \overline{I_\nu} = {\rm E}\hspace{0.01cm}[I_\mu] = \sum_{\mu = 1}^M p_{\mu} \cdot {\rm log_2}\hspace{0.1cm}\frac{1}{p_\mu}=
Line 152: Line 151:
 
\hspace{0.05cm}.$$
 
\hspace{0.05cm}.$$
  
The overline marks again a time averaging and&nbsp; $\rm E[\text{...}]$&nbsp; a ensemble averaging.}}
+
The overline marks again a time averaging and&nbsp; $\rm E[\text{...}]$&nbsp; an ensemble averaging.}}
  
  
 
Entropy is among other things a measure for
 
Entropy is among other things a measure for
 
*the mean uncertainty about the outcome of a statistical event,
 
*the mean uncertainty about the outcome of a statistical event,
*the "randomness" of this event,&nbsp; and
+
*the&nbsp; "randomness"&nbsp; of this event,&nbsp; and
 
*the average information content of a random variable.  
 
*the average information content of a random variable.  
 
   
 
   
Line 163: Line 162:
 
==Binary entropy function ==
 
==Binary entropy function ==
 
<br>
 
<br>
At first we will restrict ourselves to the special case&nbsp; $M = 2$&nbsp; and consider a binary source, which returns the two symbols&nbsp; $\rm A$&nbsp; and&nbsp; $\rm B$&nbsp; &nbsp; The occurrence probabilities are &nbsp; $p_{\rm A} = p$&nbsp; and&nbsp; $p_{\rm B} = 1 - p$.
+
At first we will restrict ourselves to the special case&nbsp; $M = 2$&nbsp; and consider a binary source, which returns the two symbols&nbsp; $\rm A$&nbsp; and&nbsp; $\rm B$.&nbsp; The symbol probabilities are &nbsp; $p_{\rm A} = p$&nbsp; and &nbsp; $p_{\rm B} = 1 - p$.
  
 
For the entropy of this binary source applies:  
 
For the entropy of this binary source applies:  
 
   
 
   
:$$H_{\rm bin} (p) = p \cdot {\rm log_2}\hspace{0.1cm}\frac{1}{\hspace{0.1cm}p\hspace{0.1cm}} + (1-p) \cdot {\rm log_2}\hspace{0.1cm}\frac{1}{1-p} \hspace{0.5cm}{\rm (unit\hspace{-0.15cm}: \hspace{0.15cm}bit\hspace{0.15cm}or\hspace{0.15cm}bit/Symbol)}
+
:$$H_{\rm bin} (p) = p \cdot {\rm log_2}\hspace{0.1cm}\frac{1}{\hspace{0.1cm}p\hspace{0.1cm}} + (1-p) \cdot {\rm log_2}\hspace{0.1cm}\frac{1}{1-p} \hspace{0.5cm}{\rm (unit\hspace{-0.15cm}: \hspace{0.15cm}bit\hspace{0.15cm}or\hspace{0.15cm}bit/symbol)}
 
\hspace{0.05cm}.$$
 
\hspace{0.05cm}.$$
  
The function is called&nbsp; $H_\text{bin}(p)$&nbsp; the&nbsp; '''binary entropy function'''.&nbsp; The entropy of a source with a larger symbol range&nbsp; $M$&nbsp; can often be expressed using&nbsp; $H_\text{bin}(p)$&nbsp;.
+
This function is called&nbsp; $H_\text{bin}(p)$&nbsp; the&nbsp; '''binary entropy function'''.&nbsp; The entropy of a source with a larger symbol set size&nbsp; $M$&nbsp; can often be expressed using&nbsp; $H_\text{bin}(p)$&nbsp;.
  
 
{{GraueBox|TEXT=   
 
{{GraueBox|TEXT=   
 
$\text{Example 2:}$&nbsp;
 
$\text{Example 2:}$&nbsp;
The figure shows the binary entropy function for the values&nbsp; $0 ≤ p ≤ 1$&nbsp; of the symbol probability of&nbsp; $\rm A$&nbsp; $($or also of&nbsp; $\rm B)$.&nbsp; You can see
+
The figure shows the binary entropy function for the values&nbsp; $0 ≤ p ≤ 1$&nbsp; of the symbol probability of&nbsp; $\rm A$&nbsp; $($or also of&nbsp; $\rm B)$.&nbsp; You can see:
  
[[File:EN_Inf_T_1_1_S4.png|frame|Binary entropy function as function of&nbsp; $p$|right]]
+
[[File:EN_Inf_T_1_1_S4.png|frame|Binary entropy function as a function of&nbsp; $p$|right]]
*The maximum value&nbsp; $H_\text{max} = 1\; \rm bit$&nbsp; results for&nbsp; $p = 0.5$, thus for equally probable binary symbols.&nbsp; Then &nbsp; $\rm A$&nbsp; and&nbsp; $\rm B$&nbsp; contribute the same amount to entropy.
+
*The maximum value&nbsp; $H_\text{max} = 1\; \rm bit$&nbsp; results for&nbsp; $p = 0.5$, thus for equally probable binary symbols.&nbsp; Then &nbsp; $\rm A$&nbsp; and&nbsp; $\rm B$&nbsp; contribute the same amount to the entropy.
* $H_\text{bin}(p)$&nbsp; is symmetrical about&nbsp; $p = 0.5$.&nbsp; A source with&nbsp; $p_{\rm A} = 0.1$&nbsp; and&nbsp; $p_{\rm B} = 0. 9$&nbsp; has the same entropy&nbsp; $H = 0.469 \; \rm bit$&nbsp; as a source with&nbsp; $p_{\rm A} = 0.9$&nbsp; and&nbsp; $p_{\rm B} = 0.1$.
+
* $H_\text{bin}(p)$&nbsp; is symmetrical around&nbsp; $p = 0.5$.&nbsp; A source with&nbsp; $p_{\rm A} = 0.1$&nbsp; and&nbsp; $p_{\rm B} = 0. 9$&nbsp; has the same entropy&nbsp; $H = 0.469 \; \rm bit$&nbsp; as a source with&nbsp; $p_{\rm A} = 0.9$&nbsp; and&nbsp; $p_{\rm B} = 0.1$.
*The difference&nbsp; $ΔH = H_\text{max} - H$ gives&nbsp; the&nbsp; ''redundancy''&nbsp; of the source and&nbsp; $r = ΔH/H_\text{max}$&nbsp; the&nbsp; ''relative redundancy''. &nbsp; In the example,&nbsp; $ΔH = 0.531\; \rm bit$&nbsp; and&nbsp; $r = 53.1 \rm \%$.
+
*The difference&nbsp; $ΔH = H_\text{max} - H$ gives&nbsp; the&nbsp; &raquo;redundancy&laquo;&nbsp; of the source and&nbsp; $r = ΔH/H_\text{max}$&nbsp; the&nbsp; &raquo;relative redundancy&laquo;. &nbsp; In the example,&nbsp; $ΔH = 0.531\; \rm bit$&nbsp; and&nbsp; $r = 53.1 \rm \%$.
*For&nbsp; $p = 0$&nbsp; this results in&nbsp; $H = 0$, since the symbol sequence &nbsp;$\rm B \ B \ B \text{...}$&nbsp; can be predicted with certainty. &nbsp; Actually, the symbol range is now only&nbsp; $M = 1$.&nbsp; The same applies to&nbsp; $p = 1$ &nbsp; &rArr; &nbsp; symbol sequence &nbsp;$\rm A \ A \ A \ text{...}$.
+
*For&nbsp; $p = 0$&nbsp; this results in&nbsp; $H = 0$, since the symbol sequence &nbsp;$\rm B \ B \ B \text{...}$&nbsp; can be predicted with certainty &nbsp; &rArr; &nbsp; symbol set size only&nbsp; $M = 1$.&nbsp; The same applies to&nbsp; $p = 1$ &nbsp; &rArr; &nbsp; symbol sequence &nbsp;$\rm A \ A \ A \text{...}$.
*$H_\text{bin}(p)$&nbsp; is always a&nbsp; ''concave function'', since the second derivative after the parameter&nbsp; $p$&nbsp; is negative for all values of&nbsp; $p$&nbsp;:  
+
*$H_\text{bin}(p)$&nbsp; is always a&nbsp; "concave function",&nbsp; since the second derivative after the parameter&nbsp; $p$&nbsp; is negative for all values of&nbsp; $p$&nbsp;:  
 
:$$\frac{ {\rm d}^2H_{\rm bin} (p)}{ {\rm d}\,p^2} = \frac{- 1}{ {\rm ln}(2) \cdot p \cdot (1-p)}< 0
 
:$$\frac{ {\rm d}^2H_{\rm bin} (p)}{ {\rm d}\,p^2} = \frac{- 1}{ {\rm ln}(2) \cdot p \cdot (1-p)}< 0
 
\hspace{0.05cm}.$$}}
 
\hspace{0.05cm}.$$}}
  
==Message sources with a larger symbol range==   
+
==Non-binary sources==   
 
<br>
 
<br>
In the&nbsp; [[Information_Theory/Sources with Memory#Model_and_Prerequisites|first section]]&nbsp; of this chapter we have a quaternary message source&nbsp; $(M = 4)$&nbsp; with the symbol probabilities&nbsp; $p_{\rm A} = 0. 4$, &nbsp; $p_{\rm B} = 0.3$, &nbsp; $p_{\rm C} = 0.2$ &nbsp; and&nbsp; $ p_{\rm D} = 0.1$&nbsp; considered.&nbsp; This source has the following entropy:
+
In the&nbsp; [[Information_Theory/Sources with Memory#Model_and_Prerequisites|first section]]&nbsp; of this chapter we considered a quaternary message source&nbsp; $(M = 4)$&nbsp; with the symbol probabilities&nbsp; $p_{\rm A} = 0. 4$, &nbsp; $p_{\rm B} = 0.3$, &nbsp; $p_{\rm C} = 0.2$&nbsp; and&nbsp; $ p_{\rm D} = 0.1$.&nbsp; This source has the following entropy:
 
   
 
   
 
:$$H_{\rm quat} = 0.4 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.4} + 0.3 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0. 3} + 0.2 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.2}+ 0.1 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.1}.$$
 
:$$H_{\rm quat} = 0.4 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.4} + 0.3 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0. 3} + 0.2 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.2}+ 0.1 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.1}.$$
  
For numerical calculation, the detour via the decimal logarithm&nbsp; $\lg \ x = {\rm log}_{10} \ x$&nbsp;, is often necessary. Since the ''logarithm dualis''&nbsp; $ {\rm log}_2 \ x$&nbsp; is mostly not found on pocket calculators.
+
For numerical calculation, the detour via the decimal logarithm&nbsp; $\lg \ x = {\rm log}_{10} \ x$&nbsp; is often necessary, since the&nbsp; "logarithm dualis"&nbsp; $ {\rm log}_2 \ x$&nbsp; is mostly not found on pocket calculators.
  
 
:$$H_{\rm quat}=\frac{1}{{\rm lg}\hspace{0.1cm}2} \cdot \left [ 0.4 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.4} + 0.3 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0. 3} + 0.2 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.2} + 0.1 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.1} \right ] = 1.845\,{\rm bit}
 
:$$H_{\rm quat}=\frac{1}{{\rm lg}\hspace{0.1cm}2} \cdot \left [ 0.4 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.4} + 0.3 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0. 3} + 0.2 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.2} + 0.1 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.1} \right ] = 1.845\,{\rm bit}
Line 217: Line 216:
 
You can see from the blue curve for the quaternary source:
 
You can see from the blue curve for the quaternary source:
 
*The maximum entropy&nbsp; $H_\text{max} = 2 \; \rm bit/symbol$&nbsp; results for&nbsp; $p = 0.25$ &nbsp; &rArr; &nbsp; equally probable symbols: &nbsp; $p_{\rm A} = p_{\rm B} = p_{\rm C} = p_{\rm A} = 0.25$.
 
*The maximum entropy&nbsp; $H_\text{max} = 2 \; \rm bit/symbol$&nbsp; results for&nbsp; $p = 0.25$ &nbsp; &rArr; &nbsp; equally probable symbols: &nbsp; $p_{\rm A} = p_{\rm B} = p_{\rm C} = p_{\rm A} = 0.25$.
*With&nbsp; $p = 0$&nbsp; resp.&nbsp; $p = 0.5$&nbsp; the quaternary source degenerates to a binary source with&nbsp; $p_{\rm B} = p_{\rm C} = 0. 5$&nbsp; and&nbsp; $p_{\rm A} = p_{\rm D} = 0$ &nbsp; &rArr; &nbsp; entropy&nbsp; $H = 1 \; \rm bit/symbol$.
+
*With&nbsp; $p = 0$&nbsp; the quaternary source degenerates to a binary source with&nbsp; $p_{\rm B} = p_{\rm C} = 0. 5$, &nbsp; $p_{\rm A} = p_{\rm D} = 0$ &nbsp; &rArr; &nbsp; $H = 1 \; \rm bit/symbol$.&nbsp; Similar applies to $p = 0.5$.  
 
*The source with&nbsp; $p_{\rm A} = p_{\rm D} = 0.1$&nbsp; and&nbsp; $p_{\rm B} = p_{\rm C} = 0.4$&nbsp; has the following characteristics (each with the pseudo unit "bit/symbol"):
 
*The source with&nbsp; $p_{\rm A} = p_{\rm D} = 0.1$&nbsp; and&nbsp; $p_{\rm B} = p_{\rm C} = 0.4$&nbsp; has the following characteristics (each with the pseudo unit "bit/symbol"):
  
Line 226: Line 225:
 
: &nbsp; &nbsp; '''(3)''' &nbsp; relative redundancy: &nbsp; $r ={\rm \delta }H/({\rm log_2}\hspace{0.1cm} M) = 0.139\hspace{0.05cm}.$
 
: &nbsp; &nbsp; '''(3)''' &nbsp; relative redundancy: &nbsp; $r ={\rm \delta }H/({\rm log_2}\hspace{0.1cm} M) = 0.139\hspace{0.05cm}.$
  
*The redundancy of the quaternary source with&nbsp; $p = 0.1$&nbsp; is equal to&nbsp; $ΔH = 0.278 \; \rm bit/symbol$&nbsp; and thus exactly the same as the redundancy of the binary source with&nbsp; $p = 0.2$.}}
+
*The redundancy of the quaternary source with&nbsp; $p = 0.1$&nbsp; is&nbsp; $ΔH = 0.278 \; \rm bit/symbol$ &nbsp; &rArr; &nbsp; exactly the same as the redundancy of the binary source with&nbsp; $p = 0.2$.}}
  
  
Line 234: Line 233:
 
[[Aufgaben:Exercise_1.1:_Entropy_of_the_Weather|Exercise 1.1: Entropy of the Weather]]
 
[[Aufgaben:Exercise_1.1:_Entropy_of_the_Weather|Exercise 1.1: Entropy of the Weather]]
  
[[Aufgaben:1.1Z Binäre Entropiefunktion|Aufgabe 1.1Z: Binäre Entropiefunktion]]
+
[[Aufgaben:Exercise_1.1Z:_Binary_Entropy_Function|Exercise 1.1Z: Binary Entropy Function]]
  
[[Aufgaben:1.2 Entropie von Ternärquellen|Aufgabe 1.2: Entropie von Ternärquellen]]
+
[[Aufgaben:Exercise_1.2:_Entropy_of_Ternary_Sources|Exercise 1.2: Entropy of Ternary Sources]]
  
  

Latest revision as of 15:57, 13 July 2021

# OVERVIEW OF THE FIRST MAIN CHAPTER #


This first chapter describes the calculation and the meaning of entropy.  According to the Shannonian information definition, entropy is a measure of the mean uncertainty about the outcome of a statistical event or the uncertainty in the measurement of a stochastic quantity.  Somewhat casually expressed, the entropy of a random quantity quantifies its "randomness".

In detail are discussed:

  • The  »information content«  of a symbol and the  »entropy«  of a discrete memoryless source,
  • the  »binary entropy function«  and its application to non-binary sources,
  • the entropy calculation for  »sources with memory«  and suitable approximations,
  • the special features of  »Markov sources«  regarding the entropy calculation,
  • the procedure for sources with a large number of symbols, for example  »natural texts«,
  • the  »entropy estimates«  according to Shannon and Küpfmüller.


Model and requirements


We consider a discrete-value message source  $\rm Q$, which gives a sequence  $ \langle q_ν \rangle$  of symbols.

  • For the variable  $ν = 1$, ... , $N$, where  $N$  should be "sufficiently large".
  • Each individual source symbol  $q_ν$  comes from a symbol set  $\{q_μ \}$  where  $μ = 1$, ... , $M$, where  $M$  denotes the symbol set size:
$$q_{\nu} \in \left \{ q_{\mu} \right \}, \hspace{0.25cm}{\rm with}\hspace{0.25cm} \nu = 1, \hspace{0.05cm} \text{ ...}\hspace{0.05cm} , N\hspace{0.25cm}{\rm and}\hspace{0.25cm}\mu = 1,\hspace{0.05cm} \text{ ...}\hspace{0.05cm} , M \hspace{0.05cm}.$$

The figure shows a quaternary message source  $(M = 4)$  with the alphabet  $\rm \{A, \ B, \ C, \ D\}$  and an exemplary sequence of length  $N = 100$.

Quaternary source

The following requirements apply:

  • The quaternary source is fully described by  $M = 4$  symbol probabilities  $p_μ$.  In general it applies:
$$\sum_{\mu = 1}^M \hspace{0.1cm}p_{\mu} = 1 \hspace{0.05cm}.$$
$${\rm Pr} \left (q_{\nu} = q_{\mu} \right ) = {\rm Pr} \left (q_{\nu} = q_{\mu} \hspace{0.03cm} | \hspace{0.03cm} q_{\nu -1}, q_{\nu -2}, \hspace{0.05cm} \text{ ...}\hspace{0.05cm}\right ) \hspace{0.05cm}.$$
  • Since the alphabet consists of symbols  (and not of random variables) , the specification of  expected values  (linear mean, quadratic mean, standard deviation, etc.)  is not possible here, but also not necessary from an information-theoretical point of view.


These properties will now be illustrated with an example.

Relative frequencies as a function of  $N$

$\text{Example 1:}$  For the symbol probabilities of a quaternary source applies:

$$p_{\rm A} = 0.4 \hspace{0.05cm},\hspace{0.2cm}p_{\rm B} = 0.3 \hspace{0.05cm},\hspace{0.2cm}p_{\rm C} = 0.2 \hspace{0.05cm},\hspace{0.2cm} p_{\rm D} = 0.1\hspace{0.05cm}.$$

For an infinitely long sequence  $(N \to \infty)$

  • the  relative frequencies  $h_{\rm A}$,  $h_{\rm B}$,  $h_{\rm C}$,  $h_{\rm D}$   ⇒   a-posteriori parameters
  • were identical to the  probabilities  $p_{\rm A}$,  $p_{\rm B}$,  $p_{\rm C}$,  $p_{\rm D}$   ⇒   a-priori parameters.


With smaller  $N$  deviations may occur, as the adjacent table (result of a simulation) shows.

  • In the graphic above an exemplary sequence is shown with  $N = 100$  symbols.
  • Due to the set elements  $\rm A$,  $\rm B$,  $\rm C$  and  $\rm D$  no mean values can be given.


However, if you replace the symbols with numerical values, for example  $\rm A \Rightarrow 1$,   $\rm B \Rightarrow 2$,   $\rm C \Rightarrow 3$,   $\rm D \Rightarrow 4$, then you will get after
    »time averaging«   ⇒   crossing line     or     »ensemble averaging«   ⇒   expected value formation

$$m_1 = \overline { q_{\nu} } = {\rm E} \big [ q_{\mu} \big ] = 0.4 \cdot 1 + 0.3 \cdot 2 + 0.2 \cdot 3 + 0.1 \cdot 4 = 2 \hspace{0.05cm},$$
$$m_2 = \overline { q_{\nu}^{\hspace{0.05cm}2} } = {\rm E} \big [ q_{\mu}^{\hspace{0.05cm}2} \big ] = 0.4 \cdot 1^2 + 0.3 \cdot 2^2 + 0.2 \cdot 3^2 + 0.1 \cdot 4^2 = 5 \hspace{0.05cm},$$
$$\sigma = \sqrt {m_2 - m_1^2} = \sqrt {5 - 2^2} = 1 \hspace{0.05cm}.$$


Maximum entropy of a discrete source


Claude Elwood Shannon  defined in 1948 in the standard work of information theory  [Sha48][1]  the concept of information as  "decrease of uncertainty about the occurrence of a statistical event".

Let us make a mental experiment with  $M$  possible results, which are all equally probable:   $p_1 = p_2 = \hspace{0.05cm} \text{ ...}\hspace{0.05cm} = p_M = 1/M \hspace{0.05cm}.$

Under this assumption applies:

  • Is  $M = 1$, then each individual attempt will yield the same result and therefore there is no uncertainty about the output.
  • On the other hand, an observer learns about an experiment with  $M = 2$, for example the  "coin toss"  with the set of events  $\big \{\rm \boldsymbol{\rm Z}(ahl), \rm \boldsymbol{\rm W}(app) \big \}$  and the probabilities  $p_{\rm Z} = p_{\rm W} = 0. 5$, a gain in information.  The uncertainty regarding  $\rm Z$  resp.  $\rm W$  is resolved.
  • In the experiment  »dice«  $(M = 6)$  and even more in  »roulette«  $(M = 37)$  the gained information is even more significant for the observer than in the  »coin toss«  when he learns which number was thrown or which ball fell.
  • Finally it should be considered that the experiment  »triple coin toss«  with  $M = 8$  possible results  $\rm ZZZ$,  $\rm ZZW$,  $\rm ZWZ$,  $\rm ZWW$,  $\rm WZZ$,  $\rm WZW$,  $\rm WWZ$,  $\rm WWW$  provides three times the information as the single coin toss  $(M = 2)$.


The following definition fulfills all the requirements listed here for a quantitative information measure for equally probable events, indicated only by the symbol set size  $M$.

$\text{Definition:}$  The  maximum average information content   of a message source depends only on the symbol set size  $M$  and results in

$$H_0 = {\rm log}\hspace{0.1cm}M = {\rm log}_2\hspace{0.1cm}M \hspace{0.15cm} {\rm (in \ “bit")} = {\rm ln}\hspace{0.1cm}M \hspace{0.15cm}\text {(in “nat")} = {\rm lg}\hspace{0.1cm}M \hspace{0.15cm}\text {(in “Hartley")}\hspace{0.05cm}.$$
  • Since  $H_0$  indicates the maximum value of the  entropy  $H$,  $H_\text{max}=H_0$  is also used in our tutorial as short notation.


Please note our nomenclature:

  • The logarithm will be called  »log«  in the following, independent of the base.
  • The relations mentioned above are fulfilled due to the following properties:
$${\rm log}\hspace{0.1cm}1 = 0 \hspace{0.05cm},\hspace{0.2cm} {\rm log}\hspace{0.1cm}37 > {\rm log}\hspace{0.1cm}6 > {\rm log}\hspace{0.1cm}2\hspace{0.05cm},\hspace{0.2cm} {\rm log}\hspace{0.1cm}M^k = k \cdot {\rm log}\hspace{0.1cm}M \hspace{0.05cm}.$$
  • Usually we use the logarithm to the base  $2$   ⇒   »logarithm dualis«    $\rm (ld)$,  where the pseudo unit  "bit"  $($more precisely:  "bit/symbol"$)$  is then added:
$${\rm ld}\hspace{0.1cm}M = {\rm log_2}\hspace{0.1cm}M = \frac{{\rm lg}\hspace{0.1cm}M}{{\rm lg}\hspace{0.1cm}2} = \frac{{\rm ln}\hspace{0.1cm}M}{{\rm ln}\hspace{0.1cm}2} \hspace{0.05cm}.$$
  • In addition, you can find in the literature some additional definitions, which are based on the natural logarithm  $\rm (ln)$  or the logarithm of the tens  $\rm (lg)$.

Information content and entropy


We now waive the previous requirement that all  $M$  possible results of an experiment are equally probable.  In order to keep the spelling as compact as possible, we define for this page only:

$$p_1 > p_2 > \hspace{0.05cm} \text{ ...}\hspace{0.05cm} > p_\mu > \hspace{0.05cm} \text{ ...}\hspace{0.05cm} > p_{M-1} > p_M\hspace{0.05cm},\hspace{0.4cm}\sum_{\mu = 1}^M p_{\mu} = 1 \hspace{0.05cm}.$$

We now consider the information content  of the individual symbols, where we denote the  "logarithm dualis"  with  $\log_2$:

$$I_\mu = {\rm log_2}\hspace{0.1cm}\frac{1}{p_\mu}= -\hspace{0.05cm}{\rm log_2}\hspace{0.1cm}{p_\mu} \hspace{0.5cm}{\rm (unit\hspace{-0.15cm}: \hspace{0.15cm}bit\hspace{0.15cm}or\hspace{0.15cm}bit/Symbol)} \hspace{0.05cm}.$$

You can see:

  • Because of  $p_μ ≤ 1$  the information content is never negative.  In the borderline case  $p_μ \to 1$  goes  $I_μ \to 0$.
  • However, for  $I_μ = 0$   ⇒   $p_μ = 1$   ⇒   $M = 1$  the information content is also  $H_0 = 0$.
  • For decreasing probabilities  $p_μ$  the information content increases continuously:
$$I_1 < I_2 < \hspace{0.05cm} \text{ ...}\hspace{0.05cm} < I_\mu <\hspace{0.05cm} \text{ ...}\hspace{0.05cm} < I_{M-1} < I_M \hspace{0.05cm}.$$

$\text{Conclusion:}$  The more improbable an event is, the greater is its information content.  This fact is also found in daily life:

  • "6 right ones" in the lottery are more likely to be noticed than "3 right ones" or no win at all.
  • A tsunami in Asia also dominates the news in Germany for weeks as opposed to the almost standard Deutsche Bahn delays.
  • A series of defeats of Bayern Munich leads to huge headlines in contrast to a winning series.  With 1860 Munich exactly the opposite is the case.


However, the information content of a single symbol (or event) is not very interesting.  On the other hand one of the central quantities of information theory is obtained,

  • by ensemble averaging over all possible symbols  $q_μ$  bzw. 
  • by time averaging over all elements of the sequence  $\langle q_ν \rangle$.


$\text{Definition:}$  The  entropy  $H$  of a discrete-value source indicates the  mean information content of all symbols:

$$H = \overline{I_\nu} = {\rm E}\hspace{0.01cm}[I_\mu] = \sum_{\mu = 1}^M p_{\mu} \cdot {\rm log_2}\hspace{0.1cm}\frac{1}{p_\mu}= -\sum_{\mu = 1}^M p_{\mu} \cdot{\rm log_2}\hspace{0.1cm}{p_\mu} \hspace{0.5cm}\text{(unit: bit, more precisely: bit/symbol)} \hspace{0.05cm}.$$

The overline marks again a time averaging and  $\rm E[\text{...}]$  an ensemble averaging.


Entropy is among other things a measure for

  • the mean uncertainty about the outcome of a statistical event,
  • the  "randomness"  of this event,  and
  • the average information content of a random variable.


Binary entropy function


At first we will restrict ourselves to the special case  $M = 2$  and consider a binary source, which returns the two symbols  $\rm A$  and  $\rm B$.  The symbol probabilities are   $p_{\rm A} = p$  and   $p_{\rm B} = 1 - p$.

For the entropy of this binary source applies:

$$H_{\rm bin} (p) = p \cdot {\rm log_2}\hspace{0.1cm}\frac{1}{\hspace{0.1cm}p\hspace{0.1cm}} + (1-p) \cdot {\rm log_2}\hspace{0.1cm}\frac{1}{1-p} \hspace{0.5cm}{\rm (unit\hspace{-0.15cm}: \hspace{0.15cm}bit\hspace{0.15cm}or\hspace{0.15cm}bit/symbol)} \hspace{0.05cm}.$$

This function is called  $H_\text{bin}(p)$  the  binary entropy function.  The entropy of a source with a larger symbol set size  $M$  can often be expressed using  $H_\text{bin}(p)$ .

$\text{Example 2:}$  The figure shows the binary entropy function for the values  $0 ≤ p ≤ 1$  of the symbol probability of  $\rm A$  $($or also of  $\rm B)$.  You can see:

Binary entropy function as a function of  $p$
  • The maximum value  $H_\text{max} = 1\; \rm bit$  results for  $p = 0.5$, thus for equally probable binary symbols.  Then   $\rm A$  and  $\rm B$  contribute the same amount to the entropy.
  • $H_\text{bin}(p)$  is symmetrical around  $p = 0.5$.  A source with  $p_{\rm A} = 0.1$  and  $p_{\rm B} = 0. 9$  has the same entropy  $H = 0.469 \; \rm bit$  as a source with  $p_{\rm A} = 0.9$  and  $p_{\rm B} = 0.1$.
  • The difference  $ΔH = H_\text{max} - H$ gives  the  »redundancy«  of the source and  $r = ΔH/H_\text{max}$  the  »relative redundancy«.   In the example,  $ΔH = 0.531\; \rm bit$  and  $r = 53.1 \rm \%$.
  • For  $p = 0$  this results in  $H = 0$, since the symbol sequence  $\rm B \ B \ B \text{...}$  can be predicted with certainty   ⇒   symbol set size only  $M = 1$.  The same applies to  $p = 1$   ⇒   symbol sequence  $\rm A \ A \ A \text{...}$.
  • $H_\text{bin}(p)$  is always a  "concave function",  since the second derivative after the parameter  $p$  is negative for all values of  $p$ :
$$\frac{ {\rm d}^2H_{\rm bin} (p)}{ {\rm d}\,p^2} = \frac{- 1}{ {\rm ln}(2) \cdot p \cdot (1-p)}< 0 \hspace{0.05cm}.$$

Non-binary sources


In the  first section  of this chapter we considered a quaternary message source  $(M = 4)$  with the symbol probabilities  $p_{\rm A} = 0. 4$,   $p_{\rm B} = 0.3$,   $p_{\rm C} = 0.2$  and  $ p_{\rm D} = 0.1$.  This source has the following entropy:

$$H_{\rm quat} = 0.4 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.4} + 0.3 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0. 3} + 0.2 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.2}+ 0.1 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.1}.$$

For numerical calculation, the detour via the decimal logarithm  $\lg \ x = {\rm log}_{10} \ x$  is often necessary, since the  "logarithm dualis"  $ {\rm log}_2 \ x$  is mostly not found on pocket calculators.

$$H_{\rm quat}=\frac{1}{{\rm lg}\hspace{0.1cm}2} \cdot \left [ 0.4 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.4} + 0.3 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0. 3} + 0.2 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.2} + 0.1 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.1} \right ] = 1.845\,{\rm bit} \hspace{0.05cm}.$$

$\text{Example 3:}$  Now there are certain symmetries between the symbol probabilities:

Entropy of binary source and quaternary source
$$p_{\rm A} = p_{\rm D} = p \hspace{0.05cm},\hspace{0.4cm}p_{\rm B} = p_{\rm C} = 0.5 - p \hspace{0.05cm},\hspace{0.3cm}{\rm with} \hspace{0.15cm}0 \le p \le 0.5 \hspace{0.05cm}.$$

In this case, the binary entropy function can be used to calculate the entropy:

$$H_{\rm quat} = 2 \cdot p \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{\hspace{0.1cm}p\hspace{0.1cm} } + 2 \cdot (0.5-p) \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.5-p}$$

$$\Rightarrow \hspace{0.3cm} H_{\rm quat} = 1 + H_{\rm bin}(2p) \hspace{0.05cm}.$$

The graphic shows as a function of  $p$

  • the entropy of the quaternary source (blue)
  • in comparison to the entropy course of the binary source (red).


For the quaternary source only the abscissa  $0 ≤ p ≤ 0.5$  is allowed.
You can see from the blue curve for the quaternary source:

  • The maximum entropy  $H_\text{max} = 2 \; \rm bit/symbol$  results for  $p = 0.25$   ⇒   equally probable symbols:   $p_{\rm A} = p_{\rm B} = p_{\rm C} = p_{\rm A} = 0.25$.
  • With  $p = 0$  the quaternary source degenerates to a binary source with  $p_{\rm B} = p_{\rm C} = 0. 5$,   $p_{\rm A} = p_{\rm D} = 0$   ⇒   $H = 1 \; \rm bit/symbol$.  Similar applies to $p = 0.5$.
  • The source with  $p_{\rm A} = p_{\rm D} = 0.1$  and  $p_{\rm B} = p_{\rm C} = 0.4$  has the following characteristics (each with the pseudo unit "bit/symbol"):
    (1)   entropy:   $H = 1 + H_{\rm bin} (2p) =1 + H_{\rm bin} (0.2) = 1.722,$
    (2)   Redundancy:   ${\rm \Delta }H = {\rm log_2}\hspace{0.1cm} M - H =2- 1.722= 0.278,$
    (3)   relative redundancy:   $r ={\rm \delta }H/({\rm log_2}\hspace{0.1cm} M) = 0.139\hspace{0.05cm}.$
  • The redundancy of the quaternary source with  $p = 0.1$  is  $ΔH = 0.278 \; \rm bit/symbol$   ⇒   exactly the same as the redundancy of the binary source with  $p = 0.2$.


Exercises for the chapter


Exercise 1.1: Entropy of the Weather

Exercise 1.1Z: Binary Entropy Function

Exercise 1.2: Entropy of Ternary Sources


List of sources

  1. Shannon, C.E.: A Mathematical Theory of Communication. In: Bell Syst. Techn. J. 27 (1948), pp. 379-423 and pp. 623-656.