Difference between revisions of "Information Theory/Discrete Memoryless Sources"

From LNTwww
 
(80 intermediate revisions by 12 users not shown)
Line 1: Line 1:
 
{{FirstPage}}
 
{{FirstPage}}
 
{{Header
 
{{Header
|Untermenü=Entropie wertdiskreter Nachrichtenquellen
+
|Untermenü=Entropy of Discrete Sources
 
|Vorherige Seite=
 
|Vorherige Seite=
|Nächste Seite=Nachrichtenquellen mit Gedächtnis
+
|Nächste Seite=Discrete Sources with Memory
 
}}
 
}}
  
== Modell und Voraussetzungen ==  
+
== # OVERVIEW OF THE FIRST MAIN CHAPTER # ==
Wir betrachten eine wertdiskrete Nachrichtenquelle Q, die eine Folge 〈 $q_ν $ 〉 von Symbolen abgibt. Für die Laufvariable gilt $ν$ = 1, ... , $N$, wobei $N$ „hinreichend groß” sein sollte. Jedes einzelne Quellensymbol $q_ν$ entstammt einem Symbolvorrat { $q_μ$ } mit $μ$ = 1, ... , $M$, wobei $M$ den Symbolumfang bezeichnet:
+
<br>
 +
This first chapter describes the calculation and the meaning of entropy.&nbsp; According to the Shannonian information definition,&nbsp; entropy is a measure of the mean uncertainty about the outcome of a statistical event or the uncertainty in the measurement of a stochastic quantity.&nbsp; Somewhat casually expressed,&nbsp; the entropy of a random quantity quantifies its&nbsp; &raquo;randomness&laquo;.  
 +
 
 +
In detail are discussed:
 +
 
 +
#The &nbsp;&raquo;information content&laquo;&nbsp; of a symbol and the &nbsp;&raquo;entropy&laquo;&nbsp; of a discrete memoryless source,
 +
#the &nbsp;&raquo;binary entropy function&laquo;&nbsp; and its application to non-binary sources,
 +
#the entropy calculation for&nbsp; &raquo;sources with memory&laquo;&nbsp; and suitable approximations,
 +
#the special features of&nbsp; &raquo;Markov sources&laquo;&nbsp; regarding the entropy calculation,
 +
#the procedure for sources with a large number of symbols, for example&nbsp; &raquo;natural texts&laquo;,
 +
#the&nbsp; &raquo;entropy estimates&laquo;&nbsp; according to Shannon and Küpfmüller.
 +
 
 +
 
 +
 
 +
== Model and requirements ==
 +
<br>
 +
We consider a  discrete message source&nbsp; $\rm Q$, which gives a sequence&nbsp; $ \langle q_ν \rangle$&nbsp; of symbols.  
 +
*For the variable &nbsp;$ν = 1$, ... , $N$, where&nbsp; $N$&nbsp; should be sufficiently large.
 
   
 
   
$$q_{\nu} \in \left \{ q_{\mu}  \right \}, \hspace{0.15cm}{\rm mit}\hspace{0.15cm} \nu = 1, ... \hspace{0.05cm}, N\hspace{0.15cm}{\rm und}\hspace{0.15cm}\mu = 1, ...\hspace{0.05cm} , M \hspace{0.05cm}.$$
+
*Each individual source symbol &nbsp;$q_ν$&nbsp; comes from a symbol set&nbsp; $\{q_μ \}$&nbsp; where&nbsp; $μ = 1$, ... , $M$.&nbsp; $M$&nbsp; denotes the symbol set size:
 +
 +
:$$q_{\nu} \in \left \{ q_{\mu}  \right \}, \hspace{0.25cm}{\rm with}\hspace{0.25cm} \nu = 1, \hspace{0.05cm} \text{ ...}\hspace{0.05cm} , N\hspace{0.25cm}{\rm and}\hspace{0.25cm}\mu = 1,\hspace{0.05cm} \text{ ...}\hspace{0.05cm} , M \hspace{0.05cm}.$$
  
Die Grafik zeigt eine quaternäre Nachrichtenquelle ( $M= 4) mit dem Alphabet {A, B, C, D}. Rechts ist eine beispielhafte Folge der Länge $N$ = 100 angegeben.
+
The figure shows a quaternary message source&nbsp; $(M = 4)$&nbsp; with alphabet&nbsp; $\rm \{A, \ B, \ C, \ D\}$&nbsp; and an exemplary sequence of length&nbsp; $N = 100$.
  
[[File:P_ID2227__Inf_T_1_1_S1a_neu.png|250px|right|Gedächtnislose quaternäre Nachrichtenquelle]]
+
[[File:EN_Inf_T_1_1_S1a.png|frame|Quaternary source]]
  
Es gelten folgende Voraussetzungen:
+
The following requirements apply:
*Die quaternäre Nachrichtenquelle wird durch $M$ = 4 Symbolwahrscheinlichkeiten $p_μ$ vollständig beschrieben. Allgemein gilt:
+
*The quaternary source is fully described by&nbsp; $M = 4$&nbsp; symbol probabilities&nbsp; $p_μ$.&nbsp; In general it applies:
$$\sum_{\mu = 1}^M \hspace{0.1cm}p_{\mu} = 1 \hspace{0.05cm}.$$
+
:$$\sum_{\mu = 1}^M \hspace{0.1cm}p_{\mu} = 1 \hspace{0.05cm}.$$
*Die Nachrichtenquelle sei gedächtnislos, das heißt, die einzelnen Folgenelemente seien statistisch voneinander unabhängig:
+
*The message source is memoryless, i.e., the individual sequence elements are&nbsp; [[Theory_of_Stochastic_Signals/Statistical Dependence and Independence#General_definition_of_statistical_dependence|&raquo;statistically independent of each other&laquo;]]:
$${\rm Pr} \left (q_{\nu} = q_{\mu} \right ) = {\rm Pr} \left (q_{\nu} = q_{\mu} \hspace{0.03cm} | \hspace{0.03cm} q_{\nu -1}, q_{\nu -2}, ... \right ) \hspace{0.05cm}.$$
+
:$${\rm Pr} \left (q_{\nu} = q_{\mu} \right ) = {\rm Pr} \left (q_{\nu} = q_{\mu} \hspace{0.03cm} | \hspace{0.03cm} q_{\nu -1}, q_{\nu -2}, \hspace{0.05cm} \text{ ...}\hspace{0.05cm}\right ) \hspace{0.05cm}.$$
*Da das Alphabet aus Symbolen (und nicht aus Zufallsgrößen) besteht, ist hier die Angabe von Erwartungswerten (linearer Mittelwert, quadratischer Mittelwert, Streuung, usw.) nicht möglich, aber auch aus informationstheoretischer Sicht nicht nötig.
+
*Since the alphabet consists of symbols&nbsp; $($and not of random variables$)$,&nbsp; the specification of&nbsp; [[Theory_of_Stochastic_Signals/Expected_Values_and_Moments|&raquo;expected values&laquo;]]&nbsp; $($linear mean, second moment, standard deviation, etc.$)$&nbsp; is not possible here,&nbsp; but also not necessary from an information-theoretical point of view.
  
Diese Eigenschaften werden auf der nächsten Seite mit einem Beispiel verdeutlicht.
 
  
+
These properties will now be illustrated with an example.
{{Beispiel}}
+
 
Für die Symbolwahrscheinlichkeiten einer Quaternärquelle gelte:
+
{{GraueBox|TEXT=
 +
[[File:Inf_T_1_1_S1b_vers2.png|right|frame|Relative frequencies as a function of&nbsp; $N$]] 
 +
$\text{Example 1:}$&nbsp;
 +
For the symbol probabilities of a quaternary source applies:
 +
:$$p_{\rm A} = 0.4 \hspace{0.05cm},\hspace{0.2cm}p_{\rm B} = 0.3 \hspace{0.05cm},\hspace{0.2cm}p_{\rm C} = 0.2 \hspace{0.05cm},\hspace{0.2cm}  
 +
p_{\rm D} = 0.1\hspace{0.05cm}.$$
 +
For an infinitely long sequence&nbsp; $(N \to \infty)$
 +
*the&nbsp; [[Theory_of_Stochastic_Signals/From_Random_Experiment_to_Random_Variable#Bernoulli.27s_law_of_large_numbers|&raquo;relative frequencies&laquo;]]&nbsp; $h_{\rm A}$,&nbsp; $h_{\rm B}$,&nbsp; $h_{\rm C}$,&nbsp; $h_{\rm D}$ &nbsp; &rArr; &nbsp; a-posteriori parameters
 
   
 
   
$$p_{\rm A} = 0.4 \hspace{0.05cm},\hspace{0.2cm}p_{\rm B} = 0.3 \hspace{0.05cm},\hspace{0.2cm}p_{\rm C} = 0.2 \hspace{0.05cm},\hspace{0.2cm}
+
*were identical to the&nbsp; [[Theory_of_Stochastic_Signals/Some_Basic_Definitions#Event_and_event_probability|&raquo;probabilities&laquo;]]&nbsp; $p_{\rm A}$,&nbsp; $p_{\rm B}$,&nbsp; $p_{\rm C}$,&nbsp; $p_{\rm D}$ &nbsp; &rArr; &nbsp; a-priori parameters.  
p_{\rm D} = 0.1\hspace{0.05cm}.$$
+
 
 +
 
 +
With smaller&nbsp; $N$&nbsp; deviations may occur, as the adjacent table&nbsp; $($result of a simulation$)$&nbsp; shows.
  
Bei einer unendlich langen Folge ( $N$ → ∞) wären die relativen Häufigkeiten $h_A$, $h_B$, $h_C$ und $h_D$ – also die a–posteriori–Kenngrößen – identisch mit den a–priori–Wahrscheinlichkeiten $p_A$, $p_B$, $p_C$ und $p_D$. Bei kleinerem $N$ kann es aber durchaus zu Abweichungen kommen, wie die folgende Tabelle (Ergebnis einer Simulation) zeigt. Die Folge für $N$ = 100 ist auf der letzten Seite angegeben.
+
*In the graphic above an exemplary sequence is shown with&nbsp; $N = 100$&nbsp; symbols.
 +
 +
*Due to the set elements&nbsp; $\rm A$,&nbsp; $\rm B$,&nbsp; $\rm C$&nbsp; and&nbsp; $\rm D$&nbsp; no mean values can be given.  
  
[[File:P_ID2230__Inf_T_1_1_S1b.png|250px|right|Relative Häufigkeiten in Abhängigkeit von ''N'']]
 
  
Aufgrund der Mengenelemente A, B, C und D können keine Mittelwerte angegeben werden. Ersetzt man die Symbole durch Zahlenwerte, zum Beispiel A 1, B 2, C 3, D 4, so ergeben sich
+
However,&nbsp; if you replace the symbols with numerical values,&nbsp; for example&nbsp; $\rm A \Rightarrow 1$, &nbsp; $\rm B \Rightarrow 2$, &nbsp; $\rm C \Rightarrow 3$, &nbsp; $\rm D \Rightarrow 4$, then you will get after <br> &nbsp; &nbsp; &raquo;time averaging&laquo; &nbsp; &rArr; &nbsp; crossing line &nbsp; &nbsp; or &nbsp; &nbsp; &raquo;ensemble averaging&laquo; &nbsp; &rArr; &nbsp; expected value formation
*für den linearen Mittelwert:
+
*for the&nbsp; [[Theory_of_Stochastic_Signals/Moments_of_a_Discrete_Random_Variable#First_order_moment_.E2.80.93_linear_mean_.E2.80.93_DC_component|&raquo;linear mean&laquo;]] &nbsp; &rArr; &nbsp; &raquo;first order moment&laquo;:
$$m_1 = {\rm E} \left [ q_{\nu} \right ] = {\rm E} \left [ q_{\mu} \right ] = 0.4 \cdot 1 + 0.3 \cdot 2 + 0.2 \cdot 3 + 0.1 \cdot 4
+
:$$m_1 = \overline { q_{\nu} } = {\rm E} \big [ q_{\mu} \big ] = 0.4 \cdot 1 + 0.3 \cdot 2 + 0.2 \cdot 3 + 0.1 \cdot 4
 
= 2 \hspace{0.05cm},$$  
 
= 2 \hspace{0.05cm},$$  
*für den quadratischen Mittelwert:
+
*for the&nbsp; [[Theory_of_Stochastic_Signals/Moments_of_a_Discrete_Random_Variable#Second_order_moment_.E2.80.93_power_.E2.80.93_variance_.E2.80.93_standard_deviation |&raquo;second order moment&laquo;]]:
$$m_2 = {\rm E} \left [ q_{\nu}^{\hspace{0.05cm}2}  \right ] = {\rm E} \left [ q_{\mu}^{\hspace{0.05cm}2} \right ] = 0.4 \cdot 1^2 + 0.3 \cdot 2^2 + 0.2 \cdot 3^2 + 0.1 \cdot 4^2
+
:$$m_2 = \overline { q_{\nu}^{\hspace{0.05cm}2}  } = {\rm E} \big [ q_{\mu}^{\hspace{0.05cm}2} \big ] = 0.4 \cdot 1^2 + 0.3 \cdot 2^2 + 0.2 \cdot 3^2 + 0.1 \cdot 4^2
 
= 5 \hspace{0.05cm},$$
 
= 5 \hspace{0.05cm},$$
*für die Standardabweichung (Streuung) nach dem „Satz von Steiner”:
+
*for the&nbsp; [[Theory_of_Stochastic_Signals/Expected_Values_and_Moments#Some_common_central_moments|&raquo;standard deviation&laquo;]]&nbsp;  according to the&nbsp; &raquo;Theorem of Steiner&laquo;:
$$\sigma = \sqrt {m_2 - m_1^{\hspace{0.05cm}2}} = \sqrt {5 - 2^{\hspace{0.01cm}2}}
+
:$$\sigma = \sqrt {m_2 - m_1^2} = \sqrt {5 - 2^2} = 1 \hspace{0.05cm}.$$}}
= 1 \hspace{0.05cm}.$$
 
  
{{end}}
 
 
 
 
 
  
==Entscheidungsgehalt – Nachrichtengehalt==
+
==Maximum entropy of a discrete source==
Claude E. Shannon definierte 1948 im Standardwerk der Informationstheorie [Sha48] den Informationsbegriff als „Abnahme der Ungewissheit über das Eintreten eines statistischen Ereignisses”. Machen wir hierzu ein gedankliches Experiment mit M möglichen Ergebnissen, die alle gleichwahrscheinlich seien:
+
<br>
 +
[https://en.wikipedia.org/wiki/Claude_Shannon $\text{Claude Elwood Shannon}$]&nbsp; defined in 1948 in the standard work of information theory&nbsp; [Sha48]<ref name='Sha48'>Shannon, C.E.: A Mathematical Theory of Communication. In: Bell Syst. Techn. J. 27 (1948), pp. 379-423 and pp. 623-656.</ref>&nbsp; the concept of information as&nbsp; "decrease of uncertainty about the occurrence of a statistical event".
 +
 
 +
Let us make a mental experiment with&nbsp; $M$&nbsp; possible results, which are all equally probable: &nbsp; $p_1 = p_2 = \hspace{0.05cm} \text{ ...}\hspace{0.05cm} = p_M = 1/M \hspace{0.05cm}.$
 +
 
 +
Under this assumption applies:
 +
*Is&nbsp; $M = 1$, then each individual attempt will yield the same result and therefore there is no uncertainty about the output.
 +
 
 +
*On the other hand, an observer learns about an experiment with&nbsp; $M = 2$, for example the&nbsp; &raquo;coin toss&laquo;&nbsp; with the set of events&nbsp; $\big \{\rm \boldsymbol{\rm  Z}(ahl), \rm \boldsymbol{\rm  W}(app) \big \}$&nbsp; and the probabilities&nbsp; $p_{\rm Z} = p_{\rm W} = 0. 5$, a gain in information.&nbsp; The uncertainty regarding&nbsp; $\rm Z$ &nbsp;resp.&nbsp; $\rm W$&nbsp; is resolved.
  
$$p_1 = p_2 = ... = p_M = 1/M \hspace{0.05cm}.$$
+
*In the experiment&nbsp; &raquo;dice&laquo;&nbsp; $(M = 6)$&nbsp; and even more in&nbsp; &raquo;roulette&laquo;&nbsp;  $(M = 37)$&nbsp; the gained information is even more significant for the observer than in the&nbsp; &raquo;coin toss&laquo;&nbsp; when he learns which number was thrown or which ball fell.
  
Unter dieser Annahme gilt:
+
*Finally it should be considered that the experiment&nbsp; &raquo;triple coin toss&laquo;&nbsp; with&nbsp; $M = 8$&nbsp; possible results&nbsp; $\rm ZZZ$,&nbsp; $\rm ZZW$,&nbsp; $\rm ZWZ$,&nbsp; $\rm ZWW$,&nbsp; $\rm WZZ$,&nbsp; $\rm WZW$,&nbsp; $\rm WWZ$,&nbsp; $\rm WWW$&nbsp; provides three times the information as the single coin toss&nbsp; $(M = 2)$.
*Ist $M$ = 1, so wird jeder einzelne Versuch das gleiche Ergebnis liefern und demzufolge besteht keine Unsicherheit hinsichtlich des Ausgangs. Wird uns das Versuchsergebnis mitgeteilt, so haben wir dadurch natürlich auch keinen Informationsgewinn.
 
*Dagegen erfährt ein Beobachter bei einem Experiment mit $M$ = 2, zum Beispiel dem „Münzwurf” mit der Ereignismenge { '''Z'''(ahl), '''W'''(app) } und den Wahrscheinlichkeiten $p_Z$ = $p_W$ = 0.5, durchaus einen Informationsgewinn. Die Unsicherheit, ob '''Z''' oder '''W''' geworfen wurde, wird aufgelöst.
 
*Beim Experiment „Würfeln” ( $M$ = 6 ) und noch mehr beim Roulette ( $M$ = 37) ist die gewonnene Information für den Beobachter noch deutlich größer als beim „Münzwurf”, wenn er erfährt, welche Zahl gewürfelt bzw. welche Kugel gefallen ist.
 
*Schließlich sollte noch berücksichtigt werden, dass das Experiment „Dreifacher Münzwurf” mit den $M$ = 8 möglichen Ergebnissen '''ZZZ''', '''ZZW''', '''ZWZ''', '''ZWW''', '''WZZ''', '''WZW''', '''WWZ''', '''WWW''' die dreifache Information liefert wie der einfache Münzwurf ( $M$ = 2 ).
 
  
  
Die nachfolgende Festlegung erfüllt alle hier verbal aufgeführten Anforderungen an ein quantitatives Informationsmaß bei gleichwahrscheinlichen Ereignissen, gekennzeichnet durch den Symbolumfang $M$.
+
The following definition fulfills all the requirements listed here for a quantitative information measure for equally probable events, indicated only by the symbol set size&nbsp; $M$.
  
{{Definition}}
+
{{BlaueBox|TEXT= 
Der Entscheidungsgehalt einer Nachrichtenquelle hängt nur vom Symbolumfang $M$ ab und ergibt sich zu
+
$\text{Definition:}$&nbsp; The&nbsp; &raquo;'''maximum average information content'''&laquo; &nbsp; of a message source depends only on the symbol set size&nbsp; $M$&nbsp; and results in
 
   
 
   
$$H_0 = {\rm log}\hspace{0.1cm}M = {\rm log}_2\hspace{0.1cm}M \hspace{0.15cm}{\rm (in \hspace{0.15cm}"bit")}
+
:$$H_0 = {\rm log}\hspace{0.1cm}M = {\rm log}_2\hspace{0.1cm}M \hspace{0.15cm} {\rm (in \ &#8220;bit")}
= {\rm ln}\hspace{0.1cm}M \hspace{0.15cm}{\rm (in \hspace{0.15cm}"nat")}
+
= {\rm ln}\hspace{0.1cm}M \hspace{0.15cm}\text {(in &#8220;nat")}
= {\rm lg}\hspace{0.1cm}M \hspace{0.15cm}{\rm (in \hspace{0.15cm}"Hartley")} \hspace{0.05cm}.$$
+
= {\rm lg}\hspace{0.1cm}M \hspace{0.15cm}\text {(in &#8220;Hartley")}\hspace{0.05cm}.$$
  
Gebräuchlich ist hierfür auch die Bezeichnung ''Nachrichtengehalt''. Da $H_0$ gleichzeitig den Maximalwert der Entropie $H$ angibt, wird in hier teilweise auch $H_\text{max}$ als Kurzzeichen verwendet.
+
*Since&nbsp; $H_0$&nbsp; indicates the maximum value of the&nbsp; [[Information_Theory/Discrete_Memoryless_Sources#Information_content_and_entropy|$\text{entropy}$]]&nbsp; $H$,&nbsp; $H_\text{max}=H_0$&nbsp; is also used in our tutorial as short notation. }}
  
{{end}}
 
  
 
+
Please note our nomenclature:
Anzumerken ist:
+
*The logarithm will be called&nbsp; &raquo;log&laquo;&nbsp; in the following, independent of the base.
*Der Logarithmus wird in unserem Tutorial unabhängig von der Basis mit „log” bezeichnet. Die vier oben aufgestellten Kriterien werden aufgrund folgender Eigenschaften erfüllt:
+
 +
*The relations mentioned above are fulfilled due to the following properties:
 
   
 
   
$${\rm log}\hspace{0.1cm}1 = 0 \hspace{0.05cm},\hspace{0.2cm}
+
:$${\rm log}\hspace{0.1cm}1 = 0 \hspace{0.05cm},\hspace{0.2cm}
 
{\rm log}\hspace{0.1cm}37 > {\rm log}\hspace{0.1cm}6 > {\rm log}\hspace{0.1cm}2\hspace{0.05cm},\hspace{0.2cm}
 
{\rm log}\hspace{0.1cm}37 > {\rm log}\hspace{0.1cm}6 > {\rm log}\hspace{0.1cm}2\hspace{0.05cm},\hspace{0.2cm}
 
{\rm log}\hspace{0.1cm}M^k = k \cdot {\rm log}\hspace{0.1cm}M \hspace{0.05cm}.$$
 
{\rm log}\hspace{0.1cm}M^k = k \cdot {\rm log}\hspace{0.1cm}M \hspace{0.05cm}.$$
  
*Meist verwenden wir den Logarithmus zur Basis 2 ⇒ Logarithmus dualis (ld), wobei dann die Pseudoeinheit „bit” – genauer: „bit/Symbol” – hinzugefügt wird:
+
* Usually we use the logarithm to the base&nbsp; $2$ &nbsp; &nbsp; &raquo;logarithm dualis&laquo;&nbsp; &nbsp; $\rm (ld)$,&nbsp; where the pseudo unit&nbsp; "bit"&nbsp; $($more precisely:&nbsp; "bit/symbol"$)$&nbsp; is then added:
 
   
 
   
$${\rm ld}\hspace{0.1cm}M = {\rm log_2}\hspace{0.1cm}M = \frac{{\rm lg}\hspace{0.1cm}M}{{\rm lg}\hspace{0.1cm}2}
+
:$${\rm ld}\hspace{0.1cm}M = {\rm log_2}\hspace{0.1cm}M = \frac{{\rm lg}\hspace{0.1cm}M}{{\rm lg}\hspace{0.1cm}2}
 
= \frac{{\rm ln}\hspace{0.1cm}M}{{\rm ln}\hspace{0.1cm}2}  
 
= \frac{{\rm ln}\hspace{0.1cm}M}{{\rm ln}\hspace{0.1cm}2}  
 
  \hspace{0.05cm}.$$
 
  \hspace{0.05cm}.$$
  
*Weiter findet man in der Literatur auch Definitionen, basierend auf dem natürlichen Logarithmus („ln”) oder dem Zehnerlogarithmus („lg”) entsprechend obigen Definitionen.
+
*In addition, you can find in the literature some additional definitions, which are based on the natural logarithm&nbsp; $\rm (ln)$&nbsp; or the logarithm of the tens&nbsp; $\rm (lg)$.
 
   
 
   
==Informationsgehalt und Entropie ==
+
==Information content and entropy ==
 
+
<br>
Wir verzichten nun auf die bisherige Voraussetzung, dass alle $M$ möglichen Ergebnisse eines Versuchs gleichwahrscheinlich seien. Im Hinblick auf eine möglichst kompakte Schreibweise legen wir für diese Seite lediglich fest:
+
We now waive the previous requirement that all&nbsp; $M$&nbsp; possible results of an experiment are equally probable.&nbsp; In order to keep the spelling as compact as possible, we define for this section only:
 
   
 
   
$$p_1 > p_2 > ... > p_\mu > ... > p_{M-1} > p_M\hspace{0.05cm},\hspace{0.4cm}\sum_{\mu = 1}^M p_{\mu} = 1 \hspace{0.05cm}.$$
+
:$$p_1 > p_2 > \hspace{0.05cm} \text{ ...}\hspace{0.05cm} > p_\mu > \hspace{0.05cm} \text{ ...}\hspace{0.05cm} > p_{M-1} > p_M\hspace{0.05cm},\hspace{0.4cm}\sum_{\mu = 1}^M p_{\mu} = 1 \hspace{0.05cm}.$$
  
Unter dieser Voraussetzung betrachten wir nun den '''Informationsgehalt''' der einzelnen Symbole, wobei wir den Logarithmus dualis mit „ld”(manchmal auch mit „log2”) bezeichnen :
+
We now consider the &raquo;'''information content'''&laquo;&nbsp; of the individual symbols, where we denote the&nbsp; "logarithm dualis"&nbsp; with&nbsp; $\log_2$:
 
   
 
   
$$I_\mu = {\rm ld}\hspace{0.1cm}\frac{1}{p_\mu}= -\hspace{0.05cm}{\rm ld}\hspace{0.1cm}{p_\mu}
+
:$$I_\mu = {\rm log_2}\hspace{0.1cm}\frac{1}{p_\mu}= -\hspace{0.05cm}{\rm log_2}\hspace{0.1cm}{p_\mu}
\hspace{0.5cm}{\rm (Einheit\hspace{-0.15cm}: \hspace{0.15cm}bit\hspace{0.15cm}oder\hspace{0.15cm}bit/Symbol)}
+
\hspace{0.5cm}{\rm (unit\hspace{-0.15cm}: \hspace{0.15cm}bit\hspace{0.15cm}or\hspace{0.15cm}bit/Symbol)}
 
\hspace{0.05cm}.$$
 
\hspace{0.05cm}.$$
  
Man erkennt:
+
You can see:
*Wegen $p_μ$ ≤ 1 ist der Informationsgehalt nie negativ. Im Grenzfall $p_μ$  →  1 geht $I_μ$ →  0. Allerdings ist für $I_μ$ = 0 →  $p_μ$ = 1  →  $M$ = 1 auch der Entscheidungsgehalt $H_0$ = 0.
+
*Because of&nbsp; $p_μ ≤ 1$&nbsp; the information content is never negative.&nbsp; In the borderline case&nbsp; $p_μ \to 1$&nbsp; goes&nbsp; $I_μ \to 0$.
*Bei abfallenden Wahrscheinlichkeiten $p_μ$ nimmt der Informationsgehalt kontinuierlich zu:
 
 
   
 
   
$$I_1 < I_2 < ... < I_\mu < ... < I_{M-1} < I_M \hspace{0.05cm}.$$
+
*However, for&nbsp; $I_μ = 0$ &nbsp; &rArr; &nbsp; $p_μ = 1$ &nbsp; &rArr; &nbsp; $M = 1$&nbsp; the information content is also&nbsp; $H_0 = 0$.
  
Das heißt: Je weniger wahrscheinlich ein Ereignis ist, desto größer ist sein Informationsgehalt. Dieser Sachverhalt ist auch im täglichen Leben festzustellen:
+
*For decreasing probabilities&nbsp; $p_μ$&nbsp; the information content increases continuously:
*„6 Richtige” im Lotto nimmt man sicher eher war als „3 Richtige” oder gar keinen Gewinn.
+
*Ein Tsunami in Asien dominiert auch die Nachrichten in Deutschland über Wochen im Gegensatz zu den fast standardmäßigen Verspätungen der Deutschen Bahn.
+
:$$I_1 < I_2 < \hspace{0.05cm} \text{ ...}\hspace{0.05cm} < I_\mu <\hspace{0.05cm} \text{ ...}\hspace{0.05cm} < I_{M-1} < I_M \hspace{0.05cm}.$$
*Eine Niederlagenserie von Bayern München führt zu Riesen–Schlagzeilen im Gegensatz zu einer Siegesserie. Bei 1860 München ist genau das Gegenteil der Fall.
 
  
 +
{{BlaueBox|TEXT= 
 +
$\text{Conclusion:}$&nbsp; '''The more improbable an event is, the greater is its information content'''.&nbsp; This fact is also found in daily life:
 +
#"6 right ones" in the lottery are more likely to be noticed than "3 right ones" or no win at all.
 +
#A tsunami in Asia also dominates the news in Germany for weeks as opposed to the almost standard Deutsche Bahn delays.
 +
#A series of defeats of Bayern Munich leads to huge headlines in contrast to a winning series.&nbsp; With 1860 Munich exactly the opposite is the case.}}
  
Der Informationsgehalt eines einzelnen Symbols (oder Ereignisses) ist allerdings nicht sehr interessant. Durch Scharmittelung über alle möglichen Symbole $q_μ$ bzw. durch Zeitmittelung über alle Folgenelemente $q_ν$ erhält man dagegen eine der zentralen Größen der Informationstheorie.
 
  
{{Definition}}
+
However, the information content of a single symbol (or event) is not very interesting.&nbsp; On the other hand one of the central quantities of information theory is obtained,
Die '''Entropie''' einer Quelle gibt den mittleren Informationsgehalt aller Symbole an:
+
*by ensemble averaging over all possible symbols&nbsp; $q_μ$ &nbsp;bzw.&nbsp;
 
   
 
   
$$H = \overline{I_\nu} = {\rm E}\hspace{0.01cm}[I_\mu] = \sum_{\mu = 1}^M p_{\mu} \cdot {\rm ld}\hspace{0.1cm}\frac{1}{p_\mu}=
+
*by time averaging over all elements of the sequence&nbsp; $\langle q_ν \rangle$.
  -\sum_{\mu = 1}^M p_{\mu} \cdot{\rm ld}\hspace{0.1cm}{p_\mu} \hspace{0.5cm}{\rm (Einheit\hspace{-0.15cm}: \hspace{0.15cm}bit[/Symbol])}
+
 
 +
 
 +
{{BlaueBox|TEXT= 
 +
$\text{Definition:}$&nbsp; The&nbsp; &raquo;'''entropy'''&laquo;&nbsp; $H$&nbsp; of a discrete source indicates the&nbsp; &raquo;'''mean information content of all symbols'''&laquo;:
 +
 +
:$$H = \overline{I_\nu} = {\rm E}\hspace{0.01cm}[I_\mu] = \sum_{\mu = 1}^M p_{\mu} \cdot {\rm log_2}\hspace{0.1cm}\frac{1}{p_\mu}=
 +
  -\sum_{\mu = 1}^M p_{\mu} \cdot{\rm log_2}\hspace{0.1cm}{p_\mu} \hspace{0.5cm}\text{(unit: bit, more precisely: bit/symbol)}  
 
\hspace{0.05cm}.$$
 
\hspace{0.05cm}.$$
  
Die überstreichende Linie kennzeichnet eine Zeitmittelung und E[...] eine Scharmittelung.
+
The overline marks again a time averaging and&nbsp; $\rm E[\text{...}]$&nbsp; an ensemble averaging.}}
{{end}}
+
 
 +
 
 +
Entropy is among other things a measure for
 +
*the mean uncertainty about the outcome of a statistical event,
 +
 
 +
*the&nbsp; "randomness"&nbsp; of this event,&nbsp; and
  
Die Entropie ist ein Maß für
+
*the average information content of a random variable.
*die mittlere Unsicherheit über den Ausgang eines statistischen Ereignisses,
+
   
*die „Zufälligkeit” dieses Ereignisses,
 
*den mittleren Informationsgehalt einer Zufallsgröße.  
 
  
==Binäre Entropiefunktion  ==
+
==Binary entropy function ==
 +
<br>
 +
At first we will restrict ourselves to the special case&nbsp; $M = 2$&nbsp; and consider a binary source, which returns the two symbols&nbsp; $\rm A$&nbsp; and&nbsp; $\rm B$.&nbsp; The symbol probabilities are &nbsp; $p_{\rm A} = p$&nbsp; and &nbsp; $p_{\rm B} = 1 - p$.
  
Wir beschränken uns zunächst auf den Sonderfall $M$ = 2 und betrachten eine binäre Quelle, die die beiden Symbole '''A''' und '''B''' abgibt. Die Auftrittwahrscheinlichkeiten seien $p_A$ = $p$ und $p_B$ = 1 – $p$.
+
For the entropy of this binary source applies:  
Für die Entropie dieser Quelle gilt:
 
 
   
 
   
$$H_{\rm bin} (p) = p \cdot {\rm ld}\hspace{0.1cm}\frac{1}{\hspace{0.1cm}p\hspace{0.1cm}} + (1-p) \cdot {\rm ld}\hspace{0.1cm}\frac{1}{1-p} \hspace{0.5cm}{\rm (Einheit\hspace{-0.15cm}: \hspace{0.15cm}bit\hspace{0.15cm}oder\hspace{0.15cm}bit/Symbol)}
+
:$$H_{\rm bin} (p) = p \cdot {\rm log_2}\hspace{0.1cm}\frac{1}{\hspace{0.1cm}p\hspace{0.1cm}} + (1-p) \cdot {\rm log_2}\hspace{0.1cm}\frac{1}{1-p} \hspace{0.5cm}{\rm (unit\hspace{-0.15cm}: \hspace{0.15cm}bit\hspace{0.15cm}or\hspace{0.15cm}bit/symbol)}
 
\hspace{0.05cm}.$$
 
\hspace{0.05cm}.$$
  
Man nennt diese Funktion $H_\text{bin}(p)$ die '''binäre Entropiefunktion'''. Die Entropie einer Quelle mit größerem Symbolumfang $M$ lässt sich häufig unter Verwendung von $H_\text{bin}(p)$ ausdrücken.
+
This function is called&nbsp; $H_\text{bin}(p)$&nbsp; the&nbsp; &raquo;'''binary entropy function'''&laquo;.&nbsp; The entropy of a source with a larger symbol set size&nbsp; $M$&nbsp; can often be expressed using&nbsp; $H_\text{bin}(p)$&nbsp;.
  
[[File:P_ID2229__Inf_T_1_1_S4_neu.png|250px|right|Binäre Entropiefunktion]]
+
{{GraueBox|TEXT= 
 +
$\text{Example 2:}$&nbsp;
 +
The figure shows the binary entropy function for the values&nbsp; $0 ≤ p ≤ 1$&nbsp; of the symbol probability of&nbsp; $\rm A$&nbsp; $($or also of&nbsp; $\rm B)$.&nbsp; You can see:
  
Die Grafik zeigt die Funktion $H_\text{bin}(p)$ für die Werte 0 ≤ $p$ ≤ 1 der Symbolwahrscheinlichkeit von '''A''' (oder '''B'''). Man erkennt:
+
[[File:EN_Inf_T_1_1_S5_v2.png|frame|Binary entropy function as a function of&nbsp; $p$ |right]]
*Der Maximalwert $H_\text{max}$ = 1 bit ergibt sich für $p$ = 0.5, also für gleichwahrscheinliche Binärsymbole. Dann liefern '''A''' und '''B''' jeweils den gleichen Beitrag zur Entropie.
+
*The maximum value&nbsp; $H_\text{max} = 1\; \rm bit$&nbsp; results for&nbsp; $p = 0.5$, thus for equally probable binary symbols.&nbsp; Then &nbsp; $\rm A$&nbsp; and&nbsp; $\rm B$&nbsp; contribute the same amount to the entropy.
* $H_\text{bin}(p)$ ist symmetrisch um $p$ = 0.5. Eine Quelle mit $p_A$ = 0.1 und $p_B$ = 0.9 hat die gleiche Entropie (Zufälligkeit) $H$ = 0.469 bit wie eine Quelle mit $p_A$ = 0.9 und $p_B$ = 0.1.
 
*Die Differenz $ΔH$ = $H_\text{max}$ – $H$ gibt die Redundanz der Quelle an und $r$ = $ΔH/H_\text{max}$ die relative Redundanz. Im genannten Beispiel ergeben sich $ΔH$ = 0.531 bit bzw. $r$ = 53.1%.
 
*Für $p$ = 0 ergibt sich $H$ = 0, da hier die Ausgangsfolge „'''B B B''' ...” sicher vorhersagbar ist. Eigentlich beträgt nun der Symbolumfang nur noch $M$ = 1. Gleiches gilt für $p$ = 1.
 
  
 +
* $H_\text{bin}(p)$&nbsp; is symmetrical around&nbsp; $p = 0.5$.&nbsp; A source with&nbsp; $p_{\rm A} = 0.1$&nbsp; and&nbsp; $p_{\rm B} = 0. 9$&nbsp; has the same entropy&nbsp; $H = 0.469 \; \rm bit$&nbsp; as a source with&nbsp; $p_{\rm A} = 0.9$&nbsp; and&nbsp; $p_{\rm B} = 0.1$.
  
Es sollte noch erwähnt werden, dass die binäre Entropiefunktion ''konkav'' ist, da deren zweite Ableitung nach dem Parameter $p$ für alle Werte von $p$ negativ ist:
+
*The difference&nbsp; $ΔH = H_\text{max} - H$ gives&nbsp; the&nbsp; &raquo;redundancy&laquo;&nbsp; of the source and&nbsp; $r = ΔH/H_\text{max}$&nbsp; the&nbsp; &raquo;relative redundancy&laquo;. &nbsp; In the example,&nbsp; $ΔH = 0.531\; \rm bit$&nbsp; and&nbsp; $r = 53.1 \rm \%$.
 
 
$$\frac{{\rm d}^2H_{\rm bin} (p)}{{\rm d}\,p^2} = \frac{-1}{{\rm ln}(2) \cdot p \cdot (1-p)}< 0
 
\hspace{0.05cm}.$$
 
  
 +
*For&nbsp; $p = 0$&nbsp; this results in&nbsp; $H = 0$, since the symbol sequence &nbsp;$\rm B \ B \ B \text{...}$&nbsp; can be predicted with certainty &nbsp; &rArr; &nbsp; symbol set size only&nbsp; $M = 1$.&nbsp; The same applies to&nbsp; $p = 1$ &nbsp; &rArr; &nbsp; symbol sequence &nbsp;$\rm A \ A \ A \text{...}$.
  
==Nachrichtenquellen mit größerem Symbolumfang== 
+
*$H_\text{bin}(p)$&nbsp; is always a&nbsp; &raquo;concave function&laquo;,&nbsp; since the second derivative after the parameter&nbsp; $p$&nbsp; is negative for all values of&nbsp; $p$&nbsp;:
 +
:$$\frac{ {\rm d}^2H_{\rm bin} (p)}{ {\rm d}\,p^2} = \frac{- 1}{ {\rm ln}(2) \cdot p \cdot (1-p)}< 0
 +
\hspace{0.05cm}.$$}}
  
Auf der ersten Seite dieses Kapitels haben wir eine quaternäre Nachrichtenquelle ($M$ = 4) mit den Symbolwahrscheinlichkeiten $p_A$ = 0.4, $p_B$ = 0.3, $p_C$ = 0.2 und $p_D$ = 0.1 betrachtet. Diese besitzt die folgende Entropie:
+
==Non-binary sources== 
 +
<br>
 +
In the&nbsp; [[Information_Theory/Discrete_Memoryless_Sources#Model_and_requirements|"first section"]]&nbsp; of this chapter we  considered a quaternary message source&nbsp; $(M = 4)$&nbsp; with the symbol probabilities&nbsp; $p_{\rm A} = 0. 4$, &nbsp; $p_{\rm B} = 0.3$, &nbsp; $p_{\rm C} = 0.2$&nbsp; and&nbsp; $ p_{\rm D} = 0.1$.&nbsp; This source has the following entropy:
 
   
 
   
$$\begin{align*}H_{\rm quat} \hspace{-0.1cm} & = \hspace{-0.1cm}  0.4 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.4} + 0.3 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.3} + 0.2 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.2}+ 0.1 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.1}=\\  
+
:$$H_{\rm quat} = 0.4 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.4} + 0.3 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0. 3} + 0.2 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.2}+ 0.1 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.1}.$$
\hspace{-0.1cm} & \hspace{-0.1cm}\frac{1}{{\rm lg}\hspace{0.1cm}2} \cdot \left [ 0.4 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.4} + 0.3 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.3} + 0.2 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.2}+ 0.1 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.1} \right ] = 1.845\,{\rm bit}
+
 
\hspace{0.05cm}.\end{align*}$$
+
For numerical calculation, the detour via the decimal logarithm&nbsp; $\lg \ x = {\rm log}_{10} \ x$&nbsp; is often necessary, since the&nbsp; "logarithm dualis"&nbsp; $ {\rm log}_2 \ x$&nbsp; is mostly not found on pocket calculators.
 +
 
 +
:$$H_{\rm quat}=\frac{1}{{\rm lg}\hspace{0.1cm}2} \cdot \left [ 0.4 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.4} + 0.3 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0. 3} + 0.2 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.2} + 0.1 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.1} \right ] = 1.845\,{\rm bit}
 +
\hspace{0.05cm}.$$
  
Oft ist der Umweg über den Zehnerlogarithmus lg $x$ = log10 $x$ sinnvoll, da meist der Logarithmus dualis log2 $x$ (oder auch ld $x$) auf Taschenrechnern nicht zu finden ist.
+
{{GraueBox|TEXT=
Bestehen zwischen den einzelnen Symbolwahrscheinlichkeiten Symmetrien wie im Beispiel
+
$\text{Example 3:}$&nbsp;
 +
Now there are certain symmetries between the symbol probabilities:
 +
[[File:EN_Inf_T_1_1_S5_v3.png|frame|Entropy of binary source and quaternary source]]
 
   
 
   
$$p_{\rm A} = p_{\rm D} = p \hspace{0.05cm},\hspace{0.2cm}p_{\rm B} = p_{\rm C} = 0.5-p \hspace{0.05cm},\hspace{0.3cm}{\rm mit} \hspace{0.15cm}0 \le p \le 0.5 \hspace{0.05cm},$$
+
:$$p_{\rm A} = p_{\rm D} = p \hspace{0.05cm},\hspace{0.4cm}p_{\rm B} = p_{\rm C} = 0.5 - p \hspace{0.05cm},\hspace{0.3cm}{\rm with} \hspace{0.15cm}0 \le p \le 0.5 \hspace{0.05cm}.$$
  
so kann zur Entropieberechnung auf die binäre Entropiefunktion zurückgegriffen werden:
+
In this case, the binary entropy function can be used to calculate the entropy:
 
   
 
   
$$H_{\rm quat} = 2 \cdot p \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{\hspace{0.1cm}p\hspace{0.1cm}} + 2 \cdot (0.5-p) \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.5-p} = 1 + H_{\rm bin}(2p) \hspace{0.05cm}.$$
+
:$$H_{\rm quat} = 2 \cdot p \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{\hspace{0.1cm}p\hspace{0.1cm} } + 2 \cdot (0.5-p) \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.5-p}$$
 +
$$\Rightarrow \hspace{0.3cm} H_{\rm quat} = 1 + H_{\rm bin}(2p) \hspace{0.05cm}.$$
 +
 
 +
The graphic shows as a function of&nbsp; $p$
 +
*the entropy of the quaternary source (blue)
  
Die Grafik zeigt den Entropieverlauf der Quaternärquelle (blau) im Vergleich zur Binärquelle (rot) abhängig von $p$. Für die Quaternärquelle ist nur der Abszissenbereich 0 ≤ $p$ ≤ 0.5 zulässig.
+
*in comparison to the entropy course of the binary source (red).  
  
[[File:P_ID2231__Inf_T_1_1_S5_neu.png|250px|right|Entropie von Binärquelle und Quaternärquelle]]
 
  
Man erkennt aus der blauen Kurve für die Quaternärquelle:
+
For the quaternary source only the abscissa&nbsp; $0 ≤ p ≤ 0.5$&nbsp; is allowed.
*Die maximale Entropie $H_\text{max}$ = 2 bit ergibt sich für $p$ = 0.25 ⇒  $p_A$ = $p_B$ = $p_C$ = $p_D$ = 0.25, also wieder für gleichwahrscheinliche Symbole.
+
 
*Mit $p$ = 0 bzw. $p$ = 0.5 entartet die Quaternärquelle zu einer Binärquelle mit $p_B$ = $p_C$ = 0.5 bzw. $p_A$ = $p_D$ = 0.5. In diesem Fall ergibt sich die Entropie zu $H$ = 1 bit.
+
&rArr; &nbsp; You can see from the blue curve for the quaternary source:
*Die Quelle mit $p_A$ = $p_D$ = $p$ = 0.1 und $p_B$ = $p_C$ = 0.4 weist folgende Entropie und (relative) Redundanz auf:
+
*The maximum entropy&nbsp; $H_\text{max} = 2 \; \rm bit/symbol$&nbsp; results for&nbsp; $p = 0.25$ &nbsp; &rArr; &nbsp; equally probable symbols: &nbsp; $p_{\rm A} = p_{\rm B} = p_{\rm C} = p_{\rm A} = 0.25$.
 +
 
 +
*With&nbsp; $p = 0$&nbsp; the quaternary source degenerates to a binary source with&nbsp; $p_{\rm B} = p_{\rm C} = 0. 5$, &nbsp; $p_{\rm A} = p_{\rm D} = 0$ &nbsp; &rArr; &nbsp; $H = 1 \; \rm bit/symbol$.&nbsp; Similar applies to $p = 0.5$.
 
   
 
   
$$\begin{align*}H \hspace{-0.1cm} & = \hspace{-0.1cm}  1 + H_{\rm bin} (2p) =1 + H_{\rm bin} (0.2) = 1.722\,{\rm bit}\hspace{0.05cm},\\
+
*The source with&nbsp; $p_{\rm A} = p_{\rm D} = 0.1$&nbsp; and&nbsp; $p_{\rm B} = p_{\rm C} = 0.4$&nbsp; has the following characteristics (each with the pseudo unit "bit/symbol"):
{\rm \Delta }H \hspace{-0.1cm} & = \hspace{-0.1cm} {\rm ld}\hspace{0.1cm} M - H =2\,{\rm bit}- 1.722\,{\rm bit} = 0.278\,{\rm bit}\hspace{0.05cm},\\
+
 
r \hspace{-0.1cm} & = \hspace{-0.1cm} {\rm \Delta }H/({\rm ld}\hspace{0.1cm} M) = 0.139\hspace{0.05cm}.\end{align*}$$
+
: &nbsp; &nbsp; '''(1)''' &nbsp; entropy: &nbsp; $H = 1 + H_{\rm bin} (2p) =1 + H_{\rm bin} (0.2) = 1.722,$
 +
 
 +
: &nbsp; &nbsp; '''(2)''' &nbsp; Redundancy: &nbsp; ${\rm \Delta }H = {\rm log_2}\hspace{0.1cm} M - H =2- 1.722= 0.278,$
 +
 
 +
: &nbsp; &nbsp; '''(3)''' &nbsp; relative redundancy: &nbsp; $r ={\rm \Delta }H/({\rm log_2}\hspace{0.1cm} M) = 0.139\hspace{0.05cm}.$
 +
 
 +
*The redundancy of the quaternary source with&nbsp; $p = 0.1$&nbsp; is&nbsp; $ΔH = 0.278 \; \rm bit/symbol$ &nbsp; &rArr; &nbsp; exactly the same as the redundancy of the binary source with&nbsp; $p = 0.2$.}}
 +
 
 +
 
 +
 
 +
== Exercises for the chapter==
 +
<br>
 +
[[Aufgaben:Exercise_1.1:_Entropy_of_the_Weather|Exercise 1.1: Entropy of the Weather]]
  
Die Redundanz $ΔH$ der Quaternärquelle mit $p$ = 0.1 ist gleich 0.278 bit und damit genau so groß wie die Redundanz der Binärquelle mit $p$ = 0.2.
+
[[Aufgaben:Exercise_1.1Z:_Binary_Entropy_Function|Exercise 1.1Z: Binary Entropy Function]]
Anmerkung: Als Pseudoeinheit ist hier stets „bit” angegeben. Genauer wäre „bit/Symbol”.
 
  
== Aufgaben zu Kapitel 1.1==
+
[[Aufgaben:Exercise_1.2:_Entropy_of_Ternary_Sources|Exercise 1.2: Entropy of Ternary Sources]]
  
  
==Quellen==
+
==References==
 
<references />
 
<references />
  
  
 
{{Display}}
 
{{Display}}

Latest revision as of 15:52, 9 January 2024

# OVERVIEW OF THE FIRST MAIN CHAPTER #


This first chapter describes the calculation and the meaning of entropy.  According to the Shannonian information definition,  entropy is a measure of the mean uncertainty about the outcome of a statistical event or the uncertainty in the measurement of a stochastic quantity.  Somewhat casually expressed,  the entropy of a random quantity quantifies its  »randomness«.

In detail are discussed:

  1. The  »information content«  of a symbol and the  »entropy«  of a discrete memoryless source,
  2. the  »binary entropy function«  and its application to non-binary sources,
  3. the entropy calculation for  »sources with memory«  and suitable approximations,
  4. the special features of  »Markov sources«  regarding the entropy calculation,
  5. the procedure for sources with a large number of symbols, for example  »natural texts«,
  6. the  »entropy estimates«  according to Shannon and Küpfmüller.


Model and requirements


We consider a discrete message source  $\rm Q$, which gives a sequence  $ \langle q_ν \rangle$  of symbols.

  • For the variable  $ν = 1$, ... , $N$, where  $N$  should be sufficiently large.
  • Each individual source symbol  $q_ν$  comes from a symbol set  $\{q_μ \}$  where  $μ = 1$, ... , $M$.  $M$  denotes the symbol set size:
$$q_{\nu} \in \left \{ q_{\mu} \right \}, \hspace{0.25cm}{\rm with}\hspace{0.25cm} \nu = 1, \hspace{0.05cm} \text{ ...}\hspace{0.05cm} , N\hspace{0.25cm}{\rm and}\hspace{0.25cm}\mu = 1,\hspace{0.05cm} \text{ ...}\hspace{0.05cm} , M \hspace{0.05cm}.$$

The figure shows a quaternary message source  $(M = 4)$  with alphabet  $\rm \{A, \ B, \ C, \ D\}$  and an exemplary sequence of length  $N = 100$.

Quaternary source

The following requirements apply:

  • The quaternary source is fully described by  $M = 4$  symbol probabilities  $p_μ$.  In general it applies:
$$\sum_{\mu = 1}^M \hspace{0.1cm}p_{\mu} = 1 \hspace{0.05cm}.$$
$${\rm Pr} \left (q_{\nu} = q_{\mu} \right ) = {\rm Pr} \left (q_{\nu} = q_{\mu} \hspace{0.03cm} | \hspace{0.03cm} q_{\nu -1}, q_{\nu -2}, \hspace{0.05cm} \text{ ...}\hspace{0.05cm}\right ) \hspace{0.05cm}.$$
  • Since the alphabet consists of symbols  $($and not of random variables$)$,  the specification of  »expected values«  $($linear mean, second moment, standard deviation, etc.$)$  is not possible here,  but also not necessary from an information-theoretical point of view.


These properties will now be illustrated with an example.

Relative frequencies as a function of  $N$

$\text{Example 1:}$  For the symbol probabilities of a quaternary source applies:

$$p_{\rm A} = 0.4 \hspace{0.05cm},\hspace{0.2cm}p_{\rm B} = 0.3 \hspace{0.05cm},\hspace{0.2cm}p_{\rm C} = 0.2 \hspace{0.05cm},\hspace{0.2cm} p_{\rm D} = 0.1\hspace{0.05cm}.$$

For an infinitely long sequence  $(N \to \infty)$

  • the  »relative frequencies«  $h_{\rm A}$,  $h_{\rm B}$,  $h_{\rm C}$,  $h_{\rm D}$   ⇒   a-posteriori parameters
  • were identical to the  »probabilities«  $p_{\rm A}$,  $p_{\rm B}$,  $p_{\rm C}$,  $p_{\rm D}$   ⇒   a-priori parameters.


With smaller  $N$  deviations may occur, as the adjacent table  $($result of a simulation$)$  shows.

  • In the graphic above an exemplary sequence is shown with  $N = 100$  symbols.
  • Due to the set elements  $\rm A$,  $\rm B$,  $\rm C$  and  $\rm D$  no mean values can be given.


However,  if you replace the symbols with numerical values,  for example  $\rm A \Rightarrow 1$,   $\rm B \Rightarrow 2$,   $\rm C \Rightarrow 3$,   $\rm D \Rightarrow 4$, then you will get after
    »time averaging«   ⇒   crossing line     or     »ensemble averaging«   ⇒   expected value formation

$$m_1 = \overline { q_{\nu} } = {\rm E} \big [ q_{\mu} \big ] = 0.4 \cdot 1 + 0.3 \cdot 2 + 0.2 \cdot 3 + 0.1 \cdot 4 = 2 \hspace{0.05cm},$$
$$m_2 = \overline { q_{\nu}^{\hspace{0.05cm}2} } = {\rm E} \big [ q_{\mu}^{\hspace{0.05cm}2} \big ] = 0.4 \cdot 1^2 + 0.3 \cdot 2^2 + 0.2 \cdot 3^2 + 0.1 \cdot 4^2 = 5 \hspace{0.05cm},$$
$$\sigma = \sqrt {m_2 - m_1^2} = \sqrt {5 - 2^2} = 1 \hspace{0.05cm}.$$


Maximum entropy of a discrete source


$\text{Claude Elwood Shannon}$  defined in 1948 in the standard work of information theory  [Sha48][1]  the concept of information as  "decrease of uncertainty about the occurrence of a statistical event".

Let us make a mental experiment with  $M$  possible results, which are all equally probable:   $p_1 = p_2 = \hspace{0.05cm} \text{ ...}\hspace{0.05cm} = p_M = 1/M \hspace{0.05cm}.$

Under this assumption applies:

  • Is  $M = 1$, then each individual attempt will yield the same result and therefore there is no uncertainty about the output.
  • On the other hand, an observer learns about an experiment with  $M = 2$, for example the  »coin toss«  with the set of events  $\big \{\rm \boldsymbol{\rm Z}(ahl), \rm \boldsymbol{\rm W}(app) \big \}$  and the probabilities  $p_{\rm Z} = p_{\rm W} = 0. 5$, a gain in information.  The uncertainty regarding  $\rm Z$  resp.  $\rm W$  is resolved.
  • In the experiment  »dice«  $(M = 6)$  and even more in  »roulette«  $(M = 37)$  the gained information is even more significant for the observer than in the  »coin toss«  when he learns which number was thrown or which ball fell.
  • Finally it should be considered that the experiment  »triple coin toss«  with  $M = 8$  possible results  $\rm ZZZ$,  $\rm ZZW$,  $\rm ZWZ$,  $\rm ZWW$,  $\rm WZZ$,  $\rm WZW$,  $\rm WWZ$,  $\rm WWW$  provides three times the information as the single coin toss  $(M = 2)$.


The following definition fulfills all the requirements listed here for a quantitative information measure for equally probable events, indicated only by the symbol set size  $M$.

$\text{Definition:}$  The  »maximum average information content«   of a message source depends only on the symbol set size  $M$  and results in

$$H_0 = {\rm log}\hspace{0.1cm}M = {\rm log}_2\hspace{0.1cm}M \hspace{0.15cm} {\rm (in \ “bit")} = {\rm ln}\hspace{0.1cm}M \hspace{0.15cm}\text {(in “nat")} = {\rm lg}\hspace{0.1cm}M \hspace{0.15cm}\text {(in “Hartley")}\hspace{0.05cm}.$$
  • Since  $H_0$  indicates the maximum value of the  $\text{entropy}$  $H$,  $H_\text{max}=H_0$  is also used in our tutorial as short notation.


Please note our nomenclature:

  • The logarithm will be called  »log«  in the following, independent of the base.
  • The relations mentioned above are fulfilled due to the following properties:
$${\rm log}\hspace{0.1cm}1 = 0 \hspace{0.05cm},\hspace{0.2cm} {\rm log}\hspace{0.1cm}37 > {\rm log}\hspace{0.1cm}6 > {\rm log}\hspace{0.1cm}2\hspace{0.05cm},\hspace{0.2cm} {\rm log}\hspace{0.1cm}M^k = k \cdot {\rm log}\hspace{0.1cm}M \hspace{0.05cm}.$$
  • Usually we use the logarithm to the base  $2$   ⇒   »logarithm dualis«    $\rm (ld)$,  where the pseudo unit  "bit"  $($more precisely:  "bit/symbol"$)$  is then added:
$${\rm ld}\hspace{0.1cm}M = {\rm log_2}\hspace{0.1cm}M = \frac{{\rm lg}\hspace{0.1cm}M}{{\rm lg}\hspace{0.1cm}2} = \frac{{\rm ln}\hspace{0.1cm}M}{{\rm ln}\hspace{0.1cm}2} \hspace{0.05cm}.$$
  • In addition, you can find in the literature some additional definitions, which are based on the natural logarithm  $\rm (ln)$  or the logarithm of the tens  $\rm (lg)$.

Information content and entropy


We now waive the previous requirement that all  $M$  possible results of an experiment are equally probable.  In order to keep the spelling as compact as possible, we define for this section only:

$$p_1 > p_2 > \hspace{0.05cm} \text{ ...}\hspace{0.05cm} > p_\mu > \hspace{0.05cm} \text{ ...}\hspace{0.05cm} > p_{M-1} > p_M\hspace{0.05cm},\hspace{0.4cm}\sum_{\mu = 1}^M p_{\mu} = 1 \hspace{0.05cm}.$$

We now consider the »information content«  of the individual symbols, where we denote the  "logarithm dualis"  with  $\log_2$:

$$I_\mu = {\rm log_2}\hspace{0.1cm}\frac{1}{p_\mu}= -\hspace{0.05cm}{\rm log_2}\hspace{0.1cm}{p_\mu} \hspace{0.5cm}{\rm (unit\hspace{-0.15cm}: \hspace{0.15cm}bit\hspace{0.15cm}or\hspace{0.15cm}bit/Symbol)} \hspace{0.05cm}.$$

You can see:

  • Because of  $p_μ ≤ 1$  the information content is never negative.  In the borderline case  $p_μ \to 1$  goes  $I_μ \to 0$.
  • However, for  $I_μ = 0$   ⇒   $p_μ = 1$   ⇒   $M = 1$  the information content is also  $H_0 = 0$.
  • For decreasing probabilities  $p_μ$  the information content increases continuously:
$$I_1 < I_2 < \hspace{0.05cm} \text{ ...}\hspace{0.05cm} < I_\mu <\hspace{0.05cm} \text{ ...}\hspace{0.05cm} < I_{M-1} < I_M \hspace{0.05cm}.$$

$\text{Conclusion:}$  The more improbable an event is, the greater is its information content.  This fact is also found in daily life:

  1. "6 right ones" in the lottery are more likely to be noticed than "3 right ones" or no win at all.
  2. A tsunami in Asia also dominates the news in Germany for weeks as opposed to the almost standard Deutsche Bahn delays.
  3. A series of defeats of Bayern Munich leads to huge headlines in contrast to a winning series.  With 1860 Munich exactly the opposite is the case.


However, the information content of a single symbol (or event) is not very interesting.  On the other hand one of the central quantities of information theory is obtained,

  • by ensemble averaging over all possible symbols  $q_μ$  bzw. 
  • by time averaging over all elements of the sequence  $\langle q_ν \rangle$.


$\text{Definition:}$  The  »entropy«  $H$  of a discrete source indicates the  »mean information content of all symbols«:

$$H = \overline{I_\nu} = {\rm E}\hspace{0.01cm}[I_\mu] = \sum_{\mu = 1}^M p_{\mu} \cdot {\rm log_2}\hspace{0.1cm}\frac{1}{p_\mu}= -\sum_{\mu = 1}^M p_{\mu} \cdot{\rm log_2}\hspace{0.1cm}{p_\mu} \hspace{0.5cm}\text{(unit: bit, more precisely: bit/symbol)} \hspace{0.05cm}.$$

The overline marks again a time averaging and  $\rm E[\text{...}]$  an ensemble averaging.


Entropy is among other things a measure for

  • the mean uncertainty about the outcome of a statistical event,
  • the  "randomness"  of this event,  and
  • the average information content of a random variable.


Binary entropy function


At first we will restrict ourselves to the special case  $M = 2$  and consider a binary source, which returns the two symbols  $\rm A$  and  $\rm B$.  The symbol probabilities are   $p_{\rm A} = p$  and   $p_{\rm B} = 1 - p$.

For the entropy of this binary source applies:

$$H_{\rm bin} (p) = p \cdot {\rm log_2}\hspace{0.1cm}\frac{1}{\hspace{0.1cm}p\hspace{0.1cm}} + (1-p) \cdot {\rm log_2}\hspace{0.1cm}\frac{1}{1-p} \hspace{0.5cm}{\rm (unit\hspace{-0.15cm}: \hspace{0.15cm}bit\hspace{0.15cm}or\hspace{0.15cm}bit/symbol)} \hspace{0.05cm}.$$

This function is called  $H_\text{bin}(p)$  the  »binary entropy function«.  The entropy of a source with a larger symbol set size  $M$  can often be expressed using  $H_\text{bin}(p)$ .

$\text{Example 2:}$  The figure shows the binary entropy function for the values  $0 ≤ p ≤ 1$  of the symbol probability of  $\rm A$  $($or also of  $\rm B)$.  You can see:

Binary entropy function as a function of  $p$
  • The maximum value  $H_\text{max} = 1\; \rm bit$  results for  $p = 0.5$, thus for equally probable binary symbols.  Then   $\rm A$  and  $\rm B$  contribute the same amount to the entropy.
  • $H_\text{bin}(p)$  is symmetrical around  $p = 0.5$.  A source with  $p_{\rm A} = 0.1$  and  $p_{\rm B} = 0. 9$  has the same entropy  $H = 0.469 \; \rm bit$  as a source with  $p_{\rm A} = 0.9$  and  $p_{\rm B} = 0.1$.
  • The difference  $ΔH = H_\text{max} - H$ gives  the  »redundancy«  of the source and  $r = ΔH/H_\text{max}$  the  »relative redundancy«.   In the example,  $ΔH = 0.531\; \rm bit$  and  $r = 53.1 \rm \%$.
  • For  $p = 0$  this results in  $H = 0$, since the symbol sequence  $\rm B \ B \ B \text{...}$  can be predicted with certainty   ⇒   symbol set size only  $M = 1$.  The same applies to  $p = 1$   ⇒   symbol sequence  $\rm A \ A \ A \text{...}$.
  • $H_\text{bin}(p)$  is always a  »concave function«,  since the second derivative after the parameter  $p$  is negative for all values of  $p$ :
$$\frac{ {\rm d}^2H_{\rm bin} (p)}{ {\rm d}\,p^2} = \frac{- 1}{ {\rm ln}(2) \cdot p \cdot (1-p)}< 0 \hspace{0.05cm}.$$

Non-binary sources


In the  "first section"  of this chapter we considered a quaternary message source  $(M = 4)$  with the symbol probabilities  $p_{\rm A} = 0. 4$,   $p_{\rm B} = 0.3$,   $p_{\rm C} = 0.2$  and  $ p_{\rm D} = 0.1$.  This source has the following entropy:

$$H_{\rm quat} = 0.4 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.4} + 0.3 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0. 3} + 0.2 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.2}+ 0.1 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.1}.$$

For numerical calculation, the detour via the decimal logarithm  $\lg \ x = {\rm log}_{10} \ x$  is often necessary, since the  "logarithm dualis"  $ {\rm log}_2 \ x$  is mostly not found on pocket calculators.

$$H_{\rm quat}=\frac{1}{{\rm lg}\hspace{0.1cm}2} \cdot \left [ 0.4 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.4} + 0.3 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0. 3} + 0.2 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.2} + 0.1 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.1} \right ] = 1.845\,{\rm bit} \hspace{0.05cm}.$$

$\text{Example 3:}$  Now there are certain symmetries between the symbol probabilities:

Entropy of binary source and quaternary source
$$p_{\rm A} = p_{\rm D} = p \hspace{0.05cm},\hspace{0.4cm}p_{\rm B} = p_{\rm C} = 0.5 - p \hspace{0.05cm},\hspace{0.3cm}{\rm with} \hspace{0.15cm}0 \le p \le 0.5 \hspace{0.05cm}.$$

In this case, the binary entropy function can be used to calculate the entropy:

$$H_{\rm quat} = 2 \cdot p \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{\hspace{0.1cm}p\hspace{0.1cm} } + 2 \cdot (0.5-p) \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.5-p}$$

$$\Rightarrow \hspace{0.3cm} H_{\rm quat} = 1 + H_{\rm bin}(2p) \hspace{0.05cm}.$$

The graphic shows as a function of  $p$

  • the entropy of the quaternary source (blue)
  • in comparison to the entropy course of the binary source (red).


For the quaternary source only the abscissa  $0 ≤ p ≤ 0.5$  is allowed.

⇒   You can see from the blue curve for the quaternary source:

  • The maximum entropy  $H_\text{max} = 2 \; \rm bit/symbol$  results for  $p = 0.25$   ⇒   equally probable symbols:   $p_{\rm A} = p_{\rm B} = p_{\rm C} = p_{\rm A} = 0.25$.
  • With  $p = 0$  the quaternary source degenerates to a binary source with  $p_{\rm B} = p_{\rm C} = 0. 5$,   $p_{\rm A} = p_{\rm D} = 0$   ⇒   $H = 1 \; \rm bit/symbol$.  Similar applies to $p = 0.5$.
  • The source with  $p_{\rm A} = p_{\rm D} = 0.1$  and  $p_{\rm B} = p_{\rm C} = 0.4$  has the following characteristics (each with the pseudo unit "bit/symbol"):
    (1)   entropy:   $H = 1 + H_{\rm bin} (2p) =1 + H_{\rm bin} (0.2) = 1.722,$
    (2)   Redundancy:   ${\rm \Delta }H = {\rm log_2}\hspace{0.1cm} M - H =2- 1.722= 0.278,$
    (3)   relative redundancy:   $r ={\rm \Delta }H/({\rm log_2}\hspace{0.1cm} M) = 0.139\hspace{0.05cm}.$
  • The redundancy of the quaternary source with  $p = 0.1$  is  $ΔH = 0.278 \; \rm bit/symbol$   ⇒   exactly the same as the redundancy of the binary source with  $p = 0.2$.


Exercises for the chapter


Exercise 1.1: Entropy of the Weather

Exercise 1.1Z: Binary Entropy Function

Exercise 1.2: Entropy of Ternary Sources


References

  1. Shannon, C.E.: A Mathematical Theory of Communication. In: Bell Syst. Techn. J. 27 (1948), pp. 379-423 and pp. 623-656.