Information Theory/Discrete Memoryless Sources

# OVERVIEW OF THE FIRST MAIN CHAPTER #


This first chapter describes the calculation and the meaning of entropy.  According to Shannon's definition of information,  entropy is a measure of the mean uncertainty about the outcome of a statistical event,  or of the uncertainty in the measurement of a stochastic quantity.  Somewhat casually expressed,  the entropy of a random quantity quantifies its  »randomness«.

The following topics are discussed in detail:

  1. The  »information content«  of a symbol and the  »entropy«  of a discrete memoryless source,
  2. the  »binary entropy function«  and its application to non-binary sources,
  3. the entropy calculation for  »sources with memory«  and suitable approximations,
  4. the special features of  »Markov sources«  regarding the entropy calculation,
  5. the procedure for sources with a large number of symbols, for example  »natural texts«,
  6. the  »entropy estimates«  according to Shannon and Küpfmüller.


Model and requirements


We consider a discrete message source  \rm Q  that emits a sequence  \langle q_ν \rangle  of symbols.

  • For the running variable,  ν = 1, ... , N  holds, where  N  should be sufficiently large.
  • Each individual source symbol  q_ν  comes from a symbol set  \{q_μ \}  with  μ = 1, ... , M;  here  M  denotes the symbol set size:
q_{\nu} \in \left \{ q_{\mu} \right \}, \hspace{0.25cm}{\rm with}\hspace{0.25cm} \nu = 1, \hspace{0.05cm} \text{ ...}\hspace{0.05cm} , N\hspace{0.25cm}{\rm and}\hspace{0.25cm}\mu = 1,\hspace{0.05cm} \text{ ...}\hspace{0.05cm} , M \hspace{0.05cm}.

The figure shows a quaternary message source  (M = 4)  with alphabet  \rm \{A, \ B, \ C, \ D\}  and an exemplary sequence of length  N = 100.

Quaternary source

The following requirements apply:

  • The quaternary source is fully described by  M = 4  symbol probabilities  p_μ.  In general, the following holds:
\sum_{\mu = 1}^M \hspace{0.1cm}p_{\mu} = 1 \hspace{0.05cm}.
  • The message source is memoryless, i.e., the individual sequence elements are  »statistically independent«  of each other:
{\rm Pr} \left (q_{\nu} = q_{\mu} \right ) = {\rm Pr} \left (q_{\nu} = q_{\mu} \hspace{0.03cm} | \hspace{0.03cm} q_{\nu -1}, q_{\nu -2}, \hspace{0.05cm} \text{ ...}\hspace{0.05cm}\right ) \hspace{0.05cm}.
  • Since the alphabet consists of symbols  (and not of random variables),  the specification of  »expected values«  (linear mean, second moment, standard deviation, etc.)  is not possible here,  but also not necessary from an information-theoretical point of view.
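
The memoryless property can be made concrete with a few lines of Python.  The following sketch is not part of the original article;  the function name emit_sequence and the use of the probabilities from Example 1 below are assumptions for illustration only.  Since every symbol is drawn independently of its predecessors, the generated source is memoryless in the sense of the equation above.

    import random
    from collections import Counter

    # Quaternary memoryless source: alphabet and (assumed) probabilities of Example 1
    symbols       = ['A', 'B', 'C', 'D']
    probabilities = [0.4, 0.3, 0.2, 0.1]     # p_1 ... p_M, summing to 1

    def emit_sequence(n, seed=None):
        """Draw n symbols, each one independently of all previous ones (memoryless source)."""
        rng = random.Random(seed)
        return rng.choices(symbols, weights=probabilities, k=n)

    # The relative frequencies h_mu approach the probabilities p_mu for large n
    for n in (100, 100_000):
        counts = Counter(emit_sequence(n, seed=1))
        print(n, {s: round(counts[s] / n, 3) for s in symbols})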


These properties will now be illustrated with an example.

Relative frequencies as a function of  N

\text{Example 1:}  For the symbol probabilities of a quaternary source applies:

p_{\rm A} = 0.4 \hspace{0.05cm},\hspace{0.2cm}p_{\rm B} = 0.3 \hspace{0.05cm},\hspace{0.2cm}p_{\rm C} = 0.2 \hspace{0.05cm},\hspace{0.2cm} p_{\rm D} = 0.1\hspace{0.05cm}.

For an infinitely long sequence  (N \to \infty),

  • the  »relative frequencies«  h_{\rm A},  h_{\rm B},  h_{\rm C},  h_{\rm D}   ⇒   a-posteriori parameters
  • would be identical to the  »probabilities«  p_{\rm A},  p_{\rm B},  p_{\rm C},  p_{\rm D}   ⇒   a-priori parameters.


For smaller  N,  however, deviations may occur, as the adjacent table  (the result of a simulation)  shows.

  • In the graphic above, an exemplary sequence with  N = 100  symbols is shown.
  • Since the set elements  \rm A,  \rm B,  \rm C  and  \rm D  are symbols rather than numbers, no mean values can be given directly.


However,  if the symbols are replaced by numerical values,  for example  \rm A \Rightarrow 1,   \rm B \Rightarrow 2,   \rm C \Rightarrow 3,   \rm D \Rightarrow 4,  one obtains after
    »time averaging«   ⇒   overline notation     or     »ensemble averaging«   ⇒   expected value formation:

  • for the  »linear mean«  (first order moment):   m_1 = \overline { q_{\nu} } = {\rm E} \big [ q_{\mu} \big ] = 0.4 \cdot 1 + 0.3 \cdot 2 + 0.2 \cdot 3 + 0.1 \cdot 4 = 2 \hspace{0.05cm},
  • for the  »second order moment«:   m_2 = \overline { q_{\nu}^{\hspace{0.05cm}2} } = {\rm E} \big [ q_{\mu}^{\hspace{0.05cm}2} \big ] = 0.4 \cdot 1^2 + 0.3 \cdot 2^2 + 0.2 \cdot 3^2 + 0.1 \cdot 4^2 = 5 \hspace{0.05cm},
  • for the  »standard deviation«  according to Steiner's theorem:   \sigma = \sqrt {m_2 - m_1^2} = \sqrt {5 - 2^2} = 1 \hspace{0.05cm}.
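
These three moments can be verified with a short ensemble-averaging sketch  (again only an illustration; the dictionary p and the  \rm A \Rightarrow 1, ..., \rm D \Rightarrow 4  mapping are assumptions):

    # Ensemble averaging after mapping A -> 1, B -> 2, C -> 3, D -> 4
    p = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}                 # symbol probabilities p_mu

    m1    = sum(prob * x    for x, prob in p.items())    # linear mean (first order moment)
    m2    = sum(prob * x**2 for x, prob in p.items())    # second order moment
    sigma = (m2 - m1**2) ** 0.5                          # standard deviation (Steiner's theorem)

    print(m1, m2, sigma)                                 # approximately 2, 5, 1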


Maximum entropy of a discrete source


In 1948,  \text{Claude Elwood Shannon}  defined, in the standard work of information theory  [Sha48][1],  the concept of information as the  "decrease of uncertainty about the occurrence of a statistical event".

Let us conduct a thought experiment with  M  possible results, all of which are equally probable:   p_1 = p_2 = \hspace{0.05cm} \text{ ...}\hspace{0.05cm} = p_M = 1/M \hspace{0.05cm}.

Under this assumption, the following holds:

  • If  M = 1,  each individual trial yields the same result, so there is no uncertainty about the outcome.  If we are told the result, we obviously gain no information from it either.
  • On the other hand, an observer does gain information from an experiment with  M = 2,  for example the  »coin toss«  with the event set  \big \{\rm \boldsymbol{\rm Z}, \ \rm \boldsymbol{\rm W} \big \}  ("Zahl"  and  "Wappen",  the two sides of a coin)  and the probabilities  p_{\rm Z} = p_{\rm W} = 0.5.  The uncertainty as to whether  \rm Z  or  \rm W  was thrown is resolved.
  • In the experiment  »rolling a die«  (M = 6),  and even more so in  »roulette«  (M = 37),  the information gained by the observer on learning which number was rolled or on which number the ball landed is significantly greater than for the  »coin toss«.
  • Finally, it should be noted that the experiment  »triple coin toss«  with its  M = 8  possible results  \rm ZZZ,  \rm ZZW,  \rm ZWZ,  \rm ZWW,  \rm WZZ,  \rm WZW,  \rm WWZ,  \rm WWW  provides three times as much information as the single coin toss  (M = 2).


The following definition fulfills all the requirements listed above for a quantitative information measure for equally probable events, which are characterized solely by the symbol set size  M.

\text{Definition:}  The  »maximum average information content«  of a message source depends only on the symbol set size  M  and is given by

H_0 = {\rm log}\hspace{0.1cm}M = {\rm log}_2\hspace{0.1cm}M \hspace{0.15cm} \text{(in "bit")} = {\rm ln}\hspace{0.1cm}M \hspace{0.15cm}\text{(in "nat")} = {\rm lg}\hspace{0.1cm}M \hspace{0.15cm}\text{(in "Hartley")}\hspace{0.05cm}.
  • Since  H_0  indicates the maximum value of the  \text{entropy}  H,  the short notation  H_\text{max} = H_0  is also used in this tutorial.


Please note our nomenclature:

  • The logarithm will be called  »log«  in the following, independent of the base.
  • The relations mentioned above are fulfilled due to the following properties:
{\rm log}\hspace{0.1cm}1 = 0 \hspace{0.05cm},\hspace{0.2cm} {\rm log}\hspace{0.1cm}37 > {\rm log}\hspace{0.1cm}6 > {\rm log}\hspace{0.1cm}2\hspace{0.05cm},\hspace{0.2cm} {\rm log}\hspace{0.1cm}M^k = k \cdot {\rm log}\hspace{0.1cm}M \hspace{0.05cm}.
  • Usually we use the logarithm to the base  2   ⇒   »logarithm dualis«    \rm (ld),  where the pseudo unit  "bit"  (more precisely:  "bit/symbol")  is then added:
{\rm ld}\hspace{0.1cm}M = {\rm log_2}\hspace{0.1cm}M = \frac{{\rm lg}\hspace{0.1cm}M}{{\rm lg}\hspace{0.1cm}2} = \frac{{\rm ln}\hspace{0.1cm}M}{{\rm ln}\hspace{0.1cm}2} \hspace{0.05cm}.
  • In the literature you will also find definitions based on the natural logarithm  \rm (ln)  or on the base-10 logarithm  \rm (lg),  corresponding to the equation above.
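
As a small illustration  (not part of the article;  the helper name decision_content is an assumption),  the maximum average information content  H_0 = log M  can be evaluated in all three pseudo units at once:

    import math

    def decision_content(M):
        """H_0 = log M of a source with M equally probable symbols, in three units."""
        return {'bit':     math.log2(M),    # logarithm dualis
                'nat':     math.log(M),     # natural logarithm
                'Hartley': math.log10(M)}   # decimal logarithm

    print(decision_content(2))     # coin toss:  1 bit
    print(decision_content(6))     # die:        about 2.585 bit
    print(decision_content(37))    # roulette:   about 5.209 bit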

Information content and entropy


We now drop the previous requirement that all  M  possible results of an experiment be equally probable.  In order to keep the notation as compact as possible, we stipulate for this section only:

p_1 > p_2 > \hspace{0.05cm} \text{ ...}\hspace{0.05cm} > p_\mu > \hspace{0.05cm} \text{ ...}\hspace{0.05cm} > p_{M-1} > p_M\hspace{0.05cm},\hspace{0.4cm}\sum_{\mu = 1}^M p_{\mu} = 1 \hspace{0.05cm}.

We now consider the »information content«  of the individual symbols, where we denote the  "logarithm dualis"  with  \log_2:

I_\mu = {\rm log_2}\hspace{0.1cm}\frac{1}{p_\mu}= -\hspace{0.05cm}{\rm log_2}\hspace{0.1cm}{p_\mu} \hspace{0.5cm}\text{(unit: bit or bit/symbol)} \hspace{0.05cm}.

You can see:

  • Because of  p_μ ≤ 1,  the information content is never negative.  In the limiting case  p_μ \to 1,  we get  I_μ \to 0.
  • However,  I_μ = 0  requires  p_μ = 1   ⇒   M = 1;  in this case the maximum average information content is also  H_0 = 0.
  • For decreasing probabilities  p_μ  the information content increases continuously:
I_1 < I_2 < \hspace{0.05cm} \text{ ...}\hspace{0.05cm} < I_\mu <\hspace{0.05cm} \text{ ...}\hspace{0.05cm} < I_{M-1} < I_M \hspace{0.05cm}.

\text{Conclusion:}  The more improbable an event is, the greater its information content.  This can also be observed in daily life:

  1. "6 right ones" in the lottery are more likely to be noticed than "3 right ones" or no win at all.
  2. A tsunami in Asia also dominates the news in Germany for weeks as opposed to the almost standard Deutsche Bahn delays.
  3. A series of defeats of Bayern Munich leads to huge headlines in contrast to a winning series.  With 1860 Munich exactly the opposite is the case.


However, the information content of a single symbol  (or event)  is not very interesting.  One of the central quantities of information theory, on the other hand, is obtained

  • by ensemble averaging over all possible symbols  q_μ,  or
  • by time averaging over all elements of the sequence  \langle q_ν \rangle.


\text{Definition:}  The  »entropy«  H  of a discrete source indicates the  »mean information content of all symbols«:

H = \overline{I_\nu} = {\rm E}\hspace{0.01cm}[I_\mu] = \sum_{\mu = 1}^M p_{\mu} \cdot {\rm log_2}\hspace{0.1cm}\frac{1}{p_\mu}= -\sum_{\mu = 1}^M p_{\mu} \cdot{\rm log_2}\hspace{0.1cm}{p_\mu} \hspace{0.5cm}\text{(unit: bit, more precisely: bit/symbol)} \hspace{0.05cm}.

The overline again denotes a time average and  \rm E[\text{...}]  an ensemble average.


Entropy is, among other things, a measure of

  • the mean uncertainty about the outcome of a statistical event,
  • the  "randomness"  of this event,  and
  • the average information content of a random variable.
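
The definition translates directly into a few lines of Python.  This is only a sketch  (the function names information_content and entropy are assumptions);  it computes  H  as the expected value of the information content  I_μ:

    import math

    def information_content(p_mu):
        """Information content I_mu = log2(1/p_mu) of a symbol with probability p_mu, in bit."""
        return -math.log2(p_mu)

    def entropy(probabilities):
        """Entropy H = E[I_mu] of a discrete memoryless source, in bit/symbol.
        Symbols with p_mu = 0 contribute nothing, since p*log2(1/p) -> 0 for p -> 0."""
        return sum(p * information_content(p) for p in probabilities if p > 0)

    print(entropy([0.5, 0.5]))              # equally probable binary symbols: 1 bit/symbol
    print(entropy([0.4, 0.3, 0.2, 0.1]))    # quaternary source of Example 1: approx. 1.85 bit/symbol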


Binary entropy function


First we restrict ourselves to the special case  M = 2  and consider a binary source that emits the two symbols  \rm A  and  \rm B.  The symbol probabilities are   p_{\rm A} = p  and   p_{\rm B} = 1 - p.

For the entropy of this binary source, the following applies:

H_{\rm bin} (p) = p \cdot {\rm log_2}\hspace{0.1cm}\frac{1}{\hspace{0.1cm}p\hspace{0.1cm}} + (1-p) \cdot {\rm log_2}\hspace{0.1cm}\frac{1}{1-p} \hspace{0.5cm}\text{(unit: bit or bit/symbol)} \hspace{0.05cm}.

This function  H_\text{bin}(p)  is called the  »binary entropy function«.  The entropy of a source with a larger symbol set size  M  can often be expressed using  H_\text{bin}(p).

\text{Example 2:}  The figure shows the binary entropy function for values  0 ≤ p ≤ 1  of the symbol probability of  \rm A  (or, equivalently, of  \rm B).  One can see:

Binary entropy function as a function of  p
  • The maximum value  H_\text{max} = 1\; \rm bit  results for  p = 0.5, thus for equally probable binary symbols.  Then   \rm A  and  \rm B  contribute the same amount to the entropy.
  • H_\text{bin}(p)  is symmetrical around  p = 0.5.  A source with  p_{\rm A} = 0.1  and  p_{\rm B} = 0.9  has the same entropy  H = 0.469 \; \rm bit  as a source with  p_{\rm A} = 0.9  and  p_{\rm B} = 0.1.
  • The difference  ΔH = H_\text{max} - H gives  the  »redundancy«  of the source and  r = ΔH/H_\text{max}  the  »relative redundancy«.   In the example,  ΔH = 0.531\; \rm bit  and  r = 53.1 \rm \%.
  • For  p = 0  this results in  H = 0, since the symbol sequence  \rm B \ B \ B \text{...}  can be predicted with certainty   ⇒   symbol set size only  M = 1.  The same applies to  p = 1   ⇒   symbol sequence  \rm A \ A \ A \text{...}.
  • H_\text{bin}(p)  is always a  »concave function«,  since its second derivative with respect to the parameter  p  is negative for all values of  p:
\frac{ {\rm d}^2H_{\rm bin} (p)}{ {\rm d}\,p^2} = \frac{- 1}{ {\rm ln}(2) \cdot p \cdot (1-p)}< 0 \hspace{0.05cm}.
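
A short numerical check of this example  (sketch only;  h_bin is an assumed helper name)  confirms the maximum, the symmetry and the redundancy values stated above:

    import math

    def h_bin(p):
        """Binary entropy function H_bin(p) in bit; H_bin(0) = H_bin(1) = 0."""
        if p in (0.0, 1.0):
            return 0.0
        return p * math.log2(1/p) + (1 - p) * math.log2(1/(1 - p))

    print(h_bin(0.5))                  # maximum:  1.0 bit
    print(h_bin(0.1), h_bin(0.9))      # symmetry: both approx. 0.469 bit
    delta_H = 1.0 - h_bin(0.1)         # redundancy for p = 0.1  (H_max = 1 bit)
    print(delta_H, delta_H / 1.0)      # approx. 0.531 bit, relative redundancy approx. 53.1 %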

Non-binary sources


In the  "first section"  of this chapter we considered a quaternary message source  (M = 4)  with the symbol probabilities  p_{\rm A} = 0.4,   p_{\rm B} = 0.3,   p_{\rm C} = 0.2  and  p_{\rm D} = 0.1.  This source has the following entropy:

H_{\rm quat} = 0.4 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.4} + 0.3 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.3} + 0.2 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.2}+ 0.1 \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.1}.

For the numerical calculation, the detour via the decimal logarithm  \lg \ x = {\rm log}_{10} \ x  is often useful, since the  "logarithm dualis"  {\rm log}_2 \ x  is usually not available on pocket calculators:

H_{\rm quat}=\frac{1}{{\rm lg}\hspace{0.1cm}2} \cdot \left [ 0.4 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.4} + 0.3 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.3} + 0.2 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.2} + 0.1 \cdot {\rm lg}\hspace{0.1cm}\frac{1}{0.1} \right ] = 1.845\,{\rm bit} \hspace{0.05cm}.
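
This detour can also be reproduced in code  (illustrative sketch;  log2_via_lg is an assumed helper name):

    import math

    def log2_via_lg(x):
        """Logarithm dualis via the decimal logarithm:  log2(x) = lg(x) / lg(2)."""
        return math.log10(x) / math.log10(2)

    p = [0.4, 0.3, 0.2, 0.1]
    H_quat = sum(p_mu * log2_via_lg(1/p_mu) for p_mu in p)
    print(H_quat)                      # approx. 1.85 bit/symbol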

\text{Example 3:}  Now there are certain symmetries between the symbol probabilities:

Entropy of binary source and quaternary source
p_{\rm A} = p_{\rm D} = p \hspace{0.05cm},\hspace{0.4cm}p_{\rm B} = p_{\rm C} = 0.5 - p \hspace{0.05cm},\hspace{0.3cm}{\rm with} \hspace{0.15cm}0 \le p \le 0.5 \hspace{0.05cm}.

In this case, the binary entropy function can be used to calculate the entropy:

H_{\rm quat} = 2 \cdot p \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{\hspace{0.1cm}p\hspace{0.1cm} } + 2 \cdot (0.5-p) \cdot {\rm log}_2\hspace{0.1cm}\frac{1}{0.5-p}

\Rightarrow \hspace{0.3cm} H_{\rm quat} = 1 + H_{\rm bin}(2p) \hspace{0.05cm}.

The graphic shows, as a function of  p,

  • the entropy of the quaternary source (blue)
  • compared with the entropy curve of the binary source (red).


For the quaternary source, only the abscissa range  0 ≤ p ≤ 0.5  is permissible.

⇒   You can see from the blue curve for the quaternary source:

  • The maximum entropy  H_\text{max} = 2 \; \rm bit/symbol  results for  p = 0.25   ⇒   equally probable symbols:   p_{\rm A} = p_{\rm B} = p_{\rm C} = p_{\rm D} = 0.25.
  • With  p = 0  the quaternary source degenerates to a binary source with  p_{\rm B} = p_{\rm C} = 0.5,   p_{\rm A} = p_{\rm D} = 0   ⇒   H = 1 \; \rm bit/symbol.  The same applies to  p = 0.5,  where  p_{\rm A} = p_{\rm D} = 0.5  and  p_{\rm B} = p_{\rm C} = 0.
  • The source with  p_{\rm A} = p_{\rm D} = 0.1  and  p_{\rm B} = p_{\rm C} = 0.4  has the following characteristics (each with the pseudo unit "bit/symbol"):
    (1)   entropy:   H = 1 + H_{\rm bin} (2p) =1 + H_{\rm bin} (0.2) = 1.722,
    (2)   redundancy:   {\rm \Delta }H = {\rm log_2}\hspace{0.1cm} M - H =2- 1.722= 0.278,
    (3)   relative redundancy:   r ={\rm \Delta }H/({\rm log_2}\hspace{0.1cm} M) = 0.139\hspace{0.05cm}.
  • The redundancy of the quaternary source with  p = 0.1  is  ΔH = 0.278 \; \rm bit/symbol   ⇒   exactly the same as the redundancy of the binary source with  p = 0.2.
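
The relation  H_quat = 1 + H_bin(2p)  and the values stated for  p = 0.1  can be checked with the following sketch  (the helper names are assumptions;  h_bin as in the earlier sketch):

    import math

    def h_bin(p):
        """Binary entropy function in bit."""
        return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1 - p)*math.log2(1 - p)

    def h_quat(p):
        """Entropy of the symmetric quaternary source p_A = p_D = p, p_B = p_C = 0.5 - p."""
        return sum(-q * math.log2(q) for q in (p, p, 0.5 - p, 0.5 - p) if q > 0)

    p = 0.1
    print(h_quat(p), 1 + h_bin(2*p))   # both approx. 1.722 bit/symbol
    delta_H = 2.0 - h_quat(p)          # H_max = log2(4) = 2 bit/symbol
    print(delta_H, delta_H / 2.0)      # approx. 0.278 and relative redundancy approx. 0.139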


Exercises for the chapter


Exercise 1.1: Entropy of the Weather

Exercise 1.1Z: Binary Entropy Function

Exercise 1.2: Entropy of Ternary Sources


References

  1. Shannon, C.E.: A Mathematical Theory of Communication. In: Bell Syst. Techn. J. 27 (1948), pp. 379-423 and pp. 623-656.