Difference between revisions of "Aufgaben:Exercise 1.7: Entropy of Natural Texts"

From LNTwww
 
(32 intermediate revisions by 5 users not shown)
Line 1: Line 1:
{{quiz-Header|Buchseite=Informationstheorie und Quellencodierung/Natürliche wertdiskrete Nachrichtenquellen
+
{{quiz-Header|Buchseite=Information_Theory/Natural_Discrete_Sources
 
}}
 
}}
  
[[File:P_ID2319__Inf_A_1_7_neu.png|right|]]
+
[[File:Inf_A_1_7_v2.png|right|frame|Text with erasures or errors]]
:Anfang der 1950er Jahre schätzte Claude E. Shannon die Entropie <i>H</i> der englischen Sprache mit einem bit pro Zeichen ab. Kurze Zeit später kam Karl Küpfmüller bei einer empirischen Untersuchung der deutschen Sprache auf einen Entropiewert von <i>H</i>&nbsp;=&nbsp;1.3&nbsp;bit/Zeichen, also nur etwas größer. Die Ergebnisse von Shannon und Küpfmüller beruhen dabei interessanter Weise auf zwei völlig unterschiedlichen Methoden.
+
In the early 1950s,&nbsp; [https://en.wikipedia.org/wiki/Claude_Shannon Claude E. Shannon]&nbsp; estimated the entropy&nbsp; $H$&nbsp; of the English language at one bit per character.&nbsp; A short time later,&nbsp; [https://en.wikipedia.org/wiki/Karl_K%C3%BCpfm%C3%BCller Karl Küpfmüller]&nbsp; arrived at an entropy value of&nbsp; $H =1.3\ \rm bit/character$, i.e. a somewhat larger value, in an empirical study of the German language.&nbsp; Interestingly, the results of Shannon and Küpfmüller are based on two completely different methods.
  
:Die differierenden Ergebnisse lassen sich eher nicht mit den geringen Differenzen hinsichtlich des Symbolumfangs <i>M</i> erklären:
+
The differing results cannot be explained by the small differences in the range of symbols&nbsp; $M$&nbsp;:
  
:* Shannon ging von 26 Buchstaben und dem Leerzeichen aus &nbsp;&#8658;&nbsp;  <i>M</i> = 27.
+
* Shannon assumed&nbsp; $26$ letters and the space &nbsp;&#8658;&nbsp;  $M = 27$.
 +
* Küpfmüller assumed only&nbsp; $M = 26$&nbsp; letters&nbsp; (i.e. without the&nbsp; "Blank Space").
  
:* Küpfmüller ging von <i>M</i> = 26 Buchstaben aus, ebenfalls ohne zwischen Groß&ndash; und Kleinschreibung zu unterscheiden.
 
  
:Mit dieser Aufgabe soll gezeigt werden, wie sich
+
Both made no distinction between upper and lower case.
  
:* Auslöschungen (<i>Erasures</i>) &#8658; man kennt den Ort eines Fehlers,
+
This task is intended to show how
 +
* "Erasures" &nbsp; &#8658; &nbsp; one knows the location of an error, resp.
 +
* "Errors" &nbsp; &#8658; &nbsp; it is not clear to the reader what is wrong and what is right,
  
:* Zeichenfehler &#8658; es ist nicht offensichtlich, was falsch und was richtig ist,
 
  
:auf die  Verständlichkeit eines Textes auswirken. Unser Text beinhaltet auch die typisch deutschen Buchstaben &bdquo&rdquo;, &bdquo&rdquo;, &bdquo;ü&rdquo; und &bdquo;ß&rdquo; sowie Ziffern und Interpunktion. Außerdem wird zwischen Groß&ndash; und Kleinschreibung unterschieden.
+
have an effect on the comprehensibility of a text.&nbsp; Our text also contains the typical German letters "ä",&nbsp; "ö",&nbsp; "ü"&nbsp; and&nbsp; "ß"&nbsp; as well as numbers and punctuation.&nbsp; In addition, a distinction is made between upper and lower case.
  
:<br><br><br><br>
 
:In der Abbildung ist dieser Text, der von Küpfmüllers Vorgehensweise handelt, in sechs Blöcke der Länge <i>N</i> = 197 bis <i>N</i> = 319 aufgeteilt. Beschrieben ist die Überprüfung seiner ersten Analyse (1.3 bit/Zeichen) auf völlig anderem Wege, die zum Ergebnis 1.51 bit/Zeichen führte.
 
  
:* In den oberen fünf Blöcken erkennt man <i>Erasures</i> mit verschiedenen Wahrscheinlichkeiten zwischen 10% und 50%.
+
In the figure, a text dealing with Küpfmüller's approach is divided into six blocks of length between&nbsp; $N = 197$&nbsp; and&nbsp; $N = 319$.&nbsp; Described is the check of his&nbsp; [[Information_Theory/Natürliche_wertdiskrete_Nachrichtenquellen#A_further_entropy_estimation_by_K.C3.BCpfm.C3.BCller|first analysis]]&nbsp; in a completely different way, which led to the result&nbsp;  $H =1.51\ \rm bit/character$.
  
:* Im letzten Block sind Zeichenfehler mit 20&ndash;prozentiger Verfälschungswahrscheinlichkeit eingefügt.
+
* In the upper five blocks one recognises&nbsp; "Erasures"&nbsp; with different erasure probabilities between&nbsp; $10\%$&nbsp; and&nbsp; $50\%$.
 +
* In the last block, "character errors" with&nbsp; $20$&ndash;error probability are inserted.
  
:Der Einfluss solcher Zeichenfehler auf die Lesbarkeit eines Textes soll in der Teilaufgabe (4) verglichen werden mit dem zweiten (rot umrandeten) Block, für den die Wahrscheinlichkeit eines Erasures ebenfalls 20% beträgt.
 
  
:<b>Hinweis:</b> Die Aufgabe bezieht sich auf das Kapitel 1.3 dieses Buches. Bezug genommen wird auch auf die relative Redundanz einer Folge, wobei mit dem Entscheidungsgehalt <i>H</i><sub>0</sub> und der Entropie <i>H</i> gilt:
+
The influence of such character errors on the readability of a text is to be compared in subtask&nbsp; '''(4)'''&nbsp; with the second block (outlined in red), for which the probability of an erasure is also&nbsp; $20\%$.
 +
 
 +
 
 +
 
 +
 
 +
 
 +
 
 +
 
 +
''Hints:''
 +
*The exercise belongs to the chapter&nbsp; [[Information_Theory/Natürliche_wertdiskrete_Nachrichtenquellen|Natural Discrete Sources]].
 +
*Bezug genommen wird insbesondere auf die beiden Seiten
 +
::[[Information_Theory/Natürliche_wertdiskrete_Nachrichtenquellen#Entropy_estimation_according_to_K.C3.BCpfm.C3.BCller|Entropy estimation according to Küpfmüller]],&nbsp; sowie&nbsp;
 +
::[[Information_Theory/Natural_Discrete_Sources#A_further_entropy_estimation_by_K.C3.BCpfm.C3.BCller|A further entropy estimation by Küpfmüller]].
 +
 +
*For the&nbsp; '''relative redundancy'''&nbsp;  of a sequence with the decision content&nbsp; $H_0$&nbsp; and the entropy&nbsp; $H$&nbsp; applies
 
:$$r = \frac{H_0 - H}{H_0}\hspace{0.05cm}.$$
 
:$$r = \frac{H_0 - H}{H_0}\hspace{0.05cm}.$$
  
 +
*To solve the subtask&nbsp; '''(4)'''&nbsp; you need a good knowledge of the German language.&nbsp;
  
===Fragebogen===
+
 
 +
 
 +
{{GraueBox|TEXT=
 +
$\text{Here is the English translation of the six description blocks:}$
 +
 
 +
$\rm (A)$ &nbsp; &nbsp; How did Küpfmüller arrive at the result H = 1.51 bit/character? Since there were no publications for the statistics of word groups or whole sentences, he estimated the entropy of the German language as follows:
 +
 
 +
$\rm (B)$ &nbsp; &nbsp; A coherent, otherwise arbitrary German text is covered at a certain point. The text before is read and the reader is asked to try to determine the following word from the previous text and the context.
 +
 
 +
$\rm (C)$ &nbsp; &nbsp; With a very large number of such attempts, the percentage of hits gives a measure of the bonds between the words and sentences. It turns out that for one and the same type of text by the same author, a constant final value of the hit ratio is reached relatively quickly, from about 100 to 200 attempts.
 +
 
 +
$\rm (D)$ &nbsp; &nbsp; The hit ratio depends strongly on the type of text. Values between 15% and 33% are obtained for different types of text, with an average value of 22%. However, this also means that on average 22% of all words can be determined from the context.
 +
 
 +
$\rm (E)$ &nbsp; &nbsp; In other words, the number of words in a long text can be reduced by a factor of 0.78 without significantly reducing its message content. By definition, the entropy H = 0.78 times 2.0 = 1.56 bit/character.
 +
 
 +
$\rm (F)$ &nbsp; &nbsp; Küpfmüller checked his result with a comparable empirical investigation of all syllables and came up with the reduction factor 0.54 and H3= 2.8 to the entropy H = 1.51 bit/character.}}
 +
 
 +
 
 +
===Questions===
  
 
<quiz display=simple>
 
<quiz display=simple>
{Von welchem Symbolumfang ist Küpfmüller ausgegangen?
+
{What symbol set size&nbsp; $M$&nbsp; did Küpfmüller assume?
 
|type="{}"}
 
|type="{}"}
$M$ = { 26 3% }
+
$M \ = \ $ { 26 }
  
  
{Welche relative Redundanz ergibt sich aus Küpfmüllers Entropiewert?
+
{What relative redundancy&nbsp; $r$&nbsp; results from Küpfmüller's entropy value?  
 
|type="{}"}
 
|type="{}"}
$r$ = { 72.3 3% } %
+
$r \ = \ $ { 72.3 3% } $\ \%$
  
  
{Wie lässt sich das Ergebnis der Teilaufgabe (2) interpretieren? Gehen Sie jeweils von einer Textdatei mit <i>M</i> = 26 unterschiedlichen Zeichen aus.
+
{How can the result of subtask&nbsp; '''(2)'''&nbsp; be interpreted?&nbsp; Assume a text file with&nbsp; $M = 26$&nbsp; different characters.
 
|type="[]"}
 
|type="[]"}
+ Eine solche Textdatei hinreichender Länge (<i>N</i> &#8594; &#8734;) könnte man mit 1.3 &middot; <i>N</i> Binärsymbolen darstellen.
+
+ Such a text file of sufficient length&nbsp; $(N \to \infty)$&nbsp; could be represented with&nbsp; $1.3 \cdot N$&nbsp; binary symbols.
- Eine solche Textdatei  mit <i>N</i> = 100000 Zeichen könnte man mit 130000 Binärsymbolen darstellen.
+
- Such a text file with&nbsp; $N= 100\hspace{0.05cm}000$&nbsp; characters could be represented with&nbsp; $130\hspace{0.1cm}000$&nbsp; binary symbols.
- Ein Leser kann den Text auch dann noch verstehen (oder zumindest erahnen), wenn 70% der Zeichen ausgelöscht sind.
+
- A reader can still understand (or at least guess) the text even if&nbsp; $70\%$&nbsp; of the characters are erased.
  
  
{Was erschwert die Verständlichkeit eines Textes mehr?
+
{What makes a text more difficult to understand?
|type="[]"}
+
|type="()"}
- 20% Auslöschungen (<i>Erasures</i>),
+
- $20\%$&nbsp; erasures,
+ eine Zeichenfehlerwahrscheinlichkeit von 20%.
+
+ $20\%$&nbsp; errors.
  
  
Line 61: Line 93:
 
</quiz>
 
</quiz>
  
===Musterlösung===
+
===Solution===
 
{{ML-Kopf}}
 
{{ML-Kopf}}
:<b>1.</b>&nbsp;&nbsp;Der Symbolumfang bei Küpfmüllers Untersuchungen war <u><i>M</i> = 26</u>, da er im Gegensatz zu Shannon das Leerzeichen zunächst nicht berücksichtigte. Bei dem vorgegebenen deutschen Text dieser Aufgabe ist der Symbolumfang deutlich größer,
+
'''(1)'''&nbsp; The symbol set size in Küpfmüller's investigations was&nbsp; $\underline{M = 26}$,&nbsp; because, in contrast to Shannon, he did not initially take the space into account.
  
:* da hier auch die typisch deutschen Zeichen &bdquo;ä&rdquo;, &bdquo&rdquo;, &bdquo;ü&rdquo; und &bdquo;ß&rdquo; vorkommen,
+
In the given German text of this task, the symbol set size is significantly larger,
 +
* since the typical German characters "ä",&nbsp; "ö",&nbsp; "ü"&nbsp; and&nbsp; "ß"&nbsp; also occur,
 +
* there is a distinction between upper and lower case,
 +
* and there are also numbers and punctuation marks.
  
:* zwischen Klein&ndash; und Großschreibung unterschieden wird,
 
  
:* und zudem noch Ziffern und Interpunktionszeichen hinzukommen.
+
'''(2)'''&nbsp; With the decision content&nbsp; $H_0 = \log_2 \ (31) \approx 4.7 \ \rm bit/character$&nbsp; and the entropy&nbsp; $H = 1.3\ \rm bit/character $&nbsp;, one obtains for the relative redundancy:
 +
:$$r = \frac{H_0 - H}{H_0}= \frac{4.7 - 1.3}{4.7}\underline {\hspace{0.1cm}\approx 72.3\,\%}\hspace{0.05cm}.$$
  
:<b>2.</b>&nbsp;&nbsp;Mit dem Entscheidungsgehalt <i>H</i><sub>0</sub> = log<sub>2</sub> (31) &asymp; 4.7 und der Entropie <i>H</i> = 1.3 (jeweils mit der Einheit bit/Zeichen) erhält man für die relative Redundanz:
 
:$$r = \frac{H_0 - H}{H_0}= \frac{4.7 - 1.3}{4.7}\underline {\hspace{0.1cm}\approx 72.3\,\%}\hspace{0.05cm}.$$
 
  
:<b>3.</b>&nbsp;&nbsp;Richtig ist nur der <u>erste Lösungsvorschlag</u>. Laut Küpfmüller benötigt man nur 1.3 Binärzeichen pro Quellenzeichen. Bei einer Datei der Länge <i>N</i> würden also 1.3 &middot; <i>N</i> Binärsymbole ausreichen, allerdings nur dann, wenn
+
'''(3)'''&nbsp; Only the <u>first suggested solution</u> is correct:
 +
*According to Küpfmüller, one only needs&nbsp; $1.3$&nbsp; binary characters per source character.
 +
*With a file of length&nbsp; $N$&nbsp;,&nbsp; $1.3 \cdot N$&nbsp; binary symbols would therefore be sufficient, but only if the source symbol sequence is infinitely long&nbsp; $(N \to \infty)$&nbsp; and this was encoded in the best possible way.
 +
*In contrast, Küpfmüller's result and the relative redundancy of more than&nbsp; $70\%$ calculated in subtask&nbsp; '''(2)'''&nbsp; do not mean that a reader can still understand the text if&nbsp; $70\%$&nbsp; of the characters have been erased.
 +
*Such a text is neither infinitely long nor has it been optimally encoded beforehand.
 +
 
  
:* die Quellensymbolfolge unendlich lang ist (<i>N</i> &#8594; &#8734;), und
 
  
:* diese bestmöglich codiert ist.
+
'''(4)'''&nbsp; Correct is <u>statement 2</u>:
 +
*Test it yourself: &nbsp; The second block of the graphic on the information page is easier to decode than the last block, because you know where there are errors.
 +
*If you want to keep trying: &nbsp; The exact same sequence of character errors was used for the bottom block&nbsp; '''(F)'''&nbsp; as for block&nbsp; '''(B)'''&nbsp; , that is, there are errors at characters&nbsp; $6$,&nbsp; $35$,&nbsp; $37$,&nbsp; and so on.
  
:Dagegen besagt Küpfmüllers Ergebnis und die in der Teilaufgabe (b) errechnete relative Redundanz von mehr als 70% nicht, dass ein Leser den Text noch verstehen kann, wenn 70% der Zeichen ausgelöscht sind. Ein solcher Text
 
  
:* ist nie unendlich lang,
+
[[File:Inf_A_1_7d_v2.png|right|frame|Original texts]]
 +
Finally, the original text is given, which is only reproduced on the information page distorted by erasures or real character errors.
  
:* noch wurde er vorher optimal codiert.
 
  
:<b>4.</b>&nbsp;&nbsp;Richtig ist <u>Aussage 2</u>. Testen Sie es selbst: Der zweite Block der Grafik auf der Angabenseite ist leichter zu entschlüsseln als der letzte Block, weil man weiß, wo es Fehler gibt. Wenn Sie es weiter versuchen wollen: Für den unteren Block wurde genau die gleiche Zeichenfehlerfolge wie für Block 2 verwendet, das heißt, Fehler gibt es bei den Zeichen 6, 35, 37, usw..
 
  
:Abschließend soll noch der Originaltext angegeben werden, der auf der Angabenseite nur durch Auslöschungen (<i>Erasures</i>) oder echte Zeichenfehler verfälscht wiedergegeben ist.
 
[[File:P_ID2318__Inf_A_1_7d.png|center|]]
 
 
{{ML-Fuß}}
 
{{ML-Fuß}}
  
  
  
[[Category:Aufgaben zu Informationstheorie und Quellencodierung|^1.3 Natürliche wertdiskrete Nachrichtenquellen^]]
+
[[Category:Information Theory: Exercises|^1.3 Natural Discrete Sources^]]

Latest revision as of 13:06, 10 August 2021

Text with erasures or errors

In the early 1950s,  Claude E. Shannon  estimated the entropy  $H$  of the English language at one bit per character.  A short time later,  Karl Küpfmüller  arrived at an entropy value of  $H =1.3\ \rm bit/character$, i.e. a somewhat larger value, in an empirical study of the German language.  Interestingly, the results of Shannon and Küpfmüller are based on two completely different methods.

The differing results cannot be explained by the small differences in the range of symbols  $M$ :

  • Shannon assumed  $26$ letters and the space  ⇒  $M = 27$.
  • Küpfmüller assumed only  $M = 26$  letters  (i.e. without the  "Blank Space").


Both made no distinction between upper and lower case.

This task is intended to show how

  • "Erasures"   ⇒   one knows the location of an error, resp.
  • "Errors"   ⇒   it is not clear to the reader what is wrong and what is right,


have an effect on the comprehensibility of a text.  Our text also contains the typical German letters "ä",  "ö",  "ü"  and  "ß"  as well as numbers and punctuation.  In addition, a distinction is made between upper and lower case.


In the figure, a text dealing with Küpfmüller's approach is divided into six blocks of length between  $N = 197$  and  $N = 319$.  Described is the check of his  first analysis  in a completely different way, which led to the result  $H =1.51\ \rm bit/character$.

  • In the upper five blocks one recognises  "Erasures"  with different erasure probabilities between  $10\%$  and  $50\%$.
  • In the last block, "character errors" with  $20$–error probability are inserted.


The influence of such character errors on the readability of a text is to be compared in subtask  (4)  with the second block (outlined in red), for which the probability of an erasure is also  $20\%$.




Hints:

Entropy estimation according to Küpfmüller,  sowie 
A further entropy estimation by Küpfmüller.
  • For the  relative redundancy  of a sequence with the decision content  $H_0$  and the entropy  $H$  applies
$$r = \frac{H_0 - H}{H_0}\hspace{0.05cm}.$$
  • To solve the subtask  (4)  you need a good knowledge of the German language. 


$\text{Here is the English translation of the six description blocks:}$

$\rm (A)$     How did Küpfmüller arrive at the result H = 1.51 bit/character? Since there were no publications for the statistics of word groups or whole sentences, he estimated the entropy of the German language as follows:

$\rm (B)$     A coherent, otherwise arbitrary German text is covered at a certain point. The text before is read and the reader is asked to try to determine the following word from the previous text and the context.

$\rm (C)$     With a very large number of such attempts, the percentage of hits gives a measure of the bonds between the words and sentences. It turns out that for one and the same type of text by the same author, a constant final value of the hit ratio is reached relatively quickly, from about 100 to 200 attempts.

$\rm (D)$     The hit ratio depends strongly on the type of text. Values between 15% and 33% are obtained for different types of text, with an average value of 22%. However, this also means that on average 22% of all words can be determined from the context.

$\rm (E)$     In other words, the number of words in a long text can be reduced by a factor of 0.78 without significantly reducing its message content. By definition, the entropy H = 0.78 times 2.0 = 1.56 bit/character.

$\rm (F)$     Küpfmüller checked his result with a comparable empirical investigation of all syllables and came up with the reduction factor 0.54 and H3= 2.8 to the entropy H = 1.51 bit/character.


Questions

1

What symbol set size  $M$  did Küpfmüller assume?

$M \ = \ $

2

What relative redundancy  $r$  results from Küpfmüller's entropy value?

$r \ = \ $

$\ \%$

3

How can the result of subtask  (2)  be interpreted?  Assume a text file with  $M = 26$  different characters.

Such a text file of sufficient length  $(N \to \infty)$  could be represented with  $1.3 \cdot N$  binary symbols.
Such a text file with  $N= 100\hspace{0.05cm}000$  characters could be represented with  $130\hspace{0.1cm}000$  binary symbols.
A reader can still understand (or at least guess) the text even if  $70\%$  of the characters are erased.

4

What makes a text more difficult to understand?

$20\%$  erasures,
$20\%$  errors.


Solution

(1)  The symbol set size in Küpfmüller's investigations was  $\underline{M = 26}$,  because, in contrast to Shannon, he did not initially take the space into account.

In the given German text of this task, the symbol set size is significantly larger,

  • since the typical German characters "ä",  "ö",  "ü"  and  "ß"  also occur,
  • there is a distinction between upper and lower case,
  • and there are also numbers and punctuation marks.


(2)  With the decision content  $H_0 = \log_2 \ (31) \approx 4.7 \ \rm bit/character$  and the entropy  $H = 1.3\ \rm bit/character $ , one obtains for the relative redundancy:

$$r = \frac{H_0 - H}{H_0}= \frac{4.7 - 1.3}{4.7}\underline {\hspace{0.1cm}\approx 72.3\,\%}\hspace{0.05cm}.$$


(3)  Only the first suggested solution is correct:

  • According to Küpfmüller, one only needs  $1.3$  binary characters per source character.
  • With a file of length  $N$ ,  $1.3 \cdot N$  binary symbols would therefore be sufficient, but only if the source symbol sequence is infinitely long  $(N \to \infty)$  and this was encoded in the best possible way.
  • In contrast, Küpfmüller's result and the relative redundancy of more than  $70\%$ calculated in subtask  (2)  do not mean that a reader can still understand the text if  $70\%$  of the characters have been erased.
  • Such a text is neither infinitely long nor has it been optimally encoded beforehand.


(4)  Correct is statement 2:

  • Test it yourself:   The second block of the graphic on the information page is easier to decode than the last block, because you know where there are errors.
  • If you want to keep trying:   The exact same sequence of character errors was used for the bottom block  (F)  as for block  (B)  , that is, there are errors at characters  $6$,  $35$,  $37$,  and so on.


Original texts

Finally, the original text is given, which is only reproduced on the information page distorted by erasures or real character errors.