Difference between revisions of "Aufgaben:Exercise 1.8: Synthetically Generated Texts"

From LNTwww
 
(8 intermediate revisions by the same user not shown)
Line 1: Line 1:
  
{{quiz-Header|Buchseite=Informationstheorie und Quellencodierung/Natürliche wertdiskrete Nachrichtenquellen
+
{{quiz-Header|Buchseite=Information_Theory/Natural_Discrete_Sources
 
}}
 
}}
  
[[File:Inf_A_1_8_vers2.png|right|frame|Two synthetically generated text files]]
+
[[File:EN_Inf_A_1_8.png|right|frame|Two synthetically generated text files]]
  
The former practical course attempt  [http://en.lntwww.de/downloads/Sonstiges/Texte/Wertdiskrete_Informationstheorie.pdf Value Discrete Information Theory]  by Günter Söder at the Chair of Communications Engineering at the TU Munich uses the Windows programme  [http://en.lntwww.de/downloads/Sonstiges/Programme/WDIT.zip WDIT].  The links given here lead to the PDF version of the practical course instructions or to the ZIP version of the programme.
+
With the Windows programme  "Discrete-Value Information Theory"  from the Chair of Communications Engineering at the TU Munich  
  
With this programme
+
*one can determine the frequencies of character triplets such as  "aaa",  "aab", ... ,  "xyz", ...  from a given text file  "TEMPLATE"  and save them in an auxiliary file,
 +
* then create a file  "SYNTHESIS"  whereby the new character is generated from the last two generated characters and the stored triple frequencies.
  
*from a given text file  "TEMPLATE"  one can determine the frequencies of letter triplets such as  "aaa",  "aab", ... ,  "xyz", ...  and save them in an auxiliary file,
 
* then create a file  "SYNTHESIS"  whereby the new character is generated from the last two characters and the stored triple frequencies.
 
  
 
+
Starting with the German and English Bible translations, we have thus synthesised two files, which are indicated in the graphic:
Starting with the German and English Bible translations, we have thus synthesised two files, which are indicated in the diagram:
+
* $\text{File 1}$  (red border),
* die  $\text{File 1}$  (red border),
+
* $\text{File 2}$  (green border).
* die  $\text{File 2}$  (green border)
 
  
  
 
It is not indicated which file comes from which template.  Determining this is your first task.
 
It is not indicated which file comes from which template.  Determining this is your first task.
  
The two templates are based on the natural alphabet  $(26$ letters$)$  and the space  ("LZ")   ⇒   $M = 27$.  In the German Bible, the umlauts have been replaced, for example "ä"   ⇒   "ae".
+
The two templates are based on the natural alphabet  $(26$  letters$)$  and the "Blank Space"  ("BS")   ⇒   $M = 27$.  In the German Bible, the umlauts have been replaced, for example "ä"   ⇒   "ae".
  
  
 
  $\text{File 1}$  has the following characteristics:
 
  $\text{File 1}$  has the following characteristics:
* The most frequent characters are "LZ" with  $19.8\%$, followed by "e" with  $10.2\%$  and "a" with  $8.5\%$.
+
* The most frequent characters are   "BS"   with  $19.8\%$, followed by   "e"   with  $10.2\%$  and   "a"   with  $8.5\%$.
* After "LZ" (space), "t" occurs most frequently with  $17.8\%$ .
+
* After   "BS",   "t"   occurs most frequently with  $17.8\%$.
* Before a space, "d" is most likely.
+
* Before   "BS",  "d"  is most likely.
* The entropy approximations in each case with the unit "bit/character" were determined as follows:
+
* The entropy approximations in each case with the unit  "bit/character"  were determined as follows:
 
:$$H_0 = 4.76\hspace{0.05cm},\hspace{0.2cm}
 
:$$H_0 = 4.76\hspace{0.05cm},\hspace{0.2cm}
 
H_1 = 4.00\hspace{0.05cm},\hspace{0.2cm}  
 
H_1 = 4.00\hspace{0.05cm},\hspace{0.2cm}  
Line 34: Line 32:
 
H_4 = 2.81\hspace{0.05cm}.  $$
 
H_4 = 2.81\hspace{0.05cm}.  $$
  
In contrast, the analysis of  $\text{file 2}$:
+
In contrast, the analysis of  $\text{File 2}$:
* The most frequent characters are "LZ" with  $17.6\%$  followed by "e" with  $14.4\%$  and "n" with  $8.9\%$.
+
* The most frequent characters are   "BS"  with  $17.6\%$  followed by  "e"  with  $14.4\%$  and  "n"  with  $8.9\%$.
* After "LZ", "d" is the most likely  $(15.1\%)$  followed by "s" with  $10.8\%$.
+
* After   "BS",  "d"  is most likely  $(15.1\%)$  followed by  "s"  with  $10.8\%$.
* After "LZ" and "d", the vowels "e"  $(48.3\%)$,  "i" $(23\%)$  and "a"  $(20.2\%)$  are dominant.
+
* After   "BS"  and  "d",  the vowels  "e"  $(48.3\%)$,  "i"  $(23\%)$  and  "a"  $(20.2\%)$  are dominant.
* The entropy approximations differ only slightly from those of  $\text{file 1}$.
+
* The entropy approximations differ only slightly from those of  $\text{File 1}$.
 
* For larger  $k$–values, these are slightly larger, for example  $H_3 = 3.17$  instead of  $H_3 = 3.11$.
 
* For larger  $k$–values, these are slightly larger, for example  $H_3 = 3.17$  instead of  $H_3 = 3.11$.
  
Line 47: Line 45:
  
 
''Hints:''  
 
''Hints:''  
*The task belongs to the chapter  [[Information_Theory/Natürliche_wertdiskrete_Nachrichtenquellen|Natural discrete value message sources]].
+
*The exercise belongs to the chapter  [[Information_Theory/Natural_Discrete_Sources|Natural Discrete Sources]].
 
+
*Reference is also made to the page  [[Information_Theory/Natural_Discrete_Sources#Synthetically_generated_texts|Synthetically Generated Texts]].
*Reference is made in particular to the page  [[Information_Theory/Natürliche_wertdiskrete_Nachrichtenquellen#Synthetisch_erzeugte_Texte|Synthetically generated texts]].
 
  
  
Line 58: Line 55:
 
{Which templates were used for the text synthesis shown here??
 
{Which templates were used for the text synthesis shown here??
 
|type="()"}
 
|type="()"}
+ Die  $\text{File 1}$  (red) is based on an English template.
+
+ $\text{File 1}$  (red) is based on an English template.
- Die  $\text{File 1}$  (red) is based on a German template.
+
- $\text{File 1}$  (red) is based on a German template.
  
{Compare the mean word lengths of  $\text{File 1}$  and  $\text{File 2}$ .
+
{Compare the mean word lengths of  $\text{File 1}$  and  $\text{File 2}$.
 
|type="[]"}
 
|type="[]"}
- The words of the "English" file are longer on average.
+
- The words of the  "English"  file are longer on average.
+ The words of the "German" file are longer on average.
+
+ The words of the  "German"  file are longer on average.
  
  
Line 75: Line 72:
  
  
{Which statements are true for the "English" text?
+
{Which statements are true for the  "English"  text?
 
|type="[]"}
 
|type="[]"}
 
+ Most words begin with  "t".
 
+ Most words begin with  "t".
Line 81: Line 78:
  
  
{Which statements could be true for German texts?
+
{Which statements could be true forthe  "German"  texts?
 
|type="[]"}
 
|type="[]"}
+ After  "de"  ist  "r"  is most likely.
+
+ After  "de",  "r"  is most likely.
+ After  "da"  ist  "s"  is most likely.
+
+ After  "da",  "s"  is most likely.
+ After  "di"  ist  "e"  is most likely.
+
+ After  "di",  "e"  is most likely.
  
  
Line 94: Line 91:
 
{{ML-Kopf}}
 
{{ML-Kopf}}
 
'''(1)'''&nbsp; The correct solution is <u>suggestion 1</u>.  
 
'''(1)'''&nbsp; The correct solution is <u>suggestion 1</u>.  
*In&nbsp; $\text{file 1}$&nbsp; you can recognise many English words, in&nbsp; $\text{file 2}$&nbsp; many German words.
+
*In&nbsp; $\text{File 1}$&nbsp; you can recognise many English words, in&nbsp; $\text{File 2}$&nbsp; many German words.
 
*Neither text makes sense.
 
*Neither text makes sense.
  
  
 
'''(2)'''&nbsp; Correct is <u>suggestions 2</u>. The estimations of Shannon and Küpfmüller confirm our result:
 
'''(2)'''&nbsp; Correct is <u>suggestions 2</u>. The estimations of Shannon and Küpfmüller confirm our result:
*The probability of a blank character in&nbsp; $\text{file 1}$&nbsp; (English)&nbsp; $19.8\%$.&nbsp;  
+
*The probability of a blank space&nbsp; "BS"&nbsp; in&nbsp; $\text{File 1}$&nbsp; (English)&nbsp; $19.8\%$.&nbsp;  
*So on average every&nbsp; $1/0.198 = 5.05$&ndash;th character is a space. &nbsp;  
+
*So on average every&nbsp; $1/0.198 = 5.05$&ndash;th character is&nbsp; "BS". &nbsp;  
 
*The average word length is therefore
 
*The average word length is therefore
 
:$$L_{\rm M} = \frac{1}{0.198}-1 \approx 4.05\,{\rm characters}\hspace{0.05cm}.$$
 
:$$L_{\rm M} = \frac{1}{0.198}-1 \approx 4.05\,{\rm characters}\hspace{0.05cm}.$$
*Correspondingly, for&nbsp; $\text{file 2}$&nbsp; (German):
+
*Correspondingly, for&nbsp; $\text{File 2}$&nbsp; (German):
 
:$$L_{\rm M} = \frac{1}{0.176}-1 \approx 4.68\,{\rm characters}\hspace{0.05cm}.$$
 
:$$L_{\rm M} = \frac{1}{0.176}-1 \approx 4.68\,{\rm characters}\hspace{0.05cm}.$$
  
Line 110: Line 107:
 
'''(3)'''&nbsp; The <u>first three statements</u> are correct, but not statement&nbsp;  '''(4)''':
 
'''(3)'''&nbsp; The <u>first three statements</u> are correct, but not statement&nbsp;  '''(4)''':
 
*To determine the entropy approximation&nbsp; $H_k$&nbsp; ,&nbsp; $k$&ndash;tuples must be evaluated, for example, for&nbsp; $k = 3$&nbsp;  the triples &nbsp; "aaa",&nbsp;  "aab", &nbsp; ....  
 
*To determine the entropy approximation&nbsp; $H_k$&nbsp; ,&nbsp; $k$&ndash;tuples must be evaluated, for example, for&nbsp; $k = 3$&nbsp;  the triples &nbsp; "aaa",&nbsp;  "aab", &nbsp; ....  
*According to the generation rule "New character depends on the two predecessors",&nbsp; $H_1$,&nbsp; $H_2$&nbsp; and&nbsp; $H_3$&nbsp; of&nbsp; "TEMPLATE"&nbsp; and&nbsp; "SYNTHESIS"&nbsp; will match, but only approximately due to the finite file length.
+
*According to the generation rule "New character depends on the two predecessors",&nbsp; $H_1$,&nbsp; $H_2$&nbsp; and&nbsp; $H_3$&nbsp; of&nbsp; "TEMPLATE"&nbsp; and&nbsp; "SYNTHESIS"&nbsp; will match, <br>but only approximately due to the finite file length.
*In contrast, the&nbsp; $H_4$& approximations differ more strongly because the third predecessor is not taken into account during generation.
+
*In contrast, the&nbsp; $H_4$&nbsp; approximations differ more strongly because the third predecessor is not taken into account during generation.
*It is only known that&nbsp; "SYNTHESIS"&nbsp; $H_4 < H_3$&nbsp; must also apply with regard to  
+
*It is only known that&nbsp; $H_4 < H_3$&nbsp; must also apply with regard to "SYNTHESIS".
 
 
 
 
  
'''(4)'''&nbsp; Only <u>statement 1</u> is correct here:
 
*Nach einem Leerzeichen (Wortanfang) folgt "t" mit&nbsp; $17.8\%$, während am Wortende (vor einem Leerzeichen) "t" nur mit der Häufigkeit&nbsp; $<8.3\%$&nbsp; auftritt.
 
  
*Insgesamt beträgt die Auftrittswahrscheinlichkeit von "t" über alle Positionen im Wort gemittelt&nbsp; $7.4\%$.
 
*Als dritter Buchstaben nach Leerzeichen und "t" folgt "h" mit fast&nbsp; $82\%$&nbsp; und nach "th" ist "e" mit&nbsp; $62%$&nbsp; am wahrscheinlichsten.&nbsp;
 
*Das lässt daraus schließen, dass "the" in einem englischen Text überdurchschnittlich oft vorkommt und damit auch in der synthetischen&nbsp; $\text{Datei 1}$, wie folgende Grafik zeigt. Aber nicht bei allen Markierungen tritt "the" isoliert auf &nbsp; &#8658; &nbsp; direkt vorher und nachher ein Leerzeichen.
 
  
[[File:Inf_A_1_8d_vers2.png|right|frame|Auftreten von "...the..." im englischen Text]]
+
'''(4)'''&nbsp; Only <u>statement 1</u> is correct here:
<br clear=all>
+
[[File:Inf_A_1_8d_vers2.png|right|frame|Occurrence of "...the..." in the English text]]
'''(5)'''&nbsp; <u>Alle Aussagen</u> treffen zu:
+
*After a&nbsp; "BS"&nbsp; (beginning of a word),&nbsp; "t" follows with&nbsp; $17.8\%$, while at the end of a word&nbsp; (before a space),&nbsp; "t"&nbsp; occurs only with the frequency&nbsp; $8.3\%$.
*Nach "de" ist tatsächlich "r" am wahrscheinlichsten&nbsp; $(32.8\%)$,&nbsp; gefolgt von "n" $(28.5\%)$,&nbsp; "s"&nbsp; $(9.3\%)$&nbsp; und "m"&nbsp; $(9.7\%)$.
+
*Overall, the probability of&nbsp; "t"&nbsp; averaged over all positions in the word is&nbsp; $7.4\%$.
*Dafür verantwortlich könnten&nbsp; "der",&nbsp; "den",&nbsp; "des"&nbsp; und&nbsp; "dem"&nbsp; sein.
+
*The third letter after&nbsp; "BS"&nbsp; and&nbsp;  "t"&nbsp; is&nbsp; "h"&nbsp; with almost&nbsp; $82\%$&nbsp; and after&nbsp; "th",&nbsp; "e"&nbsp; is most likeky with&nbsp; $62\%$.
[[File:Inf_A_1_8e_vers2.png|right|frame|Auftreten von "der", "die" und "das" im deutschen Text]]
+
*This suggests that&nbsp; "the"&nbsp; occurs more often than average in an English text and thus also in the synthetic&nbsp; $\text{File 1}$, as the following graph shows.
* Nach "da" folgt "s" mit größter Wahrscheinlichkeit: &nbsp; $48.2\%$.
+
*But&nbsp; "the"&nbsp; does not occur in isolation in all marks &nbsp; &#8658; &nbsp; immediately preceded and followed by a space.
* Nach "di" folgt "e" mit größter Wahrscheinlichkeit: &nbsp; $78.7\%$.
+
[[File:Inf_A_1_8e_vers2.png|right|frame|Occurrence of&nbsp; "der",&nbsp; "die",&nbsp; "das"&nbsp; in the German text]]
  
  
Die Grafik zeigt die&nbsp; $\text{Datei 2}$&nbsp; mit allen "der", "die" und "das".
+
'''(5)'''&nbsp; <u>All statements</u> are true:
 +
*After&nbsp; "de",&nbsp; "r"&nbsp; is indeed most likely&nbsp; $(32.8\%)$,&nbsp; followed by&nbsp; "n"&nbsp; $(28.5\%)$,&nbsp; "s"&nbsp; $(9.3\%)$&nbsp; and&nbsp; "m"&nbsp; $(9.7\%)$.
 +
*This could be responsible for&nbsp; "der",&nbsp; "den",&nbsp; "des"&nbsp; und&nbsp; "dem".
 +
* "da"&nbsp; is most likely followed by&nbsp; "s"&nbsp; $(48.2\%)$.
 +
* After&nbsp; "di"&nbsp; follows&nbsp; "e"&nbsp; with the highest probability &nbsp; $(78.7\%)$.
  
  
 +
The graph shows&nbsp; $\text{File 2}$&nbsp; with all occurrences of&nbsp;  "der",&nbsp; "die",&nbsp; "das".
 +
 
 
{{ML-Fuß}}
 
{{ML-Fuß}}
  

Latest revision as of 13:07, 10 August 2021

Two synthetically generated text files

With the Windows programme  "Discrete-Value Information Theory"  from the Chair of Communications Engineering at the TU Munich

  • one can determine the frequencies of character triplets such as  "aaa",  "aab", ... ,  "xyz", ...  from a given text file  "TEMPLATE"  and save them in an auxiliary file,
  • then create a file  "SYNTHESIS"  whereby the new character is generated from the last two generated characters and the stored triple frequencies.
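
The two steps above can be sketched in a few lines of Python. This is only a minimal illustration of the idea, not the code of the Windows programme: the function names `count_triplets` and `synthesise`, the seed handling, and the restart rule for a context without a stored successor are all assumptions made for this sketch.

```python
import random
from collections import Counter, defaultdict


def count_triplets(template: str) -> Counter:
    """Count all overlapping character triplets of the template text."""
    return Counter(template[i:i + 3] for i in range(len(template) - 2))


def synthesise(triplets: Counter, length: int, seed: int = 0) -> str:
    """Draw each new character from the stored triplet frequencies,
    conditioned on the last two generated characters."""
    if not triplets:
        raise ValueError("template must contain at least three characters")
    rng = random.Random(seed)

    # Regroup the counts: two-character context -> {next character: count}
    by_context = defaultdict(Counter)
    for triple, n in triplets.items():
        by_context[triple[:2]][triple[2]] += n

    # Start contexts are drawn in proportion to their frequency.
    contexts = list(by_context)
    weights = [sum(by_context[c].values()) for c in contexts]

    out = list(rng.choices(contexts, weights=weights)[0])
    while len(out) < length:
        followers = by_context.get("".join(out[-2:]))
        if not followers:
            # Assumption of this sketch: restart with a fresh context
            # if the current one has no stored successor.
            out.extend(rng.choices(contexts, weights=weights)[0])
            continue
        chars = list(followers)
        counts = [followers[c] for c in chars]
        out.append(rng.choices(chars, weights=counts)[0])
    return "".join(out[:length])


template = "the cat sat on the mat and the rat sat on the hat "
print(synthesise(count_triplets(template), 60, seed=1))
```

With a template as short as the one used here the output mostly reproduces fragments of the template; the effect described in the exercise needs a long template such as a complete book.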


Starting with the German and English Bible translations, we have thus synthesised two files, which are indicated in the graphic:

  • $\text{File 1}$  (red border),
  • $\text{File 2}$  (green border).


It is not indicated which file comes from which template.  Determining this is your first task.

The two templates are based on the natural alphabet  $(26$  letters$)$  and the "Blank Space"  ("BS")   ⇒   $M = 27$.  In the German Bible, the umlauts have been replaced, for example "ä"   ⇒   "ae".


  $\text{File 1}$  has the following characteristics:

  • The most frequent characters are  "BS"  with  $19.8\%$, followed by  "e"  with  $10.2\%$  and  "a"  with  $8.5\%$.
  • After  "BS",  "t"  occurs most frequently with  $17.8\%$.
  • Before  "BS",  "d"  is most likely.
  • The entropy approximations, each in  "bit/character",  were determined as follows:
$$H_0 = 4.76\hspace{0.05cm},\hspace{0.2cm} H_1 = 4.00\hspace{0.05cm},\hspace{0.2cm} H_2 = 3.54\hspace{0.05cm},\hspace{0.2cm} H_3 = 3.11\hspace{0.05cm},\hspace{0.2cm} H_4 = 2.81\hspace{0.05cm}. $$
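
The following sketch shows how such approximations could be estimated from a text file. It assumes the usual definition in which $H_0 = \log_2 M$ and, for $k \ge 1$, $H_k$ is the entropy of the overlapping $k$–tuples divided by $k$; the function name `entropy_approximation` and its parameters are hypothetical and not taken from the programme.

```python
import math
from collections import Counter


def entropy_approximation(text: str, k: int, alphabet_size: int = 27) -> float:
    """Estimate the k-th entropy approximation in bit/character.

    Assumed definition: H_0 = log2(M); for k >= 1 the entropy of the
    relative k-tuple frequencies, divided by k.
    """
    if k == 0:
        return math.log2(alphabet_size)
    tuples = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    total = sum(tuples.values())
    h = -sum(n / total * math.log2(n / total) for n in tuples.values())
    return h / k


sample = "the cat sat on the mat and the rat sat on the hat "
for k in range(5):
    print(k, round(entropy_approximation(sample, k), 3))
```

For a short sample like this one the estimates for larger $k$ come out far too small, because most $k$–tuples are never observed; reliable values require a long text.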

In contrast, the analysis of  $\text{File 2}$  shows:

  • The most frequent characters are  "BS"  with  $17.6\%$  followed by  "e"  with  $14.4\%$  and  "n"  with  $8.9\%$.
  • After  "BS",  "d"  is most likely  $(15.1\%)$  followed by  "s"  with  $10.8\%$.
  • After  "BS"  and  "d",  the vowels  "e"  $(48.3\%)$,  "i"  $(23\%)$  and  "a"  $(20.2\%)$  are dominant.
  • The entropy approximations differ only slightly from those of  $\text{File 1}$.
  • For larger values of  $k$  they are slightly larger, for example  $H_3 = 3.17$  instead of  $H_3 = 3.11$.




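Hints:

  • The exercise belongs to the chapter  Natural Discrete Sources.
  • Reference is also made to the page  Synthetically Generated Texts.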


Questions

1

Which templates were used for the text synthesis shown here?

$\text{File 1}$  (red) is based on an English template.
$\text{File 1}$  (red) is based on a German template.

2

Compare the mean word lengths of  $\text{File 1}$  and  $\text{File 2}$.

The words of the  "English"  file are longer on average.
The words of the  "German"  file are longer on average.

3

Which statements apply to the entropy approximations?

"TEMPLATE"  and  "SYNTHESIS"  provide a nearly equal  $H_1$.
"TEMPLATE"  and  "SYNTHESIS"  provide a nearly equal  $H_2$.
"TEMPLATE"  and  "SYNTHESIS"  provide a nearly equal  $H_3$.
"TEMPLATE"  and  "SYNTHESIS"  provide a nearly equal  $H_4$.

4

Which statements are true for the  "English"  text?

Most words begin with  "t".
Most words end with  "t".

5

Which statements could be true for the  "German"  texts?

After  "de",  "r"  is most likely.
After  "da",  "s"  is most likely.
After  "di",  "e"  is most likely.


Solution

(1)  The correct solution is suggestion 1.

  • In  $\text{File 1}$  you can recognise many English words, in  $\text{File 2}$  many German words.
  • Neither text makes sense.


(2)  The correct solution is suggestion 2. The estimates of Shannon and Küpfmüller confirm our result:

  • The probability of a blank space  "BS"  in  $\text{File 1}$  (English)  is  $19.8\%$.
  • So on average every  $1/0.198 = 5.05$–th character is  "BS".  
  • The average word length is therefore
$$L_{\rm M} = \frac{1}{0.198}-1 \approx 4.05\,{\rm characters}\hspace{0.05cm}.$$
  • Correspondingly, for  $\text{File 2}$  (German):
$$L_{\rm M} = \frac{1}{0.176}-1 \approx 4.68\,{\rm characters}\hspace{0.05cm}.$$
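
The arithmetic above can be checked with a short calculation. The helper name `mean_word_length` is hypothetical; it only restates the formula $L_{\rm M} = 1/p_{\rm BS} - 1$ used in this solution.

```python
def mean_word_length(p_space: float) -> float:
    """Mean word length if every (1/p_space)-th character is a blank space."""
    return 1.0 / p_space - 1.0


english = mean_word_length(0.198)  # File 1
german = mean_word_length(0.176)   # File 2
print(f"English: {english:.2f} characters, German: {german:.2f} characters")
```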


(3)  The first three statements are correct, but statement 4 is not:

  • To determine the entropy approximation  $H_k$,  $k$–tuples must be evaluated, for example, for  $k = 3$  the triples  "aaa",  "aab",  ...
  • According to the generation rule "New character depends on the two predecessors",  $H_1$,  $H_2$  and  $H_3$  of  "TEMPLATE"  and  "SYNTHESIS"  will match,
    but only approximately due to the finite file length.
  • In contrast, the  $H_4$  approximations differ more strongly because the third predecessor is not taken into account during generation.
  • All that is known is that  $H_4 < H_3$  must also hold for  "SYNTHESIS".


(4)  Only statement 1 is correct here:

Occurrence of "...the..." in the English text
  • After  "BS"  (beginning of a word),  "t"  follows with probability  $17.8\%$, while at the end of a word  (before  "BS"),  "t"  occurs only with frequency  $8.3\%$.
  • Overall, the probability of  "t"  averaged over all positions in the word is  $7.4\%$.
  • The third character after  "BS"  and  "t"  is  "h"  with almost  $82\%$,  and after  "th",  "e"  is most likely with  $62\%$.
  • This suggests that  "the"  occurs more often than average in an English text and thus also in the synthetic  $\text{File 1}$, as the following graph shows.
  • But not every marked  "the"  occurs in isolation, that is, immediately preceded and followed by a space.
Occurrence of  "der",  "die",  "das"  in the German text


(5)  All statements are true:

  • After  "de",  "r"  is indeed most likely  $(32.8\%)$,  followed by  "n"  $(28.5\%)$,  "s"  $(9.3\%)$  and  "m"  $(9.7\%)$.
  • The words  "der",  "den",  "des"  and  "dem"  could be responsible for this.
  • "da"  is most likely followed by  "s"  $(48.2\%)$.
  • After  "di"  follows  "e"  with the highest probability   $(78.7\%)$.
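
Conditional frequencies like those quoted in the solutions to (4) and (5) can be obtained by counting which character follows a given context. The sketch below is only an illustration; the function name `follower_probabilities` and the toy sample are assumptions, and the percentages in this exercise come from the full Bible texts, not from this sample.

```python
from collections import Counter


def follower_probabilities(text: str, context: str) -> dict:
    """Relative frequencies of the character that follows `context`,
    sorted from most to least frequent."""
    k = len(context)
    followers = Counter(
        text[i + k] for i in range(len(text) - k) if text[i:i + k] == context
    )
    total = sum(followers.values())
    if total == 0:
        return {}
    return {c: n / total for c, n in followers.most_common()}


sample = "der die das der den die "
print(follower_probabilities(sample, "de"))
print(follower_probabilities(sample, "di"))
print(follower_probabilities(sample, "da"))
```

Passing a context that starts with a space, such as `" t"`, gives the distribution of the second letter of words beginning with "t".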


The graph shows  $\text{File 2}$  with all occurrences of  "der",  "die",  "das".