
Exercise 1.8: Synthetically Generated Texts

From LNTwww
Revision as of 14:09, 23 June 2021 by Guenter (talk | contribs)

Two synthetically generated text files

With the Windows programme  "Discrete-Value Information Theory"  from the Chair of Communications Engineering at the TU Munich

  • one can determine the frequencies of character triplets such as  "aaa",  "aab", ... ,  "xyz", ...  from a given text file  "TEMPLATE"  and save them in an auxiliary file,
  • then create a file  "SYNTHESIS"  in which each new character is generated from the last two generated characters and the stored triplet frequencies.
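The two steps above can be sketched in a few lines of Python. This is only an illustration of the principle, not the code of the Windows programme itself; the function names `count_triplets` and `synthesize`, the `seed` parameter and the restart rule for dead ends are assumptions made for this sketch.

```python
import random
from collections import Counter, defaultdict


def count_triplets(text):
    """Count how often each character triplet occurs in the template text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def synthesize(triplet_counts, length, seed=None):
    """Generate text in which each new character is drawn from the triplet
    frequencies, conditioned on the last two generated characters."""
    if not triplet_counts:
        raise ValueError("the template must contain at least three characters")
    rng = random.Random(seed)

    # Group the counts by their two-character prefix.
    successors = defaultdict(Counter)
    for triplet, count in triplet_counts.items():
        successors[triplet[:2]][triplet[2]] += count

    triplets = list(triplet_counts)
    weights = list(triplet_counts.values())

    # Start with the first two characters of a frequency-weighted random triplet.
    out = list(rng.choices(triplets, weights=weights)[0][:2])
    while len(out) < length:
        options = successors.get("".join(out[-2:]))
        if not options:
            # Assumed rule: a prefix without any known successor triggers a restart.
            out.extend(rng.choices(triplets, weights=weights)[0][:2])
            continue
        chars = list(options)
        out.append(rng.choices(chars, weights=[options[c] for c in chars])[0])
    return "".join(out[:length])


if __name__ == "__main__":
    template = "the cat and the hat and the bat "
    print(synthesize(count_triplets(template), 60, seed=1))
```

Because only the two predecessors are taken into account, the output reproduces the triplet statistics of the template but carries no longer-range structure.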


Starting with the German and English Bible translations, we have thus synthesised two files, which are indicated in the graphic:

  • File 1  (red border),
  • File 2  (green border).


It is not indicated which file comes from which template.  Determining this is your first task.

The two templates are based on the natural alphabet  (26  letters)  and the "Blank Space"  ("BS")   ⇒   M=27.  In the German Bible, the umlauts have been replaced, for example "ä"   ⇒   "ae".


  File 1  has the following characteristics:

  • The most frequent characters are  "BS"  with  19.8%, followed by  "e"  with  10.2%  and  "a"  with  8.5%.
  • After  "BS",  "t"  occurs most frequently with  17.8%.
  • Before  "BS",  "d"  is most likely.
  • The entropy approximations in each case with the unit  "bit/character"  were determined as follows:
$$H_0 = 4.76,\quad H_1 = 4.00,\quad H_2 = 3.54,\quad H_3 = 3.11,\quad H_4 = 2.81.$$

In contrast, the analysis of  File 2  yields:

  • The most frequent characters are  "BS"  with  17.6%  followed by  "e"  with  14.4%  and  "n"  with  8.9%.
  • After  "BS",  "d"  is most likely  (15.1%)  followed by  "s"  with  10.8%.
  • After  "BS"  and  "d",  the vowels  "e"  (48.3%),  "i"  (23%)  and  "a"  (20.2%)  are dominant.
  • The entropy approximations differ only slightly from those of  File 1.
  • For larger  k–values, these are slightly larger, for example  H3=3.17  instead of  H3=3.11.




Hints:


Questions

1

Which templates were used for the text synthesis shown here?

File 1  (red) is based on an English template.
File 1  (red) is based on a German template.

2

Compare the mean word lengths of  File 1  and  File 2.

The words of the  "English"  file are longer on average.
The words of the  "German"  file are longer on average.

3

Which statements apply to the entropy approximations?

"TEMPLATE"  and  "SYNTHESIS"  provide a nearly equal  H1.
"TEMPLATE"  and  "SYNTHESIS"  provide a nearly equal  H2.
"TEMPLATE"  and  "SYNTHESIS"  provide a nearly equal  H3.
"TEMPLATE"  and  "SYNTHESIS"  provide a nearly equal  H4.

4

Which statements are true for the  "English"  text?

Most words begin with  "t".
Most words end with  "t".

5

Which statements could be true for the  "German"  text?

After  "de",  "r"  is most likely.
After  "da",  "s"  is most likely.
After  "di",  "e"  is most likely.


Solution

(1)  The correct solution is suggestion 1.

  • In  File 1  you can recognise many English words, in  File 2  many German words.
  • Neither text makes sense.


(2)  Suggestion 2 is correct. The estimations of Shannon and Küpfmüller confirm our result:

  • The probability of a blank space  "BS"  in  File 1  (English)  is  19.8%.
  • So on average every  1/0.198=5.05–th character is  "BS".  
  • The average word length is therefore
$$L_{\rm M} = \frac{1}{0.198} - 1 \approx 4.05\ \text{characters}.$$
  • Correspondingly, for  File 2  (German):
$$L_{\rm M} = \frac{1}{0.176} - 1 \approx 4.68\ \text{characters}.$$
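The arithmetic of this estimate can be reproduced with a short Python sketch. The function name `mean_word_length` is chosen here for illustration only; the blank-space probabilities are the values given for the two files.

```python
def mean_word_length(p_blank):
    """Average word length if a fraction p_blank of all characters are blanks."""
    # On average every (1 / p_blank)-th character is a blank;
    # the characters in between form one word.
    return 1.0 / p_blank - 1.0


print(f"File 1 (English): {mean_word_length(0.198):.2f} characters")  # 4.05
print(f"File 2 (German):  {mean_word_length(0.176):.2f} characters")  # 4.68
```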


(3)  The first three statements are correct, but not statement  (4):

  • To determine the entropy approximation  Hk,  k–tuples must be evaluated, for example the triples  "aaa",  "aab",  ...  for  k=3.
  • According to the generation rule  "new character depends on the two predecessors",  H1,  H2  and  H3  of  "TEMPLATE"  and  "SYNTHESIS"  will match,
    but only approximately due to the finite file length.
  • In contrast, the  H4  approximations differ more strongly because the third predecessor is not taken into account during generation.
  • It is only known that  H4<H3  must also apply with regard to "SYNTHESIS".
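The comparison between  "TEMPLATE"  and  "SYNTHESIS"  could be checked numerically along the following lines. This is a minimal sketch and assumes the usual definition of the k-th entropy approximation, namely the entropy of the k-tuple distribution divided by k, with relative frequencies used as probability estimates. The function name `entropy_approximation` is illustrative and not taken from the programme described above.

```python
import math
from collections import Counter


def entropy_approximation(text, k):
    """Estimate H_k in bit/character from the relative k-tuple frequencies.

    Assumed definition: H_k = (1/k) * sum over all k-tuples of p * log2(1/p).
    """
    if k < 1 or len(text) < k:
        raise ValueError("k must be between 1 and the length of the text")
    # Count all overlapping k-tuples of the text.
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    total = sum(counts.values())
    tuple_entropy = -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )
    return tuple_entropy / k


if __name__ == "__main__":
    sample = "the cat and the hat and the bat " * 20
    for k in range(1, 5):
        print(f"H{k} = {entropy_approximation(sample, k):.2f} bit/character")
```

Applied to a template and to the file synthesised from it, the values for k = 1, 2, 3 should agree closely for sufficiently long files, whereas the values for k = 4 may differ.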


(4)  Only statement 1 is correct here:

Occurrence of "...the..." in the English text
  • After a  "BS"  (beginning of a word),  "t" follows with  17.8%, while at the end of a word  (before a space),  "t"  occurs only with the frequency  8.3%.
  • Overall, the probability of  "t"  averaged over all positions in the word is  7.4%.
  • After  "BS"  and  "t",  the third character is  "h"  with almost  82%,  and after  "th",  "e"  is most likely with  62%.
  • This suggests that  "the"  occurs more often than average in an English text and thus also in the synthetic  File 1, as the following graph shows.
  • But not every marked  "the"  occurs in isolation   ⇒   immediately preceded and followed by a space.
Occurrence of  "der",  "die"  and  "das"  in the German text


(5)  All statements are true:

  • After  "de",  "r"  is indeed most likely  (32.8%),  followed by  "n"  (28.5%),  "s"  (9.3%)  and  "m"  (9.7%).
  • This can be attributed to the frequent words  "der",  "den",  "des"  and  "dem".
  • "da"  is most likely followed by  "s"  (48.2%).
  • After  "di"  follows  "e"  with the highest probability   (78.7%).


The graph shows  File 2  with all occurrences of  "der",  "die",  "das".