Difference between revisions of "Information Theory/Natural Discrete Sources"

From LNTwww
 
(17 intermediate revisions by 4 users not shown)
Line 1: Line 1:
 
   
 
   
 
{{Header
 
{{Header
|Untermenü=Entropie wertdiskreter Nachrichtenquellen
+
|Untermenü=Entropy of Discrete Sources
|Vorherige Seite=Nachrichtenquellen mit Gedächtnis
+
|Vorherige Seite=Discrete Sources with Memory
|Nächste Seite=Allgemeine Beschreibung
+
|Nächste Seite=General_Description
 
}}
 
}}
  
 
==Difficulties with the determination of entropy ==
 
==Difficulties with the determination of entropy ==
 
<br>
 
<br>
Up to now, we have been dealing exclusively with artificially generated symbol sequences.&nbsp; Now we consider written texts.&nbsp; Such a text can be seen as a natural discrete-value message source, which of course can also be analyzed information-theoretically by determining its entropy.
+
Up to now, we have been dealing exclusively with artificially generated symbol sequences.&nbsp; Now we consider written texts.&nbsp; Such a text can be seen as a natural discrete message source, which of course can also be analyzed information-theoretically by determining its entropy.
  
 
Even today (2011), natural texts are still often represented with the 8 bit character set according to ANSI ("American National Standards Institute"), although there are several "more modern" encodings;  
 
Even today (2011), natural texts are still often represented with the 8 bit character set according to ANSI ("American National Standards Institute"), although there are several "more modern" encodings;  
Line 14: Line 14:
 
The&nbsp; $M = 2^8 = 256$&nbsp; ANSI characters are used as follows:
 
The&nbsp; $M = 2^8 = 256$&nbsp; ANSI characters are used as follows:
 
* '''No.&nbsp; 0 &nbsp; to &nbsp; 31''': &nbsp; control commands that cannot be printed or displayed,
 
* '''No.&nbsp; 0 &nbsp; to &nbsp; 31''': &nbsp; control commands that cannot be printed or displayed,
 +
 
* '''No.&nbsp; 32 &nbsp; to &nbsp;127''': &nbsp; identical to the characters of the 7 bit ASCII code,
 
* '''No.&nbsp; 32 &nbsp; to &nbsp;127''': &nbsp; identical to the characters of the 7 bit ASCII code,
 +
 
* '''No.&nbsp; 128 &nbsp; to 159''': &nbsp; additional control characters or alphanumeric characters for Windows,
 
* '''No.&nbsp; 128 &nbsp; to 159''': &nbsp; additional control characters or alphanumeric characters for Windows,
 +
 
* '''No.&nbsp; 160 &nbsp; to &nbsp; 255''': &nbsp; identical to the Unicode charts.
 
* '''No.&nbsp; 160 &nbsp; to &nbsp; 255''': &nbsp; identical to the Unicode charts.
  
  
Theoretically, one could also define the entropy here as the border crossing point of the entropy approximation&nbsp; $H_k$&nbsp; for&nbsp; $k \to \infty$,&nbsp; according to the procedure from the&nbsp; [[Information_Theory/Sources_with_Memory#Generalization to k -tuple and boundary crossing|last chapter]].&nbsp; In practice, however, insurmountable numerical limitations can be found here as well:
+
Theoretically, one could also define the entropy here as the limit of the entropy approximation&nbsp; $H_k$&nbsp; for&nbsp; $k \to \infty$,&nbsp; according to the procedure from the&nbsp; [[Information_Theory/Discrete_Sources_with_Memory#Generalization_to_.7F.27.22.60UNIQ-MathJax111-QINU.60.22.27.7F-tuple_and_boundary_crossing|"last chapter"]].&nbsp; In practice, however, insurmountable numerical limitations are encountered here as well:
  
 
*Already for the entropy approximation&nbsp; $H_2$&nbsp; there are&nbsp; $M^2 = 256^2 = 65\hspace{0.1cm}536$&nbsp; possible two-tuples.&nbsp; Thus, the calculation requires the same amount of memory (in bytes). &nbsp; If one assumes that, for sufficiently reliable statistics,&nbsp; $100$&nbsp; occurrences per tuple are needed on average,&nbsp; the length of the source symbol sequence should already be&nbsp; $N > 6.5 · 10^6$.
 
*Already for the entropy approximation&nbsp; $H_2$&nbsp; there are&nbsp; $M^2 = 256^2 = 65\hspace{0.1cm}536$&nbsp; possible two-tuples.&nbsp; Thus, the calculation requires the same amount of memory (in bytes). &nbsp; If one assumes that, for sufficiently reliable statistics,&nbsp; $100$&nbsp; occurrences per tuple are needed on average,&nbsp; the length of the source symbol sequence should already be&nbsp; $N > 6.5 · 10^6$.
 +
 
*The number of possible three-tuples is&nbsp; $M^3 > 16 · 10^6$&nbsp; and thus the required source symbol length is already&nbsp; $N > 1.6 · 10^9$.&nbsp; This corresponds to a book with about&nbsp; $500\hspace{0.1cm}000$&nbsp; pages of&nbsp; $42$&nbsp; lines per page and&nbsp; $80$&nbsp; characters per line.
 
*The number of possible three-tuples is&nbsp; $M^3 > 16 · 10^6$&nbsp; and thus the required source symbol length is already&nbsp; $N > 1.6 · 10^9$.&nbsp; This corresponds to a book with about&nbsp; $500\hspace{0.1cm}000$&nbsp; pages of&nbsp; $42$&nbsp; lines per page and&nbsp; $80$&nbsp; characters per line.
 +
 
*For a natural text the statistical dependencies extend much further than two or three characters.&nbsp; Küpfmüller gives a value of&nbsp; $100$&nbsp; characters for the German language.&nbsp; To determine the 100th entropy approximation one needs&nbsp; $2^{800}$ ≈ $10^{240}$&nbsp; frequencies and, for reliable statistics,&nbsp; $100$&nbsp; times as many characters.
 
*For a natural text the statistical dependencies extend much further than two or three characters.&nbsp; Küpfmüller gives a value of&nbsp; $100$&nbsp; characters for the German language.&nbsp; To determine the 100th entropy approximation one needs&nbsp; $2^{800}$ ≈ $10^{240}$&nbsp; frequencies and, for reliable statistics,&nbsp; $100$&nbsp; times as many characters.
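
The orders of magnitude in the list above can be checked with a few lines of Python.&nbsp; This is only an illustrative sketch: the function names and the parameter `per_tuple` are our own choices, and the default of&nbsp; $100$&nbsp; occurrences per tuple is the assumption stated in the text.

```python
import math


def tuple_count(M: int, k: int) -> int:
    """Number of distinct k-tuples over an alphabet with M symbols."""
    return M ** k


def required_length(M: int, k: int, per_tuple: int = 100) -> int:
    """Sequence length N needed for `per_tuple` occurrences per k-tuple on average."""
    return per_tuple * tuple_count(M, k)


for k in (2, 3):
    print(f"k={k}: tuples={tuple_count(256, k)}, N>{required_length(256, k)}")

# For k = 100 only the decimal exponent is of interest: 256**100 = 10**exponent.
exponent = 100 * math.log10(256)
print(f"k=100: about 10^{exponent:.0f} tuples")
```

The exact integers for&nbsp; $k = 2$&nbsp; and&nbsp; $k = 3$&nbsp; agree with the rounded bounds given above.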
  
  
A justified question is therefore: &nbsp; How did&nbsp; [https://en.wikipedia.org/wiki/Karl_K%C3%BCpfm%C3%BCller Karl Küpfmüller]&nbsp; determined the entropy of the German language in 1954?&nbsp; How did&nbsp; [https://en.wikipedia.org/wiki/Claude_Shannon Claude Elwood Shannon]&nbsp; do the same for the English language, even before Küpfmüller?&nbsp; One thing is revealed beforehand: &nbsp; Not with the approach described above.  
+
A justified question is therefore: &nbsp; How did&nbsp; [https://en.wikipedia.org/wiki/Karl_K%C3%BCpfm%C3%BCller $\text{Karl Küpfmüller}$]&nbsp; determine the entropy of the German language in 1954?&nbsp; How did&nbsp; [https://en.wikipedia.org/wiki/Claude_Shannon $\text{Claude Elwood Shannon}$]&nbsp; do the same for the English language, even before Küpfmüller?&nbsp; One thing is revealed beforehand: &nbsp; Not with the approach described above.  
  
  
 
==Entropy estimation according to Küpfmüller ==
 
==Entropy estimation according to Küpfmüller ==
 
<br>
 
<br>
Karl Küpfmüller has investigated the entropy of German texts in his published assessment &nbsp; [Küpf54]<ref name ='Küpf54'>Küpfmüller, K.:&nbsp; Die Entropie der deutschen Sprache.&nbsp; Fernmeldetechnische Zeitung 7, 1954, S. 265-272.</ref>&nbsp; the following assumptions are made:
+
Karl Küpfmüller investigated the entropy of German texts in&nbsp; [Küpf54]<ref name ='Küpf54'>Küpfmüller, K.:&nbsp; Die Entropie der deutschen Sprache.&nbsp; Fernmeldetechnische Zeitung 7, 1954, S. 265-272.</ref>.&nbsp; He made the following assumptions:
 
*an alphabet with&nbsp; $26$&nbsp; letters&nbsp; (no umlauts and punctuation marks),
 
*an alphabet with&nbsp; $26$&nbsp; letters&nbsp; (no umlauts and punctuation marks),
 +
 
*not taking into account the space character,
 
*not taking into account the space character,
 +
 
*no distinction between upper and lower case.
 
*no distinction between upper and lower case.
  
Line 40: Line 47:
 
:$$H_0 = \log_2 (26) ≈ 4.7\ \rm bit/letter.$$  
 
:$$H_0 = \log_2 (26) ≈ 4.7\ \rm bit/letter.$$  
  
Küpfmueller's estimation is based on the following considerations:
+
Küpfmüller's estimation is based on the following considerations:
  
'''(1)'''&nbsp; The&nbsp; '''first entropy approximation'''&nbsp; results from the letter frequencies in German texts.&nbsp; According to a study of 1939, "e" is with a frequency of &nbsp; $16. 7\%$&nbsp; the most frequent, the rarest is "x" with&nbsp; $0.02\%$.&nbsp; Averaged over all letters we obtain&nbsp;  
+
'''(1)'''&nbsp; The&nbsp; &raquo;'''first entropy approximation'''&laquo;&nbsp; results from the letter frequencies in German texts.&nbsp; According to a study from 1939,&nbsp; "e" is the most frequent letter with&nbsp; $16.7\%$,&nbsp; and the rarest is "x" with&nbsp; $0.02\%$.&nbsp; Averaged over all letters we obtain&nbsp;  
 
:$$H_1 \approx 4.1\,\, {\rm bit/letter}\hspace{0.05 cm}.$$
 
:$$H_1 \approx 4.1\,\, {\rm bit/letter}\hspace{0.05 cm}.$$
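
How such a first entropy approximation is obtained from letter counts can be sketched as follows.&nbsp; The function `first_order_entropy` is a hypothetical helper written for this illustration; it follows Küpfmüller's assumptions&nbsp; ($26$&nbsp; letters, no distinction between upper and lower case, everything else ignored)&nbsp; and is not taken from [Küpf54].

```python
import math
from collections import Counter


def first_order_entropy(text: str) -> float:
    """H_1 in bit/letter, estimated from the relative letter frequencies of `text`.

    Only the 26 letters a-z are counted (case-insensitive); umlauts, digits,
    spaces and punctuation are skipped.
    """
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    n = len(letters)
    counts = Counter(letters)
    # Sum of p * log2(1/p) over all letters that actually occur.
    return -sum((m / n) * math.log2(m / n) for m in counts.values())


# Equally probable letters reproduce the decision content H_0 = log2(26).
print(first_order_entropy("abcdefghijklmnopqrstuvwxyz"))
```

Applied to a sufficiently long German text, the result should lie clearly below&nbsp; $H_0$,&nbsp; because the letters are far from equally probable.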
  
'''(2)'''&nbsp; Regarding the&nbsp; '''syllable frequency'''&nbsp; Küpfmüller evaluates the&nbsp; "Häufigkeitswörterbuch der deutschen Sprache"&nbsp; (Frequency Dictionary of the German Language), published by&nbsp; [https://en.wikipedia.org/wiki/Friedrich_Wilhelm_Kaeding Friedrich Wilhelm Kaeding]&nbsp; in 1898.&nbsp; He distinguishes between root syllables, prefixes, and final syllables and thus arrives at the average information content of all syllables:
+
'''(2)'''&nbsp; Regarding the&nbsp; &raquo;'''syllable frequency'''&laquo;&nbsp; Küpfmüller evaluates the&nbsp; "Häufigkeitswörterbuch der deutschen Sprache"&nbsp; (Frequency Dictionary of the German Language), published by&nbsp; [https://de.wikipedia.org/wiki/Friedrich_Wilhelm_Kaeding $\text{Friedrich Wilhelm Kaeding}$]&nbsp; in 1898.&nbsp; He distinguishes between root syllables, prefixes, and final syllables and thus arrives at the average information content of all syllables:
 
   
 
   
 
:$$H_{\rm syllable} = \hspace{-0.1cm} H_{\rm root} + H_{\rm prefix} + H_{\rm final} + H_{\rm rest} \approx  
 
:$$H_{\rm syllable} = \hspace{-0.1cm} H_{\rm root} + H_{\rm prefix} + H_{\rm final} + H_{\rm rest} \approx  
Line 53: Line 60:
 
:The following proportions were taken into account:
 
:The following proportions were taken into account:
 
:*According to the Kaeding study of 1898, the&nbsp; $400$&nbsp; most common root syllables&nbsp; (beginning with "de")&nbsp; represent $47\%$&nbsp; of a German text and contribute to the entropy with&nbsp; $H_{\text{root}} ≈ 4.15 \ \rm bit/syllable$.
 
:*According to the Kaeding study of 1898, the&nbsp; $400$&nbsp; most common root syllables&nbsp; (beginning with "de")&nbsp; represent $47\%$&nbsp; of a German text and contribute to the entropy with&nbsp; $H_{\text{root}} ≈ 4.15 \ \rm bit/syllable$.
 +
 
:*The contribution of the&nbsp; $242$&nbsp; most common prefixes - in the first place "ge" with&nbsp; $9\%$ - is given by Küpfmüller as&nbsp; $H_{\text{prefix}} ≈ 0.82 \ \rm bit/syllable$.
 
:*The contribution of the&nbsp; $242$&nbsp; most common prefixes - in the first place "ge" with&nbsp; $9\%$ - is given by Küpfmüller as&nbsp; $H_{\text{prefix}} ≈ 0.82 \ \rm bit/syllable$.
 +
 
:*The contribution of the&nbsp; $118$&nbsp; most used final syllables is&nbsp; $H_{\text{final}} ≈ 1.62 \ \rm bit/syllable$.&nbsp; Most often, "en" appears at the end of words, with&nbsp; $30\%$.
 
:*The contribution of the&nbsp; $118$&nbsp; most used final syllables is&nbsp; $H_{\text{final}} ≈ 1.62 \ \rm bit/syllable$.&nbsp; Most often, "en" appears at the end of words, with&nbsp; $30\%$.
 +
 
:*The remaining&nbsp; $14\%$&nbsp; is distributed over syllables not yet measured.&nbsp; Küpfmüller assumes that there are&nbsp; $4000$&nbsp; and that they are equally distributed.&nbsp; He assumes&nbsp; $H_{\text{rest}} ≈ 2 \ \rm bit/syllable$&nbsp; for this.
 
:*The remaining&nbsp; $14\%$&nbsp; is distributed over syllables not yet measured.&nbsp; Küpfmüller assumes that there are&nbsp; $4000$&nbsp; and that they are equally distributed.&nbsp; He assumes&nbsp; $H_{\text{rest}} ≈ 2 \ \rm bit/syllable$&nbsp; for this.
  
  
'''(3)'''&nbsp; As average number of letters per syllable Küpfmüller determined the value&nbsp; $3.03$.&nbsp; From this he deduced the&nbsp; '''third entropy approximation''''&nbsp; regarding the letters:  
+
'''(3)'''&nbsp; As average number of letters per syllable Küpfmüller determined the value&nbsp; $3.03$.&nbsp; From this he deduced the&nbsp; &raquo;'''third entropy approximation'''&laquo;&nbsp; regarding the letters:  
 
:$$H_3 \approx {8.6}/{3.03}\approx 2.8\,\, {\rm bit/letter}\hspace{0.05 cm}.$$
 
:$$H_3 \approx {8.6}/{3.03}\approx 2.8\,\, {\rm bit/letter}\hspace{0.05 cm}.$$
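
The arithmetic behind points&nbsp; '''(2)'''&nbsp; and&nbsp; '''(3)'''&nbsp; can be retraced numerically.&nbsp; The three measured contributions are the values quoted above.&nbsp; For&nbsp; $H_{\rm rest}$&nbsp; the sketch assumes that the remaining&nbsp; $14\%$&nbsp; is spread uniformly over&nbsp; $4000$&nbsp; syllables; whether Küpfmüller computed his value of about&nbsp; $2 \ \rm bit/syllable$&nbsp; in exactly this way is our assumption.

```python
import math

# Contributions in bit/syllable as quoted in the text.
H_root, H_prefix, H_final = 4.15, 0.82, 1.62

# Assumption of this sketch: 14% of all syllables, uniformly spread over 4000 types.
p_rest_total, n_rest = 0.14, 4000
p_each = p_rest_total / n_rest
H_rest = n_rest * p_each * math.log2(1 / p_each)

H_syllable = H_root + H_prefix + H_final + H_rest
letters_per_syllable = 3.03
H_3 = H_syllable / letters_per_syllable

print(f"H_rest     = {H_rest:.2f} bit/syllable")
print(f"H_syllable = {H_syllable:.2f} bit/syllable")
print(f"H_3        = {H_3:.2f} bit/letter")
```

The unrounded results differ slightly from the rounded figures&nbsp; $8.6$&nbsp; and&nbsp; $2.8$&nbsp; in the text, which is to be expected for an estimate of this kind.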
  
'''(4)'''&nbsp; Küpfmueller's estimation of the entropy approximation&nbsp; $H_3$&nbsp; based mainly on the syllable frequencies according to&nbsp; '''(2)'''&nbsp; and the mean value of&nbsp; $3.03$&nbsp; letters per syllable. To get another entropy approximation&nbsp; $H_k$&nbsp; with greater&nbsp; $k$&nbsp; Küpfmüller additionally analyzed the words in German texts.&nbsp; He came to the following results:
+
'''(4)'''&nbsp; Küpfmüller's estimation of the entropy approximation&nbsp; $H_3$&nbsp; is based mainly on the syllable frequencies according to&nbsp; '''(2)'''&nbsp; and the mean value of&nbsp; $3.03$&nbsp; letters per syllable. To get another entropy approximation&nbsp; $H_k$&nbsp; with greater&nbsp; $k$,&nbsp; Küpfmüller additionally analyzed the words in German texts.&nbsp; He came to the following results:
  
:*The&nbsp; $322$&nbsp; most common words provide an entropy contribution of&nbsp; $4.5 \ \rm bit/word$.  
+
:*The&nbsp; $322$&nbsp; most common words provide an entropy contribution of&nbsp; $4.5 \ \rm bit/word$.
:*The contributions of the remaining&nbsp; $40\hspace{0.1cm}000$ words&nbsp; were estimated.&nbsp; Assuming that the frequencies of rare words are reciprocal to their ordinal number ([https://en.wikipedia.org/wiki/Zipf%27s_law Zipf's Law]).  
+
 +
:*The contributions of the remaining&nbsp; $40\hspace{0.1cm}000$ words&nbsp; were estimated under the assumption that the frequencies of rare words are inversely proportional to their rank ([https://en.wikipedia.org/wiki/Zipf%27s_law $\text{Zipf's Law}$]).
 +
 
:*With these assumptions the average information content (related to words) is about &nbsp; $11 \ \rm bit/word$.
 
:*With these assumptions the average information content (related to words) is about &nbsp; $11 \ \rm bit/word$.
  
Line 70: Line 82:
 
'''(5)'''&nbsp; Counting the letters per word gave an average of&nbsp; $5.5$.&nbsp; Analogous to point&nbsp; '''(3)''',&nbsp; the entropy approximation for&nbsp; $k = 5.5$&nbsp; was estimated.&nbsp; Küpfmüller gives the value:&nbsp;  
 
'''(5)'''&nbsp; Counting the letters per word gave an average of&nbsp; $5.5$.&nbsp; Analogous to point&nbsp; '''(3)''',&nbsp; the entropy approximation for&nbsp; $k = 5.5$&nbsp; was estimated.&nbsp; Küpfmüller gives the value:&nbsp;  
 
:$$H_{5.5} \approx {11}/{5.5}\approx 2\,\, {\rm bit/letter}\hspace{0.05 cm}.$$
 
:$$H_{5.5} \approx {11}/{5.5}\approx 2\,\, {\rm bit/letter}\hspace{0.05 cm}.$$
:Of course,&nbsp; $k$&nbsp; can only assume integer values,&nbsp; according to&nbsp; [[Information_Theory/Sources_with_Memory#Generalization to k-tuple and boundary crossing|its definition]].&nbsp; This equation is therefore to be interpreted in such a way that for&nbsp; $H_5$&nbsp; a somewhat larger and for&nbsp; $H_6$&nbsp; a somewhat smaller value than&nbsp; $2 \ {\rm bit/letter}$&nbsp; will result.
+
:Of course,&nbsp; $k$&nbsp; can only assume integer values,&nbsp; according to&nbsp; [[Information_Theory/Sources_with_Memory#Generalization to k-tuple and boundary crossing|$\text{its definition}$]].&nbsp; This equation is therefore to be interpreted in such a way that for&nbsp; $H_5$&nbsp; a somewhat larger and for&nbsp; $H_6$&nbsp; a somewhat smaller value than&nbsp; $2 \ {\rm bit/letter}$&nbsp; will result.
  
  
Line 87: Line 99:
 
'''(8)'''&nbsp; Three years earlier, using a completely different approach, Claude E. Shannon had given the entropy value&nbsp; $H ≈ 1 \ \rm bit/letter$&nbsp; for the English language, but taking into account the space character.&nbsp; In order to be able to compare his results with Shannon's, Küpfmüller subsequently included the space character in his result.  
 
'''(8)'''&nbsp; Three years earlier, using a completely different approach, Claude E. Shannon had given the entropy value&nbsp; $H ≈ 1 \ \rm bit/letter$&nbsp; for the English language, but taking into account the space character.&nbsp; In order to be able to compare his results with Shannon's, Küpfmüller subsequently included the space character in his result.  
  
:*The correction factor is the quotient of the average word length without considering the space&nbsp; $(5.5)$&nbsp; and the average word length with consideration of the space&nbsp; $(5.5+1 = 6.5)$.  
+
:*The correction factor is the quotient of the average word length without considering the space&nbsp; $(5.5)$&nbsp; and the average word length with consideration of the space&nbsp; $(5.5+1 = 6.5)$.
 +
 
:*This correction led to Küpfmüller's final result:&nbsp;  
 
:*This correction led to Küpfmüller's final result:&nbsp;  
:$$H =1.51 \cdot {5.5}/{6.5}\approx 1.3\,\, {\rm bit/letter}\hspace{0.05 cm}.$$
+
::$$H =1.51 \cdot {5.5}/{6.5}\approx 1.3\,\, {\rm bit/letter}\hspace{0.05 cm}.$$
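
The two conversions used here (bits per word to bits per letter, and the correction for the space character) are simple ratios.&nbsp; The following sketch only retraces the numbers given in the text; the helper names `per_letter` and `with_space` are hypothetical.

```python
def per_letter(bits_per_unit: float, letters_per_unit: float) -> float:
    """Convert an entropy per word (or syllable) into bit/letter."""
    return bits_per_unit / letters_per_unit


def with_space(H_without_space: float, word_length: float = 5.5) -> float:
    """Rescale bit/letter to bit/character when one space per word is counted."""
    return H_without_space * word_length / (word_length + 1)


H_55 = per_letter(11, 5.5)    # word-based approximation from point (5)
H_final = with_space(1.51)    # Küpfmüller's value with the space character included

print(f"H_5.5 = {H_55:.2f} bit/letter")
print(f"H     = {H_final:.2f} bit/character")
```

The correction lowers the value because the same information per word is spread over&nbsp; $6.5$&nbsp; instead of&nbsp; $5.5$&nbsp; characters.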
  
  
Line 96: Line 109:
 
<br>
 
<br>
 
For the sake of completeness, Küpfmüller's considerations are presented here, which led him to the final result&nbsp; $H = 1.51 \ \rm bit/letter$.&nbsp; &nbsp; Since there was no documentation for the statistics of word groups or whole sentences, he estimated the entropy value of the German language as follows:
 
For the sake of completeness, Küpfmüller's considerations are presented here, which led him to the final result&nbsp; $H = 1.51 \ \rm bit/letter$.&nbsp; &nbsp; Since there was no documentation for the statistics of word groups or whole sentences, he estimated the entropy value of the German language as follows:
*Any contiguous German text is covered behind a certain word.&nbsp; The preceding text is read and the reader should try to determine the following word from the context of the preceding text.
+
#Any contiguous German text is covered behind a certain word.&nbsp; The preceding text is read and the reader should try to determine the following word from the context of the preceding text.
*For a large number of such attempts, the percentage of hits gives a measure of the relationships between words and sentences.&nbsp; It can be seen that for one and the same type of text (novels, scientific writings, etc.) by one and the same author, a constant final value of this hit ratio is reached relatively quickly&nbsp; (about one hundred to two hundred attempts).
+
#For a large number of such attempts, the percentage of hits gives a measure of the relationships between words and sentences.&nbsp; It can be seen that for one and the same type of text (novels, scientific writings, etc.) by one and the same author, a constant final value of this hit ratio is reached relatively quickly&nbsp; (about one hundred to two hundred attempts).
*The hit ratio, however, depends quite strongly on the type of text.&nbsp; For different texts, values between&nbsp; $15\%$&nbsp; and&nbsp; $33\%$&nbsp;  are obtained, with the mean value at&nbsp; $22\%$.&nbsp; This also means: &nbsp; On average,&nbsp; $22\%$&nbsp; of the words in a German text can be determined from the context.
+
#The hit ratio, however, depends quite strongly on the type of text.&nbsp; For different texts, values between&nbsp; $15\%$&nbsp; and&nbsp; $33\%$&nbsp;  are obtained, with the mean value at&nbsp; $22\%$.&nbsp; This also means: &nbsp; On average,&nbsp; $22\%$&nbsp; of the words in a German text can be determined from the context.
*Alternatively: &nbsp; The word count of a long text can be reduced with the factor&nbsp; $0.78$&nbsp; without a significant loss of the message content of the text.&nbsp; Starting from the reference value&nbsp; $H_{5. 5} = 2 \ \rm bit/letter$&nbsp; $($see dot&nbsp; '''(5)'''&nbsp; in the last section$)$&nbsp; for a word of medium length this results in the entropy&nbsp; $H ≈ 0.78 · 2 = 1.56 \ \rm bit/letter$.
+
#Alternatively: &nbsp; The word count of a long text can be reduced by the factor&nbsp; $0.78$&nbsp; without a significant loss of the message content of the text.&nbsp; Starting from the reference value&nbsp; $H_{5.5} = 2 \ \rm bit/letter$&nbsp; $($see point&nbsp; '''(5)'''&nbsp; in the last section$)$&nbsp; for a word of medium length, this results in the entropy&nbsp; $H ≈ 0.78 · 2 = 1.56 \ \rm bit/letter$.
*Küpfmüller verified this value with a comparable empirical study regarding the syllables and thus determined the reduction factor&nbsp; $0.54$&nbsp; (regarding syllables).&nbsp; Küpfmüller gives&nbsp; $H = 0. 54 · H_3 ≈ 1.51 \ \rm bit/letter$&nbsp; as the final result, where&nbsp; $H_3 ≈ 2.8 \ \rm bit/letter$&nbsp; corresponds to the entropy of a syllable of medium length&nbsp; $($about three letters, see point&nbsp; '''(3)'''&nbsp; on the last page$)$&nbsp;.
+
#Küpfmüller verified this value with a comparable empirical study regarding the syllables and thus determined the reduction factor&nbsp; $0.54$&nbsp; (regarding syllables).&nbsp; Küpfmüller gives&nbsp; $H = 0.54 · H_3 ≈ 1.51 \ \rm bit/letter$&nbsp; as the final result, where&nbsp; $H_3 ≈ 2.8 \ \rm bit/letter$&nbsp; corresponds to the entropy of a syllable of medium length&nbsp; $($about three letters, see point&nbsp; '''(3)'''&nbsp; in the last section$)$.
  
  
The remarks on this and the previous page, which may be perceived as very critical, are not intended to diminish the importance of neither Küpfmüller's entropy estimation, nor Shannon's contributions to the same topic.  
+
The remarks in this and the previous section, which may be perceived as very critical, are not intended to diminish the importance of either Küpfmüller's entropy estimation or Shannon's contributions to the same topic.  
 
*They are only meant to point out the great difficulties that arise in this task.  
 
*They are only meant to point out the great difficulties that arise in this task.  
 +
 
*This is perhaps also the reason why no one has dealt with this problem intensively since the 1950s.
 
*This is perhaps also the reason why no one has dealt with this problem intensively since the 1950s.
  
Line 114: Line 128:
  
 
The symbol set size has been reduced to&nbsp; $M = 33$&nbsp; and includes the characters '''a''',&nbsp; '''b''',&nbsp; '''c''',&nbsp; ...,&nbsp; '''x''',&nbsp; '''y''',&nbsp; '''z''',&nbsp; '''ä''',&nbsp; '''ö''',&nbsp; '''ü''',&nbsp; '''ß''',&nbsp; $\rm BS$,&nbsp; $\rm DI$,&nbsp; $\rm PM$. &nbsp; Our analysis did not differentiate between upper and lower case letters.&nbsp; In contrast to Küpfmüller's analysis, we also took into account:
 
The symbol set size has been reduced to&nbsp; $M = 33$&nbsp; and includes the characters '''a''',&nbsp; '''b''',&nbsp; '''c''',&nbsp; ...,&nbsp; '''x''',&nbsp; '''y''',&nbsp; '''z''',&nbsp; '''ä''',&nbsp; '''ö''',&nbsp; '''ü''',&nbsp; '''ß''',&nbsp; $\rm BS$,&nbsp; $\rm DI$,&nbsp; $\rm PM$. &nbsp; Our analysis did not differentiate between upper and lower case letters.&nbsp; In contrast to Küpfmüller's analysis, we also took into account:
*the German umlauts&nbsp; '''ä''',&nbsp; '''ö''',&nbsp; '''ü'''&nbsp; and&nbsp; '''ß''', which make up about&nbsp; $1.2\%$&nbsp; of the biblical text,
+
#the German umlauts&nbsp; '''ä''',&nbsp; '''ö''',&nbsp; '''ü'''&nbsp; and&nbsp; '''ß''', which make up about&nbsp; $1.2\%$&nbsp; of the biblical text,
*the class&nbsp; "Digits" &nbsp; &rArr; &nbsp; $\rm DI$&nbsp; with about&nbsp; $1.3\%$&nbsp; because of the verse numbering within the bible,
+
#the class&nbsp; "Digits" &nbsp; &rArr; &nbsp; $\rm DI$&nbsp; with about&nbsp; $1.3\%$&nbsp; because of the verse numbering within the bible,
*the class&nbsp; "Punctuation Marks" &nbsp; &rArr; &nbsp; $\rm PM$&nbsp; with about&nbsp; $3\%$,
+
#the class&nbsp; "Punctuation Marks" &nbsp; &rArr; &nbsp; $\rm PM$&nbsp; with about&nbsp; $3\%$,
*the class&nbsp; "Blank Space" &nbsp; &rArr; &nbsp; $\rm BS$&nbsp; as the most common character&nbsp; $(17.8\%)$, even more than the "e"&nbsp; $(12.8\%)$.
+
#the class&nbsp; "Blank Space" &nbsp; &rArr; &nbsp; $\rm BS$&nbsp; as the most common character&nbsp; $(17.8\%)$, even more than the "e"&nbsp; $(12.8\%)$.
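
The reduction of a text to these&nbsp; $M = 33$&nbsp; symbols can be pictured as a simple character mapping.&nbsp; The function `to_symbol` below is a hypothetical sketch and not the program used for the study; in particular, treating every whitespace character as&nbsp; $\rm BS$&nbsp; and using Python's `string.punctuation` for&nbsp; $\rm PM$&nbsp; are assumptions of this example.

```python
import string
from typing import Optional

# 26 letters plus the four German special characters (case is ignored).
LETTERS = set("abcdefghijklmnopqrstuvwxyzäöüß")
CLASSES = ("BS", "DI", "PM")  # blank space, digits, punctuation marks


def to_symbol(ch: str) -> Optional[str]:
    """Map one character to one of the M = 33 symbols, or None if it is ignored."""
    c = ch.lower()
    if c in LETTERS:
        return c
    if c.isspace():
        return "BS"
    if c.isdigit():
        return "DI"
    if c in string.punctuation:
        return "PM"
    return None  # any other character is dropped in this sketch


symbols = [s for s in map(to_symbol, "Am Anfang schuf Gott Himmel und Erde. 1") if s]
print(len(LETTERS) + len(CLASSES), symbols[:6])
```

Dropping the classes&nbsp; $\rm DI$,&nbsp; $\rm PM$&nbsp; and&nbsp; $\rm BS$&nbsp; one after the other from such a symbol sequence yields the smaller symbol sets considered in the table.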
  
  
 
The following table summarizes the results: &nbsp; $N$&nbsp; indicates the analyzed file size in characters (bytes). &nbsp; The decision content&nbsp; $H_0$&nbsp; as well as the entropy approximations&nbsp; $H_1$,&nbsp; $H_2$&nbsp; and&nbsp; $H_3$&nbsp; were each determined from&nbsp; $N$&nbsp; characters and are each given in "bit/character".  
 
The following table summarizes the results: &nbsp; $N$&nbsp; indicates the analyzed file size in characters (bytes). &nbsp; The decision content&nbsp; $H_0$&nbsp; as well as the entropy approximations&nbsp; $H_1$,&nbsp; $H_2$&nbsp; and&nbsp; $H_3$&nbsp; were each determined from&nbsp; $N$&nbsp; characters and are each given in "bit/character".  
  
[[File:EN_Inf_T_1_3_S3.png|left|frame|Entropy values&nbsp; (in bit/characters)&nbsp; of the German Bible]]
+
[[File:EN_Inf_T_1_3_S3_v2.png|left|frame|Entropy values&nbsp; (in bit/character)&nbsp; of the German Bible]]
 
<br>
 
<br>
 
*Please do not consider these results to be scientific research.
 
*Please do not consider these results to be scientific research.
*It is only an attempt to give students an understanding of the subject matter in an internship.  
+
 
 +
*It is only an attempt to give students an understanding of the subject matter in an internship.
 +
 
*The basis of this study was the Bible, since we had both its German and English versions available to us in the appropriate ASCII format.  
 
*The basis of this study was the Bible, since we had both its German and English versions available to us in the appropriate ASCII format.  
 
<br clear=all>
 
<br clear=all>
 
The results of the above table can be summarized as follows:
 
The results of the above table can be summarized as follows:
 
*In all rows the entropy approximations&nbsp; $H_k$&nbsp; decrease monotonically with increasing&nbsp; $k$.&nbsp; The decrease is convex, that is: &nbsp; $H_1 - H_2 > H_2 - H_3$. &nbsp; Extrapolating the final value&nbsp; $(k \to \infty)$&nbsp; from the three entropy approximations determined in each case is not possible&nbsp; (or only extremely vaguely).
 
*In all rows the entropy approximations&nbsp; $H_k$&nbsp; decrease monotonically with increasing&nbsp; $k$.&nbsp; The decrease is convex, that is: &nbsp; $H_1 - H_2 > H_2 - H_3$. &nbsp; Extrapolating the final value&nbsp; $(k \to \infty)$&nbsp; from the three entropy approximations determined in each case is not possible&nbsp; (or only extremely vaguely).
 +
 
*If the evaluation of the digits&nbsp; $\rm (DI)$&nbsp; and additionally the evaluation of the punctuation marks&nbsp; $\rm (PM)$&nbsp; is omitted, the approximations&nbsp; $H_1$&nbsp; $($by&nbsp; $0.114)$,&nbsp; $H_2$&nbsp; $($by&nbsp; $0.063)$&nbsp; and&nbsp; $H_3$&nbsp; $($by&nbsp; $0.038)$&nbsp; decrease. &nbsp; On the final entropy &nbsp; $H$&nbsp; as the limit value of&nbsp; $H_k$&nbsp; for&nbsp; $k \to \infty$&nbsp; the omission of digits and punctuation will probably have little effect.
 
*If the evaluation of the digits&nbsp; $\rm (DI)$&nbsp; and additionally the evaluation of the punctuation marks&nbsp; $\rm (PM)$&nbsp; is omitted, the approximations&nbsp; $H_1$&nbsp; $($by&nbsp; $0.114)$,&nbsp; $H_2$&nbsp; $($by&nbsp; $0.063)$&nbsp; and&nbsp; $H_3$&nbsp; $($by&nbsp; $0.038)$&nbsp; decrease. &nbsp; On the final entropy &nbsp; $H$&nbsp; as the limit value of&nbsp; $H_k$&nbsp; for&nbsp; $k \to \infty$&nbsp; the omission of digits and punctuation will probably have little effect.
 +
 
*If the blank spaces&nbsp; $(\rm BS)$&nbsp; are also left out of consideration&nbsp; $($Row 4 &nbsp; ⇒ &nbsp; $M = 30)$, the result is almost the same constellation as Küpfmüller originally considered.&nbsp; The only difference is the rather rare German special characters '''ä''',&nbsp; '''ö''',&nbsp; '''ü'''&nbsp; and&nbsp; '''ß'''.
 
*If the blank spaces&nbsp; $(\rm BS)$&nbsp; are also left out of consideration&nbsp; $($Row 4 &nbsp; ⇒ &nbsp; $M = 30)$, the result is almost the same constellation as Küpfmüller originally considered.&nbsp; The only difference is the rather rare German special characters '''ä''',&nbsp; '''ö''',&nbsp; '''ü'''&nbsp; and&nbsp; '''ß'''.
 +
 
*The&nbsp; $H_1$&ndash;value indicated in the last row&nbsp; $(4.132)$&nbsp; corresponds very well with the value&nbsp; $H_1 ≈ 4.1$&nbsp; determined by Küpfmüller. &nbsp; However, with regard to the&nbsp; $H_3$&ndash;values there are clear differences: &nbsp; Our analysis results in a larger value&nbsp; $(H_3 ≈ 3.4)$&nbsp; than Küpfmüller&nbsp; $(H_3 ≈ 2.8)$.
 
*The&nbsp; $H_1$&ndash;value indicated in the last row&nbsp; $(4.132)$&nbsp; corresponds very well with the value&nbsp; $H_1 ≈ 4.1$&nbsp; determined by Küpfmüller. &nbsp; However, with regard to the&nbsp; $H_3$&ndash;values there are clear differences: &nbsp; Our analysis results in a larger value&nbsp; $(H_3 ≈ 3.4)$&nbsp; than Küpfmüller&nbsp; $(H_3 ≈ 2.8)$.
 +
 
*The frequency of the blank spaces&nbsp; $(17.8\%)$&nbsp; yields an average word length of&nbsp; $1/0.178 - 1 ≈ 4.6$, a smaller value than the one given by Küpfmüller&nbsp; ($5.5$).&nbsp; The discrepancy can be partly explained by our analysis file "Bible"&nbsp; (many spaces due to verse numbering).
 
*The frequency of the blank spaces&nbsp; $(17.8\%)$&nbsp; yields an average word length of&nbsp; $1/0.178 - 1 ≈ 4.6$, a smaller value than the one given by Küpfmüller&nbsp; ($5.5$).&nbsp; The discrepancy can be partly explained by our analysis file "Bible"&nbsp; (many spaces due to verse numbering).
 +
 
*The comparison of rows 3 and 4 is interesting.&nbsp; If&nbsp; $\rm BS$&nbsp; is taken into account,&nbsp; $H_0$&nbsp; increases from&nbsp; $\log_2 \ (30) \approx 4.907$&nbsp; to&nbsp; $\log_2 \ (31) \approx 4.954$,&nbsp; but&nbsp; $H_1$&nbsp; $($by the factor&nbsp; $0.98)$,&nbsp; $H_2$&nbsp; $($by&nbsp; $0.96)$&nbsp; and&nbsp; $H_3$&nbsp; $($by&nbsp; $0.93)$&nbsp; are reduced. Küpfmüller intuitively accounted for this effect with a factor of&nbsp; $85\%$.
 
*The comparison of rows 3 and 4 is interesting.&nbsp; If&nbsp; $\rm BS$&nbsp; is taken into account,&nbsp; $H_0$&nbsp; increases from&nbsp; $\log_2 \ (30) \approx 4.907$&nbsp; to&nbsp; $\log_2 \ (31) \approx 4.954$,&nbsp; but&nbsp; $H_1$&nbsp; $($by the factor&nbsp; $0.98)$,&nbsp; $H_2$&nbsp; $($by&nbsp; $0.96)$&nbsp; and&nbsp; $H_3$&nbsp; $($by&nbsp; $0.93)$&nbsp; are reduced. Küpfmüller intuitively accounted for this effect with a factor of&nbsp; $85\%$.
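
The entropy approximations discussed in the list above can be estimated from a symbol sequence by counting overlapping&nbsp; $k$&ndash;tuples.&nbsp; The following sketch assumes the usual definition of&nbsp; $H_k$&nbsp; as the entropy of the&nbsp; $k$&ndash;tuples divided by&nbsp; $k$; the function names are hypothetical and this is not the program that produced the table.

```python
import math
from collections import Counter
from typing import Sequence


def entropy_approximation(symbols: Sequence[str], k: int) -> float:
    """Estimate H_k in bit/symbol from the relative frequencies of overlapping k-tuples."""
    n = len(symbols) - k + 1  # number of k-tuples in the sequence
    if k < 1 or n < 1:
        raise ValueError("need k >= 1 and at least k symbols")
    counts = Counter(tuple(symbols[i:i + k]) for i in range(n))
    block_entropy = -sum((m / n) * math.log2(m / n) for m in counts.values())
    return block_entropy / k


def mean_word_length(space_fraction: float) -> float:
    """Average word length implied by the relative frequency of the space character."""
    return 1 / space_fraction - 1


text = "in the beginning god created the heaven and the earth"
for k in (1, 2, 3):
    print(k, round(entropy_approximation(text, k), 3))
print(round(mean_word_length(0.178), 1))
```

For such a short sample the values for&nbsp; $k = 2$&nbsp; and&nbsp; $k = 3$&nbsp; are strongly biased downward, which illustrates why the sequence length&nbsp; $N$&nbsp; must be very large compared with the number of possible tuples.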
  
Line 139: Line 160:
 
Although we consider our own study to be rather insignificant, we believe that for today's texts the&nbsp; $1.0 \ \rm bit/character$&nbsp; given by Shannon is somewhat too low for the English language, as is Küpfmüller's&nbsp; $1.3 \ \rm bit/character$&nbsp; for the German language, among other things because:
 
Although we consider our own study to be rather insignificant, we believe that for today's texts the&nbsp; $1.0 \ \rm bit/character$&nbsp; given by Shannon is somewhat too low for the English language, as is Küpfmüller's&nbsp; $1.3 \ \rm bit/character$&nbsp; for the German language, among other things because:
 
*The symbol set size today is larger than that considered by Shannon and Küpfmüller in the 1950s; for example, for the 8 bit ANSI character set&nbsp; $M = 256$.
 
*The symbol set size today is larger than that considered by Shannon and Küpfmüller in the 1950s; for example, for the 8 bit ANSI character set&nbsp; $M = 256$.
 +
 
*The multiple formatting options (underlining, bold and italics, indents, colors) further increase the information content of a document.
 
*The multiple formatting options (underlining, bold and italics, indents, colors) further increase the information content of a document.
  
Line 146: Line 168:
 
The graphic shows artificially generated German and English texts, which are taken from&nbsp; [Küpf54]<ref name ='Küpf54'>Küpfmüller, K.:&nbsp; Die Entropie der deutschen Sprache.&nbsp; Fernmeldetechnische Zeitung 7, 1954, S. 265-272.</ref>.&nbsp; The underlying symbol set size is&nbsp; $M = 27$,&nbsp; that means, all letters&nbsp; (without umlauts and&nbsp; '''ß''')&nbsp; and the space character are considered.
 
The graphic shows artificially generated German and English texts, which are taken from&nbsp; [Küpf54]<ref name ='Küpf54'>Küpfmüller, K.:&nbsp; Die Entropie der deutschen Sprache.&nbsp; Fernmeldetechnische Zeitung 7, 1954, S. 265-272.</ref>.&nbsp; The underlying symbol set size is&nbsp; $M = 27$,&nbsp; that means, all letters&nbsp; (without umlauts and&nbsp; '''ß''')&nbsp; and the space character are considered.
  
[[File:Inf_T_1_3_S4_vers2.png|right|frame|Artificially generated German and English texts]]
+
[[File:EN_Inf_T_1_3_S4_v4.png|right|frame|Artificially generated German and English texts]]
  
 
*The&nbsp; "Zero-order Character Approximation"&nbsp; assumes equally probable characters in each case.&nbsp; There is therefore no difference between German (red) and English (blue).
 
*The&nbsp; "Zero-order Character Approximation"&nbsp; assumes equally probable characters in each case.&nbsp; There is therefore no difference between German (red) and English (blue).
Line 161: Line 183:
  
  
Further information on the synthetic generation of German and English texts can be found in the&nbsp; [[Aufgaben:1.8_Synthetisch_erzeugte_Texte|Exercise 1.8]].
+
Further information on the synthetic generation of German and English texts can be found in the&nbsp; [[Aufgaben:Exercise_1.8:_Synthetically_Generated_Texts|"Exercise 1.8"]].
  
 
   
 
   
 
==Exercises for the chapter==
 
==Exercises for the chapter==
 
<br>
 
<br>
[[Aufgaben:1.7 Entropie natürlicher Texte|Aufgabe 1.7: Entropie natürlicher Texte]]
+
[[Aufgaben:Exercise_1.7:_Entropy_of_Natural_Texts|Exercise 1.7: Entropy of Natural Texts]]
  
[[Aufgaben:1.8 Synthetisch erzeugte Texte|Aufgabe 1.8: Synthetisch erzeugte Texte]]  
+
[[Aufgaben:Exercise_1.8:_Synthetically_Generated_Texts|Exercise 1.8: Synthetically Generated Texts]]  
  
  
==List of sources==
+
==References==
 
<references/>
 
<references/>
  

Latest revision as of 16:35, 14 February 2023

Difficulties with the determination of entropy


Up to now, we have been dealing exclusively with artificially generated symbol sequences.  Now we consider written texts.  Such a text can be seen as a natural discrete message source, which of course can also be analyzed information-theoretically by determining its entropy.

Even today (2011), natural texts are still often represented with the 8 bit character set according to ANSI ("American National Standards Institute"), although there are several "more modern" encodings.

The  $M = 2^8 = 256$  ANSI characters are used as follows:

  • No.  0   to   31:   control commands that cannot be printed or displayed,
  • No.  32   to  127:   identical to the characters of the 7 bit ASCII code,
  • No.  128   to 159:   additional control characters or alphanumeric characters for Windows,
  • No.  160   to   255:   identical to the Unicode charts.


Theoretically, one could also define the entropy here as the limit of the entropy approximation $H_k$ for $k \to \infty$, according to the procedure from the "last chapter". In practice, however, one runs into insurmountable numerical limitations here as well:

  • Already for the entropy approximation $H_2$ there are $M^2 = 256^2 = 65\hspace{0.1cm}536$ possible two-tuples. Thus, the calculation requires the same amount of memory (in bytes). If one assumes that a sufficiently reliable statistic requires on average $100$ occurrences per tuple, the length of the source symbol sequence should already be $N > 6.5 · 10^6$.
  • The number of possible three-tuples is $M^3 > 16 · 10^6$ and thus the required source symbol length is already $N > 1.6 · 10^9$. This corresponds to a book with about $500\hspace{0.1cm}000$ pages at $42$ lines per page and $80$ characters per line.
  • For a natural text the statistical dependencies extend much further than two or three characters. Küpfmüller gives a value of $100$ for the German language. To determine the 100th entropy approximation one needs $2^{800}$ ≈ $10^{240}$ frequencies and, for a reliable statistic, $100$ times more characters.
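
The orders of magnitude in this list can be retraced with a few lines of Python. This is only a sketch: the function name `required_length` and its parameter `per_tuple` are our own, and the $100$ occurrences per tuple are the rule of thumb assumed above.

```python
import math

def required_length(M, k, per_tuple=100):
    """Number of source symbols needed so that each of the M**k
    possible k-tuples occurs per_tuple times on average."""
    return per_tuple * M ** k

CHARS_PER_PAGE = 42 * 80        # lines per page times characters per line

n2 = required_length(256, 2)    # sequence length needed for H_2
n3 = required_length(256, 3)    # sequence length needed for H_3
pages = n3 / CHARS_PER_PAGE     # size of the corresponding book
exp100 = 100 * math.log10(256)  # decimal exponent of 256**100 = 2**800

print(n2, n3, round(pages), round(exp100, 1))
```

The last value shows that $2^{800}$ lies between $10^{240}$ and $10^{241}$.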


A justified question is therefore:   How did  $\text{Karl Küpfmüller}$  determine the entropy of the German language in 1954?  How did  $\text{Claude Elwood Shannon}$  do the same for the English language, even before Küpfmüller?  One thing is revealed beforehand:   Not with the approach described above.


Entropy estimation according to Küpfmüller


Karl Küpfmüller investigated the entropy of German texts in his publication [Küpf54][1]. He made the following assumptions:

  • an alphabet with $26$ letters (no umlauts or punctuation marks),
  • not taking into account the space character,
  • no distinction between upper and lower case.


The maximum average information content is therefore 

$$H_0 = \log_2 (26) ≈ 4.7\ \rm bit/letter.$$

Küpfmüller's estimation is based on the following considerations:

(1) The »first entropy approximation« results from the letter frequencies in German texts. According to a study of 1939, "e" is the most frequent letter with a frequency of $16.7\%$; the rarest is "x" with $0.02\%$. Averaged over all letters we obtain

$$H_1 \approx 4.1\,\, {\rm bit/letter}\hspace{0.05 cm}.$$

(2)  Regarding the  »syllable frequency«  Küpfmüller evaluates the  "Häufigkeitswörterbuch der deutschen Sprache"  (Frequency Dictionary of the German Language), published by  $\text{Friedrich Wilhelm Kaeding}$  in 1898.  He distinguishes between root syllables, prefixes, and final syllables and thus arrives at the average information content of all syllables:

$$H_{\rm syllable} = \hspace{-0.1cm} H_{\rm root} + H_{\rm prefix} + H_{\rm final} + H_{\rm rest} \approx 4.15 + 0.82+1.62 + 2.0 \approx 8.6\,\, {\rm bit/syllable} \hspace{0.05cm}.$$
The following proportions were taken into account:
  • According to the Kaeding study of 1898, the  $400$  most common root syllables  (beginning with "de")  represent $47\%$  of a German text and contribute to the entropy with  $H_{\text{root}} ≈ 4.15 \ \rm bit/syllable$.
  • The contribution of the $242$ most common prefixes (in the first place "ge" with $9\%$) is given by Küpfmüller as $H_{\text{prefix}} ≈ 0.82 \ \rm bit/syllable$.
  • The contribution of the $118$ most used final syllables is $H_{\text{final}} ≈ 1.62 \ \rm bit/syllable$. Most often, "en" appears at the end of words with $30\%$.
  • The remaining $14\%$ is distributed over syllables not yet measured. Küpfmüller assumes that there are $4000$ of them and that they are equally probable. He assumes $H_{\text{rest}} ≈ 2 \ \rm bit/syllable$ for this.
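
The value $H_{\text{rest}} ≈ 2 \ \rm bit/syllable$ can be made plausible with a short calculation. The sketch below rests on our own reading of the assumption in the last bullet (the remaining $14\%$ spread evenly over $4000$ syllables); the source itself only states the rounded result. The variable names are chosen for illustration.

```python
import math

p_rest = 0.14   # share of the text not covered by the tabulated syllables
n_rest = 4000   # assumed number of equally probable remaining syllables

# Each remaining syllable has probability p = p_rest / n_rest, so together
# they contribute n_rest * p * log2(1/p) = p_rest * log2(n_rest / p_rest).
h_rest = p_rest * math.log2(n_rest / p_rest)

# Sum of the four contributions quoted in the text (in bit/syllable).
h_syllable = 4.15 + 0.82 + 1.62 + 2.0

print(round(h_rest, 2), round(h_syllable, 1))
```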


(3) As the average number of letters per syllable Küpfmüller determined the value $3.03$. From this he deduced the »third entropy approximation« regarding the letters:

$$H_3 \approx {8.6}/{3.03}\approx 2.8\,\, {\rm bit/letter}\hspace{0.05 cm}.$$

(4) Küpfmüller's estimation of the entropy approximation $H_3$ is based mainly on the syllable frequencies according to (2) and the mean value of $3.03$ letters per syllable. To obtain a further entropy approximation $H_k$ with larger $k$, Küpfmüller additionally analyzed the words in German texts. He came to the following results:

  • The  $322$  most common words provide an entropy contribution of  $4.5 \ \rm bit/word$.
  • The contributions of the remaining $40\hspace{0.1cm}000$ words were estimated, assuming that the frequencies of rare words are reciprocal to their ordinal number ($\text{Zipf's Law}$).
  • With these assumptions the average information content (related to words) is about   $11 \ \rm bit/word$.
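
How an entropy per word follows from a rank-frequency law can be sketched as follows. Note that this is not Küpfmüller's calculation: as a simplifying assumption of ours, all words (not only the rare ones) are given probabilities proportional to the reciprocal of their rank, and the vocabulary size $322 + 40\hspace{0.1cm}000$ is simply taken from the figures above. The function name `zipf_entropy` is hypothetical.

```python
import math

def zipf_entropy(n_words):
    """Entropy in bit/word if the word with rank r has a probability
    proportional to 1/r (pure Zipf law, a simplifying assumption)."""
    norm = sum(1.0 / r for r in range(1, n_words + 1))
    entropy = 0.0
    for r in range(1, n_words + 1):
        p = 1.0 / (r * norm)
        entropy -= p * math.log2(p)
    return entropy

h_word = zipf_entropy(322 + 40_000)
print(round(h_word, 2), "bit/word")
```

Under this idealization the result is of the same order as the $11 \ \rm bit/word$ quoted above.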


(5) Counting the letters per word resulted in an average of $5.5$. Analogous to point (3), the entropy approximation for $k = 5.5$ was estimated. Küpfmüller gives the value:

$$H_{5.5} \approx {11}/{5.5}\approx 2\,\, {\rm bit/letter}\hspace{0.05 cm}.$$
Of course,  $k$  can only assume integer values,  according to  $\text{its definition}$.  This equation is therefore to be interpreted in such a way that for  $H_5$  a somewhat larger and for  $H_6$  a somewhat smaller value than  $2 \ {\rm bit/letter}$  will result.


(6)  Now you can try to get the final value of entropy for  $k \to \infty$  by extrapolation from these three points  $H_1$,  $H_3$  and  $H_{5.5}$ :

Approximate values of the entropy of the German language according to Küpfmüller
  • The continuous line, taken from Küpfmüller's original work  [Küpf54][1], leads to the final entropy value  $H = 1.6 \ \rm bit/letter$.
  • The green curves are two extrapolation attempts (of a continuous function course through three points) by the $\rm LNTwww$ author.
  • These and the brown arrows are actually only meant to show that such an extrapolation is  (carefully worded)  somewhat vague.
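
How strongly the extrapolated final value depends on the assumed curve shape can also be illustrated numerically. The two model functions below (an exponential and a hyperbolic decay) are our own choices for illustration and are not the curves shown in the graphic; both pass exactly through the three points $H_1$, $H_3$ and $H_{5.5}$.

```python
import math

# The three support points (k, H_k) from Küpfmüller's estimation.
(k1, h1), (k2, h2), (k3, h3) = (1.0, 4.1), (3.0, 2.8), (5.5, 2.0)

def limit_exponential():
    """Limit h of H(k) = h + a*exp(-b*k) through the three points, found by
    bisection on the condition that b is the same on both intervals."""
    def mismatch(h):
        b_left = math.log((h1 - h) / (h2 - h)) / (k2 - k1)
        b_right = math.log((h2 - h) / (h3 - h)) / (k3 - k2)
        return b_left - b_right
    lo, hi = 0.0, 1.99            # mismatch(lo) > 0 > mismatch(hi)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if mismatch(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def limit_hyperbolic():
    """Limit h of H(k) = h + a/(k + c) through the three points.
    Eliminating a and c leaves one linear equation for h."""
    a1, d1, e1 = h2 * k2 - h1 * k1, k2 - k1, h1 - h2
    a2, d2, e2 = h3 * k3 - h2 * k2, k3 - k2, h2 - h3
    return (e2 * a1 - e1 * a2) / (e2 * d1 - e1 * d2)

print(round(limit_exponential(), 2), round(limit_hyperbolic(), 2))
```

With these two assumed models, the exponential curve ends at roughly $1.3 \ \rm bit/letter$, while the hyperbolic curve ends slightly below zero, which is meaningless for an entropy. Küpfmüller's own curve ends at $1.6 \ \rm bit/letter$.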


(7) Küpfmüller then tried to verify the final value $H = 1.6 \ \rm bit/letter$ found with this first estimation using a completely different methodology (see next section). After this estimation he revised his result slightly to

$$H = 1.51 \ \rm bit/letter.$$


(8) Three years earlier, following a completely different approach, Claude E. Shannon had given the entropy value $H ≈ 1 \ \rm bit/letter$ for the English language, but taking into account the space character. In order to be able to compare his results with Shannon's, Küpfmüller subsequently included the space character in his result.

  • The correction factor is the quotient of the average word length without considering the space  $(5.5)$  and the average word length with consideration of the space  $(5.5+1 = 6.5)$.
  • This correction led to Küpfmüller's final result:
$$H =1.51 \cdot {5.5}/{6.5}\approx 1.3\,\, {\rm bit/letter}\hspace{0.05 cm}.$$
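
The per-letter conversions in points (3), (5) and (8) all follow the same pattern, which the following lines retrace. The variable names are ours; the numbers are those quoted in the text.

```python
h_syllable, letters_per_syllable = 8.6, 3.03   # bit/syllable, letters/syllable
h_word, letters_per_word = 11.0, 5.5           # bit/word, letters/word

h3 = h_syllable / letters_per_syllable         # point (3)
h55 = h_word / letters_per_word                # point (5)

# Point (8): each word is followed by one space, so the same information
# is spread over 5.5 + 1 characters instead of 5.5 letters.
h_with_space = 1.51 * letters_per_word / (letters_per_word + 1)

print(round(h3, 2), round(h55, 2), round(h_with_space, 2))
```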


A further entropy estimation by Küpfmüller


For the sake of completeness, Küpfmüller's considerations are presented here, which led him to the final result  $H = 1.51 \ \rm bit/letter$.    Since there was no documentation for the statistics of word groups or whole sentences, he estimated the entropy value of the German language as follows:

  1. Any contiguous German text is covered up after a certain word. The preceding text is read, and the reader tries to determine the following word from the context of the preceding text.
  2. For a large number of such attempts, the percentage of hits gives a measure of the relationships between words and sentences.  It can be seen that for one and the same type of text (novels, scientific writings, etc.) by one and the same author, a constant final value of this hit ratio is reached relatively quickly  (about one hundred to two hundred attempts).
  3. The hit ratio, however, depends quite strongly on the type of text.  For different texts, values between  $15\%$  and  $33\%$  are obtained, with the mean value at  $22\%$.  This also means:   On average,  $22\%$  of the words in a German text can be determined from the context.
  4. Alternatively: The word count of a long text can be reduced by the factor $0.78$ without a significant loss of the message content of the text. Starting from the reference value $H_{5.5} = 2 \ \rm bit/letter$ $($see point (5) in the last section$)$ for a word of medium length, this results in the entropy $H ≈ 0.78 · 2 = 1.56 \ \rm bit/letter$.
  5. Küpfmüller verified this value with a comparable empirical study regarding the syllables and thus determined the reduction factor $0.54$ (regarding syllables). Küpfmüller gives $H = 0.54 · H_3 ≈ 1.51 \ \rm bit/letter$ as the final result, where $H_3 ≈ 2.8 \ \rm bit/letter$ corresponds to the entropy of a syllable of medium length $($about three letters, see point (3) in the last section$)$.


The remarks in this and the previous section, which may be perceived as very critical, are not intended to diminish the importance of either Küpfmüller's entropy estimation or Shannon's contributions to the same topic.

  • They are only meant to point out the great difficulties that arise in this task.
  • This is perhaps also the reason why no one has dealt with this problem intensively since the 1950s.


Some own simulation results


Karl Küpfmüller's data regarding the entropy of the German language will now be compared with some (very simple) simulation results that were worked out by the author of this chapter (Günter Söder) at the Department of Communications Engineering at the Technical University of Munich as part of an internship for students.  The results are based on the German Bible in ASCII format with  $N \approx 4.37 \cdot 10^6$  characters. This corresponds to a book with  $1300$  pages at  $42$  lines per page and  $80$  characters per line.


The symbol set size has been reduced to $M = 33$ and includes the characters a, b, c, ... , x, y, z, ä, ö, ü, ß, $\rm BS$, $\rm DI$, $\rm PM$. Our analysis did not differentiate between upper and lower case letters. In contrast to Küpfmüller's analysis, we also took into account:

  1. The German umlauts  äöü  and  ß, which make up about  $1.2\%$  of the biblical text,
  2. the class  "Digits"   ⇒   $\rm DI$  with about  $1.3\%$  because of the verse numbering within the bible,
  3. the class  "Punctuation Marks"   ⇒   $\rm PM$  with about  $3\%$,
  4. the class  "Blank Space"   ⇒   $\rm BS$  as the most common character  $(17.8\%)$, even more than the "e"  $(12.8\%)$.


The following table summarizes the results: $N$ indicates the analyzed file size in characters (bytes). The decision content $H_0$ as well as the entropy approximations $H_1$, $H_2$ and $H_3$ were each determined from $N$ characters and are each given in "bit/character".

Entropy values (in bit/character) of the German Bible


  • Please do not consider these results to be scientific research.
  • It is only an attempt to give students an understanding of the subject matter in an internship.
  • The basis of this study was the Bible, since we had both its German and English versions available to us in the appropriate ASCII format.
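
The kind of evaluation behind such a table can be sketched in a few lines. This is not the program used in the study: the function names are hypothetical, the class assignment (every character that is neither a letter, a digit nor white space counts as punctuation) is our simplification, and $H_k$ is taken as the entropy of the $k$-tuples divided by $k$, which we assume to be the definition from the previous chapter.

```python
import math
from collections import Counter

LETTERS = set("abcdefghijklmnopqrstuvwxyzäöüß")

def map_symbol(ch):
    """Map a character to the reduced alphabet with M = 33 symbols."""
    ch = ch.lower()
    if ch in LETTERS:
        return ch
    if ch.isdigit():
        return "DI"          # class "Digits"
    if ch.isspace():
        return "BS"          # class "Blank Space"
    return "PM"              # everything else: class "Punctuation Marks"

def entropy_approximation(symbols, k):
    """k-th entropy approximation in bit/character, estimated from the
    relative frequencies of all overlapping k-tuples."""
    tuples = Counter(tuple(symbols[i:i + k])
                     for i in range(len(symbols) - k + 1))
    total = sum(tuples.values())
    h_tuple = sum(c / total * math.log2(total / c) for c in tuples.values())
    return h_tuple / k

text = "Am Anfang schuf Gott Himmel und Erde. " * 50   # placeholder text
symbols = [map_symbol(ch) for ch in text]
for k in (1, 2, 3):
    print(k, round(entropy_approximation(symbols, k), 3))
```

For meaningful values the placeholder string has to be replaced by a long text, since the number of possible tuples grows as $M^k$.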


The results of the above table can be summarized as follows:

  • In all rows the entropy approximations $H_k$ decrease monotonically with increasing $k$. The decrease is convex, that means: $H_1 - H_2 > H_2 - H_3$. The extrapolation of the final value $(k \to \infty)$ from the three entropy approximations determined in each case is not possible (or only extremely vaguely).
  • If the evaluation of the digits $\rm (DI)$ and additionally the evaluation of the punctuation marks $\rm (PM)$ is omitted, the approximations $H_1$ $($by $0.114)$, $H_2$ $($by $0.063)$ and $H_3$ $($by $0.038)$ decrease. The omission of digits and punctuation will probably have little effect on the final entropy $H$ as the limit value of $H_k$ for $k \to \infty$.
  • If one also leaves the blank spaces $(\rm BS)$ out of consideration $($Row 4 ⇒ $M = 30)$, the result is almost the same constellation as Küpfmüller originally considered. The only difference is the rather rare German special characters ä, ö, ü and ß.
  • The $H_1$ value indicated in the last row $(4.132)$ corresponds very well with the value $H_1 ≈ 4.1$ determined by Küpfmüller. However, with regard to the $H_3$ values there are clear differences: Our analysis results in a larger value $(H_3 ≈ 3.4)$ than Küpfmüller's $(H_3 ≈ 2.8)$.
  • From the frequency of the blank spaces $(17.8\%)$ an average word length of $1/0.178 - 1 ≈ 4.6$ results here, a smaller value than the $5.5$ given by Küpfmüller. The discrepancy can be partly explained by our analysis file "Bible" (many spaces due to verse numbering).
  • The comparison of rows 3 and 4 is interesting. If $\rm BS$ is taken into account, $H_0$ increases from $\log_2 \ (30) \approx 4.907$ to $\log_2 \ (31) \approx 4.954$, but $H_1$ $($by the factor $0.98)$, $H_2$ $($by $0.96)$ and $H_3$ $($by $0.93)$ are thereby reduced. Küpfmüller intuitively took this factor into account with $85\%$.
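
The word length and the decision contents used in this summary can be retraced as follows. The sketch assumes that every word is followed by exactly one blank space; the function name `average_word_length` is our own.

```python
import math

def average_word_length(p_space):
    """Average word length if every word is followed by exactly one space,
    so that a word plus its space occupies 1/p_space characters on average."""
    return 1.0 / p_space - 1.0

# Bible text: 17.8 % of all characters are blank spaces.
print(round(average_word_length(0.178), 1))

# Conversely, 5.5 letters per word would imply this share of spaces.
print(round(1.0 / (5.5 + 1.0), 3))

# Decision content H_0 = log2(M) for the symbol set sizes in the text.
for M in (33, 31, 30):
    print(M, round(math.log2(M), 3))
```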


Although we consider our own study to be rather insignificant, we believe that for today's texts the $1.0 \ \rm bit/character$ given by Shannon for the English language is somewhat too low, as is Küpfmüller's $1.3 \ \rm bit/character$ for the German language, among other things because:

  • The symbol set size today is larger than that considered by Shannon and Küpfmüller in the 1950s; for example, for the 8 bit ANSI character set $M = 256$.
  • The multiple formatting options (underlining, bold and italics, indents, colors) further increase the information content of a document.


Synthetically generated texts


The graphic shows artificially generated German and English texts, which are taken from  [Küpf54][1].  The underlying symbol set size is  $M = 27$,  that means, all letters  (without umlauts and  ß)  and the space character are considered.

Artificially generated German and English texts
  • The  "Zero-order Character Approximation"  assumes equally probable characters in each case.  There is therefore no difference between German (red) and English (blue).


  • The  "First-order Character Approximation"  already considers the different frequencies, the higher order approximations also the preceding characters.


  • In the  "Fourth-order Character Approximation"  one can already recognize meaningful words.  Here the probability for a new letter depends on the last three characters.


  • The  "First-order Word Approximation"  synthesizes sentences according to the word probabilities.  The  "Second-order Word Approximation"  also considers the previous word.


Further information on the synthetic generation of German and English texts can be found in the  "Exercise 1.8".
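
The principle of the character approximations can be sketched with a small generator. This is a minimal illustration under our own assumptions, not the procedure used for the graphic: the function name `synthesize` is hypothetical, the statistics are taken from a sample text passed by the caller, and an approximation of order $n$ is read, as described above, as "the next character depends on the previous $n-1$ characters".

```python
import random
from collections import Counter, defaultdict

def synthesize(sample, order, length, seed=0):
    """Generate `length` characters in the style of `sample`.
    order 0: equally probable characters; order n >= 1: the next
    character depends on the previous n - 1 characters."""
    rng = random.Random(seed)
    if order == 0:
        alphabet = sorted(set(sample))
        return "".join(rng.choice(alphabet) for _ in range(length))

    ctx_len = order - 1
    successors = defaultdict(Counter)
    for i in range(ctx_len, len(sample)):
        successors[sample[i - ctx_len:i]][sample[i]] += 1
    fallback = Counter(sample)          # used if a context was never continued

    out = list(sample[:ctx_len])        # start with the beginning of the sample
    while len(out) < length:
        context = "".join(out[len(out) - ctx_len:])
        counts = successors.get(context) or fallback
        chars, weights = zip(*counts.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out[:length])

sample = "der hund und die hand "      # placeholder, far too short in practice
for order in (0, 1, 4):
    print(order, repr(synthesize(sample, order, 40)))
```

With a long German or English sample text in place of the placeholder, increasing the order should make the output look progressively more like the language of the sample.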


Exercises for the chapter


Exercise 1.7: Entropy of Natural Texts

Exercise 1.8: Synthetically Generated Texts


References

  1. Küpfmüller, K.: Die Entropie der deutschen Sprache. Fernmeldetechnische Zeitung 7, 1954, pp. 265-272.