Information Theory/Natural Discrete Sources
Difficulties with the determination of entropy
Up to now, we have dealt exclusively with artificially generated symbol sequences. Now we consider written texts. Such a text can be regarded as a natural discrete-valued message source, which of course can also be analyzed information-theoretically by determining its entropy.
Even today (2011), natural texts are often still represented with the 8-bit character set of ANSI (American National Standards Institute), although several "more modern" encodings exist.
The M = 2⁸ = 256 ANSI characters are used as follows:
- No. 0 to 31: control commands that cannot be printed or displayed,
- No. 32 to 127: identical to the characters of the 7 bit ASCII code,
- No. 128 to 159: additional control characters or alphanumeric characters for Windows,
- No. 160 to 255: identical to the Unicode charts.
Theoretically, one could again define the entropy here as the limit of the entropy approximations Hk for k→∞, following the procedure from the last chapter. In practice, however, one runs into insurmountable numerical limits here as well (see the sketch after the following list):
- Already for the entropy approximation H2 there are M² = 256² = 65536 possible two-tuples, so the frequency count alone requires this many memory locations (bytes). If one assumes that each tuple should occur 100 times on average for sufficiently reliable statistics, the length of the source symbol sequence should already be N > 6.5·10⁶.
- The number of possible three-tuples is M³ = 256³ ≈ 1.7·10⁷, and thus the required source symbol length is already N > 1.6·10⁹. At 42 lines per page and 80 characters per line, this corresponds to a book with about 500000 pages.
- For a natural text the statistical ties extend much further than two or three characters. Küpfmüller quotes a value of 100 for the German language. To determine the hundredth entropy approximation one would therefore need 256¹⁰⁰ = 2⁸⁰⁰ ≈ 10²⁴⁰ frequencies, and for reliable statistics a hundred times as many characters.
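To make the procedure concrete, the following minimal Python sketch estimates the entropy approximations Hk = H(k-tuple)/k from relative tuple frequencies, in line with the definition from the last chapter. The sample string is only a stand-in for a long natural text.

```python
from collections import Counter
from math import log2

def entropy_approximation(text: str, k: int) -> float:
    """k-th entropy approximation Hk = H(k-tuple)/k in bit/character,
    estimated from the relative frequencies of all k-tuples in the text."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values()) / k

sample = "ein beispieltext " * 1000   # stand-in for a long natural text
for k in (1, 2, 3):
    print(f"H{k} ≈ {entropy_approximation(sample, k):.3f} bit/character")
```

For M = 256 and k = 3 the frequency table would already need up to 256³ ≈ 1.7·10⁷ entries, and reliable estimates would require text lengths of the order quoted above, which is exactly the numerical limitation at issue.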
A justified question is therefore: How did Karl Küpfmüller determine the entropy of the German language in 1954? How did Claude Elwood Shannon accomplish the same for the English language even earlier? One thing can be revealed in advance: not with the approach described above.
Entropy estimation according to Küpfmüller
Karl Küpfmüller investigated the entropy of German texts in his publication [Küpf54][1]. His estimation is based on the following assumptions:
- an alphabet with 26 letters (no umlauts or punctuation marks),
- no consideration of the space character,
- no distinction between upper and lower case.
The decision content is therefore H0 = log2(26) ≈ 4.7 bit/letter.
Küpfmüller's estimation is based on the following considerations:
(1) The first entropy approximation results from the letter frequencies in German texts. According to a study from 1939, "e" is the most frequent letter with 16.7%, while "x" is the rarest with 0.02%. Averaging over all letters according to H1 = −Σ pi · log2(pi) yields H1 ≈ 4.1 bit/letter.
(2) Regarding the syllable frequencies, Küpfmüller evaluated the "Häufigkeitswörterbuch der deutschen Sprache" (Frequency Dictionary of the German Language), published by Friedrich Wilhelm Kaeding in 1898. He distinguished between root syllables, prefixes and ending syllables and thus arrived at the average information content of all syllables:
- Hsyllable = HRoot + HPre + HEnd + HRest ≈ 4.15 + 0.82 + 1.62 + 2.0 ≈ 8.6 bit/syllable.
- The following proportions were taken into account:
- According to the Kaeding count of 1898, the 400 most common root syllables (led by "de") make up 47% of a German text and contribute HRoot ≈ 4.15 bit/syllable to the entropy.
- The contribution of the 242 most common prefixes, first among them "ge" with 9%, is put by Küpfmüller at HPre ≈ 0.82 bit/syllable.
- The contribution of the 118 most used ending syllables is HEnd ≈ 1.62 bit/syllable; the most frequent, "en", appears at the end of 30% of all words.
- The remaining 14% are distributed over syllables not covered by these counts. Küpfmüller assumes that there are 4000 of them and that they are equally probable, and sets HRest ≈ 2 bit/syllable for this contribution (indeed, 0.14 · log2(4000/0.14) ≈ 2.07).
(3) Küpfmüller determined 3.03 as the average number of letters per syllable. From this he deduced the third entropy approximation with regard to the letters:
- H3 ≈ 8.6/3.03 ≈ 2.8 bit/letter.
(4) Küpfmüller's estimation of the entropy approximation H3 was based mainly on the syllable frequencies according to (2) and the mean value of 3.03 letters per syllable. To obtain a further entropy approximation Hk with larger k, Küpfmüller additionally analyzed the words in German texts. He came to the following results:
- The 322 most common words provide an entropy contribution of 4.5 bit/word.
- The contributions of the remaining 40000 words were estimated, assuming that the frequencies of rare words are inversely proportional to their rank (Zipf's law).
- With these assumptions, the average information content (with respect to words) is about 11 bit/word (see the sketch after this list).
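The order of magnitude of this value can be made plausible with a small sketch. Assuming, purely for illustration, a vocabulary of 40000 words whose probabilities follow an ideal Zipf law p(r) ∝ 1/r (this is not Küpfmüller's actual calculation, which weighted the 322 most common words with their measured frequencies), the average information content per word comes out in the same range:

```python
from math import log2

V = 40_000                                  # vocabulary size assumed above
Z = sum(1.0 / r for r in range(1, V + 1))   # normalization of the 1/r law
probs = (1.0 / (r * Z) for r in range(1, V + 1))
H_word = -sum(p * log2(p) for p in probs)
print(f"H ≈ {H_word:.1f} bit/word")         # roughly 10.7 bit/word
```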
(5) The count of "letters per word" resulted in an average of 5.5. Analogous to point (3), Küpfmüller obtains the entropy approximation for k = 5.5 as H5.5 ≈ 11/5.5 ≈ 2 bit/letter. Of course, k can only assume integer values according to its definition. This result is therefore to be interpreted in such a way that H5 will be somewhat larger and H6 somewhat smaller than 2 bit/letter.
(6) Now one can try to obtain the final entropy value for k→∞ by extrapolation from these three points (H1 ≈ 4.1, H3 ≈ 2.8 and H5.5 ≈ 2 bit/letter):
[Figure: Approximate values of the entropy of the German language according to Küpfmüller]
- The continuous line, taken from Küpfmüller's original work [Küpf54][1], leads to the final entropy value H=1.6 bit/letter.
- The green curves are two extrapolation attempts (continuous curves through the three points) by the LNTwww author.
- These and the brown arrows are only meant to show that such an extrapolation is, carefully worded, somewhat vague. The sketch below makes the same point numerically.
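To illustrate how model-dependent such an extrapolation is, the following sketch fits one plausible model, Hk = H∞ + a·qᵏ, exactly through the three points and reads off H∞. The choice of model is our assumption, not Küpfmüller's method; other functional forms fitted to the same points extrapolate to noticeably different final values.

```python
pts = [(1, 4.1), (3, 2.8), (5.5, 2.0)]      # (k, Hk) from the estimates above

def residual(q: float) -> float:
    # For Hk = H_inf + a*q**k, the ratio of successive differences fixes q.
    (k1, h1), (k2, h2), (k3, h3) = pts
    return (h1 - h2) * (q**k2 - q**k3) - (h2 - h3) * (q**k1 - q**k2)

lo, hi = 1e-6, 1.0 - 1e-6                   # bisect for the root q in (0, 1)
for _ in range(100):
    mid = (lo + hi) / 2
    if residual(lo) * residual(mid) <= 0:
        hi = mid
    else:
        lo = mid
q = (lo + hi) / 2
(k1, h1), (k2, h2), _ = pts
a = (h1 - h2) / (q**k1 - q**k2)
print(f"H ≈ {h1 - a * q**k1:.2f} bit/letter")   # about 1.35 for this model
```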
(7) Küpfmüller then tried to verify the final value H = 1.6 bit/letter of this first estimation with a completely different methodology (see next section). Following that estimation, he revised his result slightly to H = 1.51 bit/letter.
(8) Three years earlier, using a completely different approach, Claude E. Shannon had given the entropy value H ≈ 1 bit/letter for the English language, taking the space character into account. In order to be able to compare his result with Shannon's, Küpfmüller subsequently included the space character as well.
- The correction factor is the quotient of the average word length without considering the space (5.5) and the average word length with consideration of the space (5.5+1=6.5).
- This correction led to Küpfmüller's final result H = 1.51 · 5.5/6.5 ≈ 1.3 bit/letter.
A further entropy estimation by Küpfmüller
For the sake of completeness, the considerations that led Küpfmüller to the final result H = 1.51 bit/letter are presented here. Since no documentation on the statistics of word groups or whole sentences was available, he estimated the entropy of the German language as follows:
- An arbitrary contiguous German text is covered up from a certain word onward. The preceding text is read, and the reader tries to determine the following word from the context of what came before.
- For a large number of such attempts, the percentage of hits gives a measure of the ties between words and sentences. It turns out that for one and the same type of text (novels, scientific writing, etc.) by one and the same author, a stable final value of this hit ratio is reached relatively quickly (after about one hundred to two hundred attempts).
- The hit ratio, however, depends quite strongly on the type of text. For different texts, values between 15% and 33% are obtained, with a mean value of 22%. This also means: on average, 22% of the words in a German text can be determined from the context.
- Put differently: the word count of a long text can be reduced by the factor 0.78 without significant loss of its message content. Starting from the reference value H5.5 = 2 bit/letter for a word of average length (see point (5) in the last section), this results in the entropy H ≈ 0.78 · 2 = 1.56 bit/letter.
- Küpfmüller verified this value with a comparable empirical study regarding syllables, from which he determined the reduction factor 0.54 (with respect to syllables). As final result he obtained H = 0.54 · H3 ≈ 1.51 bit/letter, where H3 ≈ 2.8 bit/letter corresponds to the entropy of a syllable of average length (about three letters, see point (3) in the last section).
The remarks on this and the previous page, which may come across as very critical, are not intended to diminish the importance of either Küpfmüller's entropy estimation or Shannon's contributions to the same topic.
- They are only meant to point out the great difficulties that arise in this task.
- This is perhaps also the reason why no one has dealt with this problem intensively since the 1950s.
Some own simulation results
The information given by Karl Küpfmüller regarding the entropy of the German language shall now be compared with some (very simple) simulation results developed by the author of this chapter (Günter Söder) at the Chair of Communications Engineering of the Technical University of Munich in the course of an internship. The results are based on:
- the Windows program WDIT ⇒ the link refers to the ZIP version of the program;
- the associated practical training manual Wertdiskrete Informationstheorie (Value Discrete Information Theory). ⇒ the link refers to the PDF version;
- the German Bible in ASCII format with N ≈ 4.37·10⁶ characters; this corresponds to a book with about 1300 pages at 42 lines per page and 80 characters per line.
The symbol set has been reduced to M = 33 and comprises the characters a, b, c, ..., x, y, z, ä, ö, ü, ß, LZ, ZI, IP. Our analysis did not differentiate between upper and lower case letters.
In contrast to Küpfmüller's analysis, we also took into account:
- the German umlauts ä, ö, ü and the letter ß, which together make up about 1.2% of the biblical text,
- the punctuation class IP (Interpunktion) with about 3%,
- the digit class ZI (Ziffer) with about 1.3%, owing to the verse numbering within the Bible,
- the space LZ (Leerzeichen) as the most common character (17.8%), even more frequent than the "e" (12.8%).
The following table summarizes the results; a sketch of the analysis procedure follows the list below. N indicates the analyzed file size in characters (bytes). The decision content H0 as well as the entropy approximations H1, H2 and H3 were each determined from these N characters and are given in "bit/character".
- Please do not consider these results to be scientific research.
- It is only an attempt to give students an understanding of the subject matter in an internship.
- The basis of this study was the Bible, since we had both its German and English versions available to us in the appropriate ASCII format.
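A minimal sketch of such an analysis, assuming a hypothetical input file bibel.txt and an assumed set of punctuation characters: the raw text is first mapped to the M = 33 symbol classes described above, then the entropy approximations are estimated from tuple frequencies.

```python
from collections import Counter
from math import log2

LETTERS = set("abcdefghijklmnopqrstuvwxyzäöüß ")   # 26 letters, ä ö ü ß, space LZ

def reduce_alphabet(raw: str) -> str:
    """Map the raw text to M = 33 classes: a..z, ä, ö, ü, ß, space (LZ),
    one digit class (ZI) and one punctuation class (IP)."""
    out = []
    for ch in raw.lower():                 # no upper/lower case distinction
        if ch in LETTERS:
            out.append(ch)
        elif ch.isdigit():
            out.append("0")                # all digits collapse into class ZI
        elif ch in ".,;:!?()'\"-":         # assumed punctuation set
            out.append(".")                # all punctuation collapses into class IP
    return "".join(out)                    # anything else (line breaks etc.) is dropped

def h_k(text: str, k: int) -> float:
    """k-th entropy approximation Hk = H(k-tuple)/k in bit/character."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values()) / k

text = reduce_alphabet(open("bibel.txt", encoding="utf-8").read())  # hypothetical file
print(f"H0 = {log2(33):.3f} bit/character")
for k in (1, 2, 3):
    print(f"H{k} = {h_k(text, k):.3f} bit/character")
```

The actual analysis was carried out with the WDIT program mentioned above; details such as the exact punctuation set and the cleaning of the raw file influence such estimates slightly.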
The results of the above table can be summarized as follows:
- In all rows the entropy approximations Hk decrease monotonically with increasing k. The decrease is convex, that is, H1 − H2 > H2 − H3. Extrapolating the final value (k→∞) from the three entropy approximations determined in each case is not possible, or only extremely vaguely.
- If the digits (ZI, line 2 ⇒ M = 32) and additionally the punctuation marks (IP, line 3 ⇒ M = 31) are omitted from the evaluation, the entropy approximations H1 (by 0.114), H2 (by 0.063) and H3 (by 0.038) decrease. The omission of digits and punctuation will probably have little effect on the final value H as the limit of Hk for k→∞.
- If one leaves the space (LZ, line 4 ⇒ M = 30) out of consideration, the result is almost the same constellation as Küpfmüller originally considered. The only difference is the rather rare German special characters ä, ö, ü and ß.
- The H1 value (4.132) indicated in the last row agrees very well with the value H1 ≈ 4.1 determined by Küpfmüller. With regard to the H3 values, however, there are clear differences: our analysis yields H3 ≈ 3.4, while Küpfmüller gives H3 ≈ 2.8 (all values in bit/letter).
- From the frequency of the space (17.8%) an average word length of 1/0.178 − 1 ≈ 4.6 results here (if a fraction p of all characters are spaces and each word is followed by exactly one space, the average word length is 1/p − 1), a smaller value than the 5.5 given by Küpfmüller. The discrepancy can be partly explained by our analysis file "Bible" (many spaces due to the verse numbering).
- The comparison of lines 3 and 4 is interesting: if the space is taken into account, H0 increases from log2(30) ≈ 4.907 to log2(31) ≈ 4.954, but H1 (by the factor 0.98), H2 (by 0.96) and H3 (by 0.93) decrease. Küpfmüller intuitively accounted for this factor with 85%.
Although we consider our own investigation to be of minor significance, we believe that for today's texts the 1.0 bit/letter given by Shannon for the English language, and likewise Küpfmüller's 1.3 bit/letter for the German language, are somewhat too low, among other things because
- the symbol set today is larger than the one considered by Shannon and Küpfmüller in the 1950s; for example, M = 256 for the ANSI character set,
- the multiple formatting options (underlining, bold and italics, indents, colors) further increase the information content of a document.
Synthetically generated texts
The graphic shows artificially generated German and English texts taken from [Küpf54][1]. The underlying symbol set is M = 27, that is, all letters (without umlauts and ß) plus the space character are considered.
- The zero-order letter approximation assumes equally probable characters. There is therefore no difference between German (red) and English (blue).
- The first-order letter approximation already takes the different letter frequencies into account; the higher-order approximations also consider the preceding characters.
- In the fourth-order synthesis one can already recognize meaningful words; here the probability of a new letter depends on the last three characters.
- The first-order word approximation synthesizes sentences according to the word probabilities alone; the second-order word approximation also considers the previous word. A sketch of such a synthesis follows below.
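Such letter approximations can be reproduced with a simple Markov synthesis. The following sketch is our illustration, not Küpfmüller's original procedure: each new character is drawn according to its conditional frequency given the preceding characters; with order=3 it corresponds to the fourth-order letter approximation described above. The training file name is a placeholder.

```python
import random
from collections import Counter, defaultdict

def synthesize(text: str, order: int, length: int = 300) -> str:
    """(order+1)-th order letter approximation: each new character is drawn
    from its conditional frequency given the last `order` characters."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    state = text[:order]
    out = list(state)
    for _ in range(length):
        followers = model.get(state)
        if not followers:                  # unseen context: restart
            state = text[:order]
            continue
        chars, weights = zip(*followers.items())
        nxt = random.choices(chars, weights=weights)[0]
        out.append(nxt)
        state = state[1:] + nxt
    return "".join(out)

training = open("corpus.txt", encoding="utf-8").read().lower()  # placeholder corpus
print(synthesize(training, order=3))       # fourth-order letter approximation
```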
Further information on the synthetic generation of German and English texts can be found in Exercise 1.8.
Exercises for the chapter
Exercise 1.7: Entropy of Natural Texts
Exercise 1.8: Synthetically Generated Texts
References
1. Küpfmüller, K.: Die Entropie der deutschen Sprache. Fernmeldetechnische Zeitung 7, 1954, pp. 265-272.