Difference between revisions of "Aufgaben:Exercise 2.4: About the LZW Algorithm"
(Die Seite wurde neu angelegt: „ {{quiz-Header|Buchseite=Informationstheorie und Quellencodierung/Komprimierung nach Lempel, Ziv und Welch }} right| :Der Ko…“) |
|||
(34 intermediate revisions by 5 users not shown) | |||
Line 1: | Line 1: | ||
− | {{quiz-Header|Buchseite= | + | {{quiz-Header|Buchseite=Information_Theory/Compression_According_to_Lempel,_Ziv_and_Welch |
}} | }} | ||
− | [[File: | + | [[File:EN_Inf_A_2_4.png|right|frame|LZW dictionary for Encoding/Decoding]] |
− | : | + | The compression algorithm "LZW" – named after the inventors [https://en.wikipedia.org/wiki/Abraham_Lempel Abraham Lempel], [https://en.wikipedia.org/wiki/Jacob_Ziv Jacob Ziv] and [https://en.wikipedia.org/wiki/Terry_Welch Terry Welch] – works like "LZ78" with a global dictionary. In this case, all possible characters – in the example <b>A</b>, <b>B</b>, <b>C</b>, <b>D</b> – are preassigned at the beginning and are successively expanded during encoding. |
− | + | During decoding, exactly the same dictionary is created, only the same entry with index $I$ is made one step later than during encoding. To designate the decoding steps, we use the variable $i$. | |
− | : | + | Here are some more notes on LZW encoding and decoding: |
+ | *During encoding, at each time $i$, a search is made in the dictionary for the longest possible character string that matches the input string currently present. | ||
+ | *The dictionary index $I_i$ found is always transmitted in binary form. At the same time, $W(I_i) + Z$ is entered into the dictionary under the next free index. | ||
+ | *Here, $W(I_i)$ denotes a character or a character string, and $Z$ is the first character of the upcoming input string (i.e. also a character), which is no longer considered in $W(I_i)$ . | ||
+ | *With $M = 4$ possible characters, the first index $I_1$ is transmitted with two bits, the indices $I_2$, ... , $I_5$ with three bits, the next eight indices with four bits, then 16 indices with five bits, etc. The reason for this bit-saving measure can be found in the solution to this task. | ||
− | |||
− | : | + | After coding in the manner described here over $16$ coding steps, the following binary sequence of length $N_{\rm bit} = 61$: |
+ | :$$\boldsymbol{\rm 00 \ 000 \ 001 \ 010 \ 100 \ 0001 \ 1000 \ 0111 \ 0001 \ 0001 \ 0011 \ 1011 \ 1011 \ 01101 \ 00011 \ 01001}\hspace{0.05cm}.$$ | ||
− | : | + | The task of the decoder (more precisely: your task) is now, |
+ | * to reconstruct from this binary sequence the indices $I_1$, ... , $I_{16}$ taking into account the different number of bits (description of the 16 decoding results by decimal numbers), | ||
+ | * then read out the corresponding characters or character sequences from the dictionary according to the indices and finally – with one step delay – generate the new dictionary entry. | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | === | + | |
+ | Hints: | ||
+ | *The exercise belongs to the chapter [[Information_Theory/Komprimierung_nach_Lempel,_Ziv_und_Welch|Compression according to Lempel, Ziv and Welch]]. | ||
+ | *In particular, reference is made to the pages | ||
+ | :: [[Information_Theory/Komprimierung_nach_Lempel,_Ziv_und_Welch#LZ77_-_the_basic_form_of_the_Lempel-Ziv_algorithms|LZ77 - the basic form of Lempel-Ziv algorithms]], | ||
+ | :: [[Information_Theory/Komprimierung_nach_Lempel,_Ziv_und_Welch#Lempel-Ziv_coding_with_variable_index_bit_length|Lempel-Ziv coding with variable index bit length]], | ||
+ | :: [[Information_Theory/Komprimierung_nach_Lempel,_Ziv_und_Welch#Decoding_of_the_LZW_algorithm|Decoding of the LZW algorithm]]. | ||
+ | |||
+ | * [[Aufgaben:Exercise_2.3:_About_the_LZ78_Compression|Exercise 2.3]] and [[Aufgaben:Exercise_2.3Z:_About_the_LZ77_Coding|Exercise 2.3Z]] deal with other Lempel-Ziv methods in a similar way. | ||
+ | |||
+ | |||
+ | |||
+ | ===Questions=== | ||
<quiz display=simple> | <quiz display=simple> | ||
− | { | + | {Which indices are available for the first four steps $(i=1$, ... , $i=4)$ for decoding. Enter all $I_i = I(i)$ as decimal numbers. |
|type="{}"} | |type="{}"} | ||
− | $ | + | $I_1 \ = \ $ { 0. } |
− | $ | + | $I_2 \ = \ $ { 0. } |
− | $ | + | $I_3 \ = \ $ { 1 } |
− | $ | + | $I_4 \ = \ $ { 2 } |
− | { | + | {Continue the subdivision of the decoder input string to the end. Which indices result from the decoding steps $i=5$, ... , $i=8$? |
|type="{}"} | |type="{}"} | ||
− | $ | + | $I_5 \ = \ $ { 4 } |
− | $ | + | $I_6 \ = \ $ { 1 } |
− | $ | + | $I_7 \ = \ $ { 8 } |
− | $ | + | $I_8 \ = \ $ { 7 } |
− | { | + | |
− | |type=" | + | {What is the output string after $i = 16$ decoding steps? |
+ | |type="()"} | ||
- <b>AABCAABAABCAABCABBDBBDCDBA</b>, | - <b>AABCAABAABCAABCABBDBBDCDBA</b>, | ||
+ <b>AABCAABAABCABBDCABCABBDDBA</b>, | + <b>AABCAABAABCABBDCABCABBDDBA</b>, | ||
Line 55: | Line 67: | ||
− | { | + | {The next index is: $I_{17} = 10010$ (binary) $= 18$ (decimal). Which of the following statements are true? |
|type="[]"} | |type="[]"} | ||
− | + | + | + The decoder output to step $i = 17$ is "<b>DB</b>". |
− | - | + | - The decoder output to step $i = 17$ is "<b>BAD</b>". |
− | - | + | - New dictionary entry "<b>DB</b>" in line $I = 19$. |
− | + | + | + New dictionary entry "<b>BAD</b>" in line $I = 19$. |
− | { | + | {Let the decoder input string consist of $N_{\rm bit} = 300$ binary characters (bits). Which of the following statements is then certainly true? |
|type="[]"} | |type="[]"} | ||
− | + | + | + $58$ LZW phrases with variable bit length were transmitted. |
− | - | + | - With a uniform index bit length, fewer bits would be required. |
− | - | + | - A file with $N =150$ quaternary characters was transmitted. |
Line 73: | Line 85: | ||
</quiz> | </quiz> | ||
− | === | + | ===Solution=== |
{{ML-Kopf}} | {{ML-Kopf}} | ||
− | + | '''(1)''' For the LZW coding of the first character $(i = 1)$, only the four indices $I = 0$, ... , $I = 3$ are possible for which a dictionary entry has already been made at step $i = 0$ (preassignment). Therefore, it is sufficient here to transmit the index with two bits. | |
− | + | *In contrast, three bits are already required for step $i = 2$. If the input sequence had started with <b>AAA</b> , the LZW coding $I_2 = I(i = 2)$ would have resulted in the decimal value $4$. This can no longer be represented with two bits and must be coded with three bits, just like $I_3$, $I_4$ and $I_5$. | |
− | + | *The given input string:$$\boldsymbol{\rm 00 000 001 010 100 0001}\hspace{0.05cm}...$$ | |
− | :$$\boldsymbol{\rm 00 000 001 010 100 0001}\hspace{0.05cm}...$$ | + | :must therefore be divided by the decoder as follows: |
− | : | + | :$$\boldsymbol{\rm 00 |000 |001 |010 |100 |0001|}\hspace{0.05cm}...$$ |
− | :$$\boldsymbol{\rm 00 | 000 |001 |010 |100 |0001|}\hspace{0.05cm}...$$ | ||
− | |||
− | :* <u>< | + | *The decoder results of the first four steps are therefore: |
+ | :* $I_1 =$ <b>00</b> (binary) <u> = 0</u> (decimal) ⇒ <b>A</b>, | ||
+ | :* $I_2 =$ <b>000</b> (binary) <u> = 0</u> (decimal) ⇒ <b>A</b>, | ||
+ | :* $I_3 =$ <b>001</b> (binary) <u> = 1</u> (decimal) ⇒ <b>B</b>, | ||
+ | :* $I_4 =$ <b>010</b> (binary) <u> = 2</u> (decimal) ⇒ <b>C</b>. | ||
− | |||
− | + | '''(2)''' If one takes into account that | |
+ | * for $i = 1$ two bits are used, | ||
+ | * for $2 ≤ i ≤ 5$ three bits are used, | ||
+ | * for $6 ≤ i ≤ 19$ four bits are used, | ||
+ | * for $14 ≤ i ≤ 29$ five bits are used, | ||
− | |||
− | + | we get from the "continuous decoder input string" | |
− | + | :$$\boldsymbol{\rm 00 000 001 010 100 0001 1000 0111 0001 0001 0011 1011 1011 01101 00011 01001}$$ | |
− | : | ||
− | + | to the "divided decoder input string" $($designated $I_1$, ... , $I_{16})$: | |
− | + | :$$\boldsymbol{\rm 00 |000 |001 |010 |100 |0001 |1000 |0111 | 0001 |0001 |0011 |1011 |1011 |01101 |00011 |01001} | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | :$$\boldsymbol{\rm 00 |000 |001 |010 |100 |0001 |1000 |0111 | | ||
− | |||
\hspace{0.1cm}.$$ | \hspace{0.1cm}.$$ | ||
− | |||
− | :* <u>< | + | Thus the results sought for the decoding steps $i = 5$, ... , $i = 8$: |
+ | * $I_5 =$ <b>100</b> (binary) <u> = 4</u> (decimal) ⇒ <b>AA</b>, | ||
+ | * $I_6 =$ <b>0001</b> (binary) <u> = 1</u> (decimal) ⇒ <b>B</b>, | ||
+ | * $I_7 =$ <b>1000</b> (binary) <u> = 8</u> (decimal) ⇒ <b>AAB</b>, | ||
+ | * $I_8 =$ <b>0111</b> (binary) <u> = 7</u> (decimal) ⇒ <b>CA</b>. | ||
− | |||
− | + | '''(3)''' <u>Statement 2</u> is correct. | |
+ | *The following decoding results are obtained (in abbreviated form): | ||
− | : | + | :: $I_{9} = 1$ ⇒ <b>B</b>, $\hspace{0.5cm}I_{10} = 1$ ⇒ <b>B</b>,$\hspace{0.5cm}I_{11} = 3$ ⇒ <b>D</b>,$\hspace{0.5cm}I_{12} = 11$ ⇒ <b>CAB</b>, |
− | :<b>3 | + | :: $I_{13} = 11$ ⇒ <b>CAB</b>,$\hspace{0.5cm}I_{14} = 13$ ⇒ <b>BD</b>,$\hspace{0.5cm}I_{15} = 3$ ⇒ <b>D</b>,$\hspace{0.5cm}I_{16} = 9$ ⇒ <b>BA</b>. |
− | |||
− | |||
− | + | '''(4)''' Correct are the<u>statements 1 and 4</u>. Reason: | |
− | + | * The new incoming index is $18$ (decimal) and in the dictionary the entry "<b>DB</b>" is found under the index $I = 18$. | |
+ | * For the decoding step $i = 17$ , the line $I = 19$ is entered into the dictionary. | ||
+ | * The entry "<b>BAD</b>" is composed of the decoding result from step $i = 16$ $($<b>BA</b>$)$ and the first character $($<b>D</b>$)$ of the new result "<b>DB</b>". | ||
− | |||
− | |||
− | :* | + | '''(5)''' Only <u>statement 1</u> is correct here: |
+ | * Two bits are sufficient for the first phrase. | ||
+ | * For the phrases $I_2$, ... , $I_5$ three bits are needed. | ||
+ | * The phrases $I_6$, ... , $I_{13}$ require four bits. | ||
+ | * The phrases $I_{14}$, ... , $I_{29}$ require five bits. | ||
+ | * The phrases $I_{30}$, ... , $I_{58}$ require six bits. | ||
− | : | + | *This gives the total number of bits: |
+ | :$$N_{\rm bits} = 1 \cdot 2 + 4 \cdot 3 + 8 \cdot 4 + 16 \cdot 5 + 29 \cdot 6 = 300 \hspace{0.05cm}.$$ | ||
− | + | *With a uniform bit length (six bits for each index), $58 · 6 = 348$ bits (i.e. more) would be required ⇒ statement 2 is wrong in principle. | |
− | |||
− | + | Now to the third statement, which is also not true: | |
− | + | * With the uncoded transmission of $N = 150$ characters from the symbol set $\{$ <b>A</b>, <b>B</b>, <b>C</b>, <b>D</b> $\}$ one would need exactly $300$ bits. With LZW one certainly needs more bits if the source is redundancy-free ⇒ characters equally likely and statistically independent. | |
− | + | * With a redundant source with (e.g.) $H = 1.6$ bit/source symbol, one can limit the number of bits to $N_{\rm bits} = N \cdot H$, provided it is a very large file $(N \to \infty)$. | |
− | + | * With a rather small file – as here with only $N = 150$ bits – it is not possible to say whether $N_{\rm bits}$ will be less than, equal to or greater than $150 · 1.6 = 240$. | |
− | + | *Also quite possible is $N_{\rm bits} > 300$. In that case, it is better to do without the "Lempel–Ziv compression". | |
− | + | {{ML-Fuß}} | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | [[Category: | + | [[Category:Information Theory: Exercises|^2.2 Lempel-Ziv-Welch Compression^]] |
Latest revision as of 13:11, 10 August 2021
The compression algorithm "LZW" – named after the inventors Abraham Lempel, Jacob Ziv and Terry Welch – works like "LZ78" with a global dictionary. In this case, all possible characters – in the example A, B, C, D – are preassigned at the beginning and are successively expanded during encoding.
During decoding, exactly the same dictionary is created, only the same entry with index $I$ is made one step later than during encoding. To designate the decoding steps, we use the variable $i$.
Here are some more notes on LZW encoding and decoding:
- During encoding, at each time $i$, a search is made in the dictionary for the longest possible character string that matches the input string currently present.
- The dictionary index $I_i$ found is always transmitted in binary form. At the same time, $W(I_i) + Z$ is entered into the dictionary under the next free index.
- Here, $W(I_i)$ denotes a character or a character string, and $Z$ is the first character of the upcoming input string (i.e. also a character), which is no longer considered in $W(I_i)$ .
- With $M = 4$ possible characters, the first index $I_1$ is transmitted with two bits, the indices $I_2$, ... , $I_5$ with three bits, the next eight indices with four bits, then 16 indices with five bits, etc. The reason for this bit-saving measure can be found in the solution to this task.
After coding in the manner described here over $16$ coding steps, the following binary sequence of length $N_{\rm bit} = 61$:
- $$\boldsymbol{\rm 00 \ 000 \ 001 \ 010 \ 100 \ 0001 \ 1000 \ 0111 \ 0001 \ 0001 \ 0011 \ 1011 \ 1011 \ 01101 \ 00011 \ 01001}\hspace{0.05cm}.$$
The task of the decoder (more precisely: your task) is now,
- to reconstruct from this binary sequence the indices $I_1$, ... , $I_{16}$ taking into account the different number of bits (description of the 16 decoding results by decimal numbers),
- then read out the corresponding characters or character sequences from the dictionary according to the indices and finally – with one step delay – generate the new dictionary entry.
Hints:
- The exercise belongs to the chapter Compression according to Lempel, Ziv and Welch.
- In particular, reference is made to the pages
- Exercise 2.3 and Exercise 2.3Z deal with other Lempel-Ziv methods in a similar way.
Questions
Solution
- In contrast, three bits are already required for step $i = 2$. If the input sequence had started with AAA , the LZW coding $I_2 = I(i = 2)$ would have resulted in the decimal value $4$. This can no longer be represented with two bits and must be coded with three bits, just like $I_3$, $I_4$ and $I_5$.
- The given input string:$$\boldsymbol{\rm 00 000 001 010 100 0001}\hspace{0.05cm}...$$
- must therefore be divided by the decoder as follows:
- $$\boldsymbol{\rm 00 |000 |001 |010 |100 |0001|}\hspace{0.05cm}...$$
- The decoder results of the first four steps are therefore:
- $I_1 =$ 00 (binary) = 0 (decimal) ⇒ A,
- $I_2 =$ 000 (binary) = 0 (decimal) ⇒ A,
- $I_3 =$ 001 (binary) = 1 (decimal) ⇒ B,
- $I_4 =$ 010 (binary) = 2 (decimal) ⇒ C.
(2) If one takes into account that
- for $i = 1$ two bits are used,
- for $2 ≤ i ≤ 5$ three bits are used,
- for $6 ≤ i ≤ 19$ four bits are used,
- for $14 ≤ i ≤ 29$ five bits are used,
we get from the "continuous decoder input string"
- $$\boldsymbol{\rm 00 000 001 010 100 0001 1000 0111 0001 0001 0011 1011 1011 01101 00011 01001}$$
to the "divided decoder input string" $($designated $I_1$, ... , $I_{16})$:
- $$\boldsymbol{\rm 00 |000 |001 |010 |100 |0001 |1000 |0111 | 0001 |0001 |0011 |1011 |1011 |01101 |00011 |01001} \hspace{0.1cm}.$$
Thus the results sought for the decoding steps $i = 5$, ... , $i = 8$:
- $I_5 =$ 100 (binary) = 4 (decimal) ⇒ AA,
- $I_6 =$ 0001 (binary) = 1 (decimal) ⇒ B,
- $I_7 =$ 1000 (binary) = 8 (decimal) ⇒ AAB,
- $I_8 =$ 0111 (binary) = 7 (decimal) ⇒ CA.
(3) Statement 2 is correct.
- The following decoding results are obtained (in abbreviated form):
- $I_{9} = 1$ ⇒ B, $\hspace{0.5cm}I_{10} = 1$ ⇒ B,$\hspace{0.5cm}I_{11} = 3$ ⇒ D,$\hspace{0.5cm}I_{12} = 11$ ⇒ CAB,
- $I_{13} = 11$ ⇒ CAB,$\hspace{0.5cm}I_{14} = 13$ ⇒ BD,$\hspace{0.5cm}I_{15} = 3$ ⇒ D,$\hspace{0.5cm}I_{16} = 9$ ⇒ BA.
(4) Correct are thestatements 1 and 4. Reason:
- The new incoming index is $18$ (decimal) and in the dictionary the entry "DB" is found under the index $I = 18$.
- For the decoding step $i = 17$ , the line $I = 19$ is entered into the dictionary.
- The entry "BAD" is composed of the decoding result from step $i = 16$ $($BA$)$ and the first character $($D$)$ of the new result "DB".
(5) Only statement 1 is correct here:
- Two bits are sufficient for the first phrase.
- For the phrases $I_2$, ... , $I_5$ three bits are needed.
- The phrases $I_6$, ... , $I_{13}$ require four bits.
- The phrases $I_{14}$, ... , $I_{29}$ require five bits.
- The phrases $I_{30}$, ... , $I_{58}$ require six bits.
- This gives the total number of bits:
- $$N_{\rm bits} = 1 \cdot 2 + 4 \cdot 3 + 8 \cdot 4 + 16 \cdot 5 + 29 \cdot 6 = 300 \hspace{0.05cm}.$$
- With a uniform bit length (six bits for each index), $58 · 6 = 348$ bits (i.e. more) would be required ⇒ statement 2 is wrong in principle.
Now to the third statement, which is also not true:
- With the uncoded transmission of $N = 150$ characters from the symbol set $\{$ A, B, C, D $\}$ one would need exactly $300$ bits. With LZW one certainly needs more bits if the source is redundancy-free ⇒ characters equally likely and statistically independent.
- With a redundant source with (e.g.) $H = 1.6$ bit/source symbol, one can limit the number of bits to $N_{\rm bits} = N \cdot H$, provided it is a very large file $(N \to \infty)$.
- With a rather small file – as here with only $N = 150$ bits – it is not possible to say whether $N_{\rm bits}$ will be less than, equal to or greater than $150 · 1.6 = 240$.
- Also quite possible is $N_{\rm bits} > 300$. In that case, it is better to do without the "Lempel–Ziv compression".