Exercise 2.4: About the LZW Algorithm
The compression algorithm "LZW " – named after the inventors Abraham Lempel, Jacob Ziv and Terry Welch – works like "LZ78 " with a global dictionary. In this case, all possible characters – in the example A, B, C, D – are preassigned at the beginning and are successively expanded during encoding.
During decoding, exactly the same dictionary is created, only the same entry with index $I$ is made one step later than during coding. To designate the decoding steps, we use the run variable $i$ here.
Here are some more notes on LZW encoding and decoding:
- During encoding, at each time $i$ , a search is made in the dictionary for the longest possible character string that matches the input string currently present.
- The dictionary index $I_i$ found is always transmitted in binary form. At the same time, $W(I_i) + Z$ is entered into the dictionary under the next free index.
- Here, $W(I_i)$ denotes a character or a character string, and $Z$ is the first character of the upcoming input string (i.e. also a character), which is no longer considered in $W(I_i)$ .
- With $M = 4$ possible characters, the first index $I_1$ is transmitted with two bits, the indices $I_2$, ... , $I_5$ with three bits, the next eight indices with four bits, then 16 indices with five bits, etc. The reason for this bit-saving measure can be found in the sample solution to this task.
After coding in the manner described here over $16$ coding steps, the following binary sequence of length $N_{\rm Bit} = 61$:
- $$\boldsymbol{\rm 00 \ 000 \ 001 \ 010 \ 100 \ 0001 \ 1000 \ 0111 \ 0001 \ 0001 \ 0011 \ 1011 \ 1011 \ 01101 \ 00011 \ 01001}\hspace{0.05cm}.$$
The task of the decoder (more precisely: your task) is now,
- to reconstruct from this binary sequence the indices $I_1$, ... , $I_{16}$ taking into account the different number of bits (description of the 16 decoding results by decimal numbers),
- then read out the corresponding characters or character sequences from the dictionary according to the indices and finally – with one step delay – generate the new dictionary entry.
Hints:
- The task belongs to the chapter Compression according to Lempel, Ziv and Welch.
- Insbesondere wird Bezug genommen auf die Seiten
Questions
Solution
- In contrast, three bits are already required for step $i = 2$ . If the input sequence had started with AAA , the LZW coding $I_2 = I(i = 2)$ would have resulted in the decimal value $4$ . This can no longer be represented with two bits and must be coded with three bits, just like $I_3$, $I_4$ and $I_5$.
- The given input string:$$\boldsymbol{\rm 00 000 001 010 100 0001}\hspace{0.05cm}...$$
- must therefore be divided by the decoder as follows:
- $$\boldsymbol{\rm 00 | 000 |001 |010 |100 |0001|}\hspace{0.05cm}...$$
- The decoder results of the first four steps are therefore:
- $I_1 =$ 00 (binary) = 0 (decimal) ⇒ A,
- $I_2 =$ 000 (binary) = 0 (decimal) ⇒ A,
- $I_3 =$ 001 (binary) = 1 (decimal) ⇒ B,
- $I_4 =$ 010 (binary) = 2 (decimal) ⇒ C.
(2) If one takes into account that
- for $i = 1$ two bits are used,
- for $2 ≤ i ≤ 5$ three bits,
- for $6 ≤ i ≤ 19$ four bits,
- for $14 ≤ i ≤ 29$ five bits,
we get from the "continuous decoder input string"
- $$\boldsymbol{\rm 00 000 001 010 100 0001 1000 0111 0001 0001 0011 1011 1011 01101 00011 01001}$$
to the "divided decoder input string" $($designated $I_1$, ... , $I_{16})$:
- $$\boldsymbol{\rm 00 |000 |001 |010 |100 |0001 |1000 |0111 | 0001 |0001 |0011 |1011 |1011 |01101 |00011 |01001} \hspace{0.1cm}.$$
Damit lauten die gesuchten Ergebnisse für die Decodierschritte $i = 5$, ... , $i = 8$:
- $I_5 =$ 100 (binary) = 4 (decimal) ⇒ AA,
- $I_6 =$ 0001 (binary) = 1 (decimal) ⇒ B,
- $I_7 =$ 1000 (binary) = 8 (decimal) ⇒ AAB,
- $I_8 =$ 0111 (binary) = 7 (decimal) ⇒ CA.
(3) Statement 2 is correct.
- The following decoding results are obtained (in abbreviated form):
- $I_{9} = 1$ ⇒ B, $\hspace{0.5cm}I_{10} = 1$ ⇒ B,$\hspace{0.5cm}I_{11} = 3$ ⇒ D,$\hspace{0.5cm}I_{12} = 11$ ⇒ CAB,
- $I_{13} = 11$ ⇒ CAB,$\hspace{0.5cm}I_{14} = 13$ ⇒ BD,$\hspace{0.5cm}I_{15} = 3$ ⇒ D,$\hspace{0.5cm}I_{16} = 9$ ⇒ BA.
(4) Correct are statements 1 and 4. Reason:
- The new incoming index is $18$ (decimal) and in the dictionary the entry DB is found under the index $I = 18$ .
- For the decoding step $i = 17$ , the line $I = 19$ is entered into the dictionary.
- The entry BAD is composed of the decoding result from step $i = 16$ $($BA$)$ and the first character $($D$)$ of the new result DB.
(5) Only statement 1 is correct here:
- Two bits are sufficient for the first phrase.
- For the phrases $I_2$, ... , $I_5$ three bits are needed.
- For the phrasesn $I_6$, ... , $I_{13}$ require four bits.
- For the phrases $I_{14}$, ... , $I_{29}$ require five bits.
- For the phrases $I_{30}$, ... , $I_{58}$ require six bits.
- This gives the total number of bits:
- $$N_{\rm Bit} = 1 \cdot 2 + 4 \cdot 3 + 8 \cdot 4 + 16 \cdot 5 + 29 \cdot 6 = 300 \hspace{0.05cm}.$$
- With a uniform bit length (six bits for each index), $58 · 6 = 348$ bits (i.e. more) would be required ⇒ statement 2 is wrong in principle.
Now to the third statement, which is also not true:
- With the uncoded transmission of $N = 150$ characters from the symbol set $\{$ A, \ B, \ C, \ D $\}$ one would need exactly $300$ bits. With LZW one certainly needs more bits if the source is redundancy-free characters equally likely and statistically independent).
- With redundant source with (for example) $H = 1.6$ bit/source symbol, one can limit the number of bits to $N_{\rm Bit} = N \cdot H$ , provided it is a very large file $(N \to \infty)$.
- With a rather small file – as here only with $N = 150$ bit – it is not possible to say whether the number of bits $N_{\rm Bit}$ will be less than, equal to or greater than $150 · 1.6 = 240$ .
- $N_{\rm Bit} > 300$ is also quite possible. In that case, it is better to do without the "Lempel–Ziv–compression".