Exercise 2.4: About the LZW Algorithm

From LNTwww

LZW dictionary for Encoding/Decoding

The compression algorithm  "LZW"  – named after the inventors  Abraham LempelJacob Ziv  and  Terry Welch  – works like  "LZ78"  with a global dictionary.   In this case, all possible characters – in the example  ABCD  – are preassigned at the beginning and are successively expanded during encoding.

During decoding, exactly the same dictionary is created, only the same entry with index  $I$  is made one step later than during encoding.  To designate the decoding steps, we use the variable  $i$.

Here are some more notes on LZW encoding and decoding:

  • During encoding, at each time  $i$,  a search is made in the dictionary for the longest possible character string that matches the input string currently present.
  • The dictionary index  $I_i$  found is always transmitted in binary form.  At the same time,  $W(I_i) + Z$  is entered into the dictionary under the next free index.
  • Here,  $W(I_i)$  denotes a character or a character string, and  $Z$  is the first character of the upcoming input string (i.e. also a character), which is no longer considered in  $W(I_i)$ .
  • With  $M = 4$  possible characters, the first index  $I_1$  is transmitted with two bits, the indices  $I_2$, ... , $I_5$  with three bits, the next eight indices with four bits, then 16 indices with five bits, etc. The reason for this bit-saving measure can be found in the solution to this task.


After coding in the manner described here over  $16$  coding steps, the following binary sequence of length  $N_{\rm bit} = 61$:

$$\boldsymbol{\rm 00 \ 000 \ 001 \ 010 \ 100 \ 0001 \ 1000 \ 0111 \ 0001 \ 0001 \ 0011 \ 1011 \ 1011 \ 01101 \ 00011 \ 01001}\hspace{0.05cm}.$$

The task of the decoder  (more precisely:  your task)  is now,

  • to reconstruct from this binary sequence the indices  $I_1$, ... , $I_{16}$  taking into account the different number of bits  (description of the 16 decoding results by decimal numbers),
  • then read out the corresponding characters or character sequences from the dictionary according to the indices and finally – with one step delay – generate the new dictionary entry.





Hints:

LZ77 - the basic form of Lempel-Ziv algorithms,
Lempel-Ziv coding with variable index bit length,
Decoding of the LZW algorithm.


Questions

1

Which indices are available for the first four steps  $(i=1$, ... , $i=4)$  for decoding.  Enter all  $I_i = I(i)$  as decimal numbers.

$I_1 \ = \ $

$I_2 \ = \ $

$I_3 \ = \ $

$I_4 \ = \ $

2

Continue the subdivision of the decoder input string to the end.  Which indices result from the decoding steps  $i=5$, ... , $i=8$?

$I_5 \ = \ $

$I_6 \ = \ $

$I_7 \ = \ $

$I_8 \ = \ $

3

What is the output string after  $i = 16$  decoding steps?

AABCAABAABCAABCABBDBBDCDBA,
AABCAABAABCABBDCABCABBDDBA,
AABCAABAABCABDBBABCCDBAABDCDBA.

4

The next index is:   $I_{17} = 10010$ (binary) $= 18$ (decimal).  Which of the following statements are true?

The decoder output to step  $i = 17$  is  "DB".
The decoder output to step  $i = 17$  is  "BAD".
New dictionary entry  "DB"  in line  $I = 19$.
New dictionary entry  "BAD"  in line  $I = 19$.

5

Let the decoder input string consist of  $N_{\rm bit} = 300$  binary characters (bits).  Which of the following statements is then certainly true?

$58$  LZW phrases with variable bit length were transmitted.
With a uniform index bit length, fewer bits would be required.
A file with  $N =150$  quaternary characters was transmitted.


Solution

(1)  For the LZW coding of the first character  $(i = 1)$,  only the four indices  $I = 0$, ... , $I = 3$  are possible for which a dictionary entry has already been made at step  $i = 0$  (preassignment).  Therefore, it is sufficient here to transmit the index with two bits.

  • In contrast, three bits are already required for step  $i = 2$.  If the input sequence had started with  AAA , the LZW coding  $I_2 = I(i = 2)$  would have resulted in the decimal value  $4$.  This can no longer be represented with two bits and must be coded with three bits, just like  $I_3$,  $I_4$  and  $I_5$.
  • The given input string:$$\boldsymbol{\rm 00 000 001 010 100 0001}\hspace{0.05cm}...$$
must therefore be divided by the decoder as follows:
$$\boldsymbol{\rm 00 |000 |001 |010 |100 |0001|}\hspace{0.05cm}...$$
  • The decoder results of the first four steps are therefore:
  • $I_1 =$ 00 (binary)      = 0 (decimal)   ⇒   A,
  • $I_2 =$ 000 (binary)    = 0 (decimal)   ⇒   A,
  • $I_3 =$ 001 (binary)    = 1 (decimal)   ⇒   B,
  • $I_4 =$ 010 (binary)    = 2 (decimal)   ⇒   C.


(2)  If one takes into account that

  • for  $i = 1$  two bits are used,
  • for  $2 ≤ i ≤ 5$  three bits are used,
  • for  $6 ≤ i ≤ 19$  four bits are used,
  • for  $14 ≤ i ≤ 29$  five bits are used,


we get from the  "continuous decoder input string"

$$\boldsymbol{\rm 00 000 001 010 100 0001 1000 0111 0001 0001 0011 1011 1011 01101 00011 01001}$$

to the  "divided decoder input string"  $($designated  $I_1$, ... ,  $I_{16})$:

$$\boldsymbol{\rm 00 |000 |001 |010 |100 |0001 |1000 |0111 | 0001 |0001 |0011 |1011 |1011 |01101 |00011 |01001} \hspace{0.1cm}.$$

Thus the results sought for the decoding steps  $i = 5$, ... ,  $i = 8$:

  • $I_5 =$ 100 (binary)      = 4 (decimal)   ⇒   AA,
  • $I_6 =$ 0001 (binary)    = 1 (decimal)   ⇒   B,
  • $I_7 =$ 1000 (binary)    = 8 (decimal)   ⇒   AAB,
  • $I_8 =$ 0111 (binary)    = 7 (decimal)   ⇒   CA.


(3) Statement 2 is correct.

  • The following decoding results are obtained (in abbreviated form):
  $I_{9} = 1$   ⇒   B, $\hspace{0.5cm}I_{10} = 1$   ⇒  B,$\hspace{0.5cm}I_{11} = 3$   ⇒   D,$\hspace{0.5cm}I_{12} = 11$   ⇒   CAB,
  $I_{13} = 11$   ⇒   CAB,$\hspace{0.5cm}I_{14} = 13$   ⇒   BD,$\hspace{0.5cm}I_{15} = 3$   ⇒   D,$\hspace{0.5cm}I_{16} = 9$   ⇒   BA.


(4)  Correct are thestatements 1 and 4.  Reason:

  • The new incoming index is  $18$  (decimal) and in the dictionary the entry  "DB"  is found under the index  $I = 18$.
  • For the decoding step  $i = 17$ , the line  $I = 19$  is entered into the dictionary.
  • The entry  "BAD"  is composed of the decoding result from step  $i = 16$  $($BA$)$  and the first character  $($D$)$  of the new result  "DB".


(5)  Only statement 1 is correct here:

  • Two bits are sufficient for the first phrase.
  • For the phrases  $I_2$, ... , $I_5$  three bits are needed.
  • The phrases  $I_6$, ... , $I_{13}$  require four bits.
  • The phrases  $I_{14}$, ... , $I_{29}$  require five bits.
  • The phrases  $I_{30}$, ... , $I_{58}$  require six bits.
  • This gives the total number of bits:
$$N_{\rm bits} = 1 \cdot 2 + 4 \cdot 3 + 8 \cdot 4 + 16 \cdot 5 + 29 \cdot 6 = 300 \hspace{0.05cm}.$$
  • With a uniform bit length (six bits for each index),  $58 · 6 = 348$  bits (i.e. more) would be required   ⇒   statement 2 is wrong in principle.


Now to the third statement, which is also not true:

  • With the uncoded transmission of  $N = 150$  characters from the symbol set  $\{$ ABCD $\}$  one would need exactly  $300$  bits.  With LZW one certainly needs more bits if the source is redundancy-free   ⇒   characters equally likely and statistically independent.
  • With a redundant source with (e.g.)  $H = 1.6$  bit/source symbol, one can limit the number of bits to  $N_{\rm bits} = N \cdot H$,  provided it is a very large file  $(N \to \infty)$.
  • With a rather small file – as here with only  $N = 150$  bits – it is not possible to say whether   $N_{\rm bits}$  will be less than, equal to or greater than  $150 · 1.6 = 240$.
  • Also quite possible is  $N_{\rm bits} > 300$.  In that case, it is better to do without the  "Lempel–Ziv compression".