Exercise 2.13: Inverse Burrows-Wheeler Transformation

From LNTwww

BWT result to be analyzed

The  "Burrows–Wheeler Transformation" – abbreviated  BWT  – causes a blockwise sorting of the characters of a text with the aim of preparing the text for efficient data compression with the help of run-length coding or entropy coding.

  • First, an  N×N matrix is generated from a block of length  N,  with each row of this first matrix resulting from the preceding row by cyclic left shift.
  • Then the matrix is sorted lexicographically  (without special characters:   alphabetically) .  The result of the BWT is the last row of the new BWT matrix, the so-called  L column  (from "Last").
  • Further, this task refers to the  F column  (from "First", first row of the BWT matrix), which is needed for the inverse Burrows–Wheeler Transformation   ⇒   reconstruction of the original text from the L column.
  • For the inverse BWT, the so-called  "primary index"  I  is also required.  This indicates the row of the BWT matrix in which the algorithm must be started.


The graphic shows the result of a BWT, more precisely its L column.  The original text is to be reconstructed from this according to the description in the theory section  Burrows–Wheeler Transformation,

  • in subtask  (2)  with the primary index  I=7,
  • in subtask  (3)    I=0  is to be assumed.




Hints:


Questions

1

What is the  F column associated with the given  L column ?

SEINMEINDEIN,
NIIINEEEDSMN,
DEEEIIIMNNNS.

2

What is the result of the reconstruction with primary index  I=7_?

MEINDEINSEIN,
DEINSEINMEIN,
NIESNIEDNIEM.

3

What happens if the reconstruction  (inverse BWT transformation)  starts from the wrong primary index  I=0 ?

MEINDEINSEIN,
DEINSEINMEIN,
NIESNIEDNIEM.

4

Why is the Burrows–Wheeler transformation better suited than the original with regard to a later data compression?

It results in more favourable character frequencies.
All characters are sorted lexicographically.
Identical characters follow each other more often in the BWT.


Solution

(1)  The solution suggestion 3 is correct:

  • The first column of the BWT matrix is also called the  "F column"  and the last column the  "L column"  (from  "First"  or  "Last").
  • Only the  "L column" is passed on to the next coding level.
  • The  "F column", which is also needed for the reverse transformation, results from the  "L column"  by lexicographically sorting.


Inverse BWT with  I=7  (left) or  I=0  (right)

(2)  The solution suggestion 1 is correct:   MEINDEINSEIN, as can be seen from the left-hand representation of the following diagram.
Note that the top line represents the line number  I=0  in each case. For explanation:

  • Start the decoding with the line  I=7  of the  "F column". 
    The content is  M.
  • Search for the corresponding  M  in the  "L column"  and find it in line number  "1".
  • From line 1 of the  "L column"  one goes horizontally to the  "F column"  and finds the symbol  E.
  • Similarly, one finds the third output symbol  I  in line 4 of the  "F column".
  • The decoding algorithm ends with the output symbol  N  in the third last row.


(3)  Correct is the proposed solution 2:   DEINSEINMEIN, as shown in the graph on the right.

(4)  Correct is the suggested solution 3:

  • In BWT, four characters here are equal to their predecessors, in the original none.
  • In the  "F column",  even more characters would be the same as their respective predecessors  (6 in total)  due to the lexicographical sorting, but this sorting cannot be reversed without loss.
  • Solution suggestion 1 is wrong too:  
    The original and BWT contain exactly the same characters  (three times  E,  three times  I,  three times  N and one each of  DM  and  S).