Exercise 2.13Z: Combination of BWT and MTF: Difference between revisions

Revision as of 14:48, 9 August 2021

Scheme for Burrows-Wheeler data compression

We refer to the theory page Application Scenario for the Burrows-Wheeler Transform and consider the coding system sketched on the right, consisting of the blocks.

Burrows–Wheeler–Transformation $\rm (BWT)$ as described in Exercise 2.13; the two character sets at the input and output of the BWT are the same: $\{$ $\rm D$, $\rm E$, $\rm I$, $\rm M$, $\rm N$, $\rm S$ $\}$;
Move–to–Front $\rm (MTF)$, a sorting algorithm that outputs a string of the same length $($ $N = 12 in the example)$, but with a different alphabet $\{$0, 1, 2, 3, 4, 5$\}$ ;
$\rm RLC0$ – a run-length encoding specifically for zero, which is (as) frequent according to $\rm BWT$ and $\rm MTF$ ; all other indices are not changed by $\rm RLC0$ ;
$\rm Huffman$ as described in the chapter Entropy coding according to Huffman; frequent characters are represented by short binary sequences and rare ones by long ones.

The $\rm MTF$–algorithm can be described as follows for $M = 6$ input symbols:

The output sequence of the $\rm MTF$ is a string of indices from the set

$ I = \{$0, 1, 2, 3, 4, 5$\}$.

Before starting the actual $\rm MTF$ algorithm, the possible input symbols are sorted lexicographically and assigned to the following indices:

$\rm D$ → 0, $\rm E$ → 1, $\rm I$ → 2, $\rm M$ → 3, $\rm N$ → 4, $\rm S$ → 5.

Let the $\rm MTF$ input string here be $\rm N\hspace{0.05cm}M\hspace{0.05cm}S\hspace{0.05cm}D\hspace{0.05cm}E\hspace{0.05cm}E\hspace{0.05cm}E\hspace{0.05cm}N\hspace{0.05cm}I\hspace{0.05cm}I\hspace{0.05cm}I\hspace{0.05cm}N$. This was the $\rm BWT$ result in Exercise 2.13. The first $\rm N$ is represented as $I = 4$ according to the default setting.
Then the $\rm N$ is placed at the beginning in the sorting, so that after the coding step $i = 1$ the assignment holds:

$\rm N$ → 0, $\rm D$ → 1, $\rm E$ → 2, $\rm I$ → 3, $\rm M$ → 4, $\rm S$ → 5.

Continue in the same way until the entire input text has been processed. If a character is already at position 0, no reordering is necessary.

Hints:

The task belongs to the chapter Other source coding methods.
In particular, reference is made to the page Burrows–Wheeler Transformation.
Information on the Huffman code can be found in the chapter Entropy Coding according to Huffman. This information is not necessary for the solution of the task.

Questions

Musterlösung

Solution

(1) Die Grafik auf der Angabenseite zeigt, dass die Lösungsvorschläge 1 und 2 richtig sind und der Vorschlag 3 falsch ist:

$\rm E$ und $\rm I$ treten zwar gruppiert auf,
aber nicht die $\rm N$–Zeichen.

(2) Richtig sind die Lösungsvorschläge 2 und 3:

Die Eingangsfolge wird Zeichen für Zeichen abgearbeitet. Auch die Ausgangsfolge hat somit die Länge $N = 12$.
Tatsächlich wird die Eingangsmenge $\{ \hspace{0.05cm}\rm D,\hspace{0.05cm} E,\hspace{0.05cm} I,\hspace{0.05cm} M,\hspace{0.05cm} N , \hspace{0.05cm} S \}$ in die Ausgangsmenge $\{ \hspace{0.05cm}\rm 0,\hspace{0.05cm} 1,\hspace{0.05cm} 2,\hspace{0.05cm} 3,\hspace{0.05cm} 4 , \hspace{0.05cm} 5 \}$ gewandelt.
Allerdings nicht durch einfaches Mapping, sondern durch einen Algorithmus, der nachfolgend skizziert wird.

(3) Richtig ist der Lösungsvorschlag 2:

Die Tabelle zeigt den MTF–Algorithmus. Der Schritt $i=0$ (rote Hinterlegung) gibt die Vorbelegung an. Die Eingabe der MTF ist gelb hinterlegt, die Ausgabe grün.
Im Schritt $i=1$ wird das Eingangszeichen $\rm N$ entsprechend der Spalte $i=0$ durch den Index $I = 4$ dargestellt. Anschließend wird $\rm N$ nach vorne sortiert, während die Reihenfolge der anderen Zeichen gleich bleibt.
Das Eingangszeichen $\rm M$ im zweiten Schritt erhält entsprechend der Spalte $i=2$ ebenfalls den Index $I = 4$. In gleicher Weise macht man weiter bis zum zwölften Zeichen $\rm N$, dem der Index $I = 1$ zugeordnet wird.
Man erkennt aus obiger Tabelle weiter, dass zu den Zeitpunkten $i=6$, $i=7$, $i=10$ und $i=11$ der Ausgabeindex jeweils $I = 0$ ist.

(4) Richtig sind die Aussagen 1 und 2:

Die Vorverarbeitungen "BWT" und "MTF" haben nur die Aufgabe, möglichst viele Nullen zu generieren.

(5) Alle Aussagen sind richtig.

Nähere Angaben zum Huffman–Algorithmus finden Sie im Kapitel "Entropiecodierung nach Huffman".

@@ Line 2: / Line 2: @@
 }}
-[[File:EN_Inf_Z_2_14b.png|right|frame|Schema für die Burrows–Wheeler–Datenkomprimierung]]
+[[File:EN_Inf_Z_2_14b.png|right|frame|Scheme for Burrows-Wheeler data compression]]
-Wir beziehen uns auf die Theorieseite&nbsp; [[Information_Theory/Weitere_Quellencodierverfahren#Anwendungsszenario_f.C3.BCr_die_Burrows.E2.80.93Wheeler.E2.80.93Transformation|Anwendungsszenario für die Burrows-Wheeler-Transformation]]&nbsp; und betrachten das rechts skizzierte Codiersystem, bestehend aus den Blöcken.
+We refer to the theory page&nbsp; [[Information_Theory/Weitere_Quellencodierverfahren#Application_scenario_for_the_Burrows-Wheeler_transformation|Application Scenario for the Burrows-Wheeler Transform]]&nbsp; and consider the coding system sketched on the right, consisting of the blocks.
-* <i>Burrows&ndash;Wheeler&ndash;Transformation</i>&nbsp; $\rm (BWT)$&nbsp; gemäß der Beschreibung in&nbsp; [[Aufgaben:2.13_Burrows-Wheeler-Rücktransformation|Aufgabe 2.13]]; die beiden Zeichenmengen am Eingang und Ausgang des BWT sind gleich: &nbsp; $\{$ $\rm D$,&nbsp; $\rm E$,&nbsp; $\rm I$,&nbsp; $\rm M$,&nbsp; $\rm N$,&nbsp; $\rm S$ $\}$;
+* <i>Burrows&ndash;Wheeler&ndash;Transformation</i>&nbsp; $\rm (BWT)$&nbsp; as described in&nbsp; [[Aufgaben:2.13_Burrows-Wheeler-Rücktransformation|Exercise 2.13]]; the two character sets at the input and output of the BWT are the same: &nbsp; $\{$ $\rm D$,&nbsp; $\rm E$,&nbsp; $\rm I$,&nbsp; $\rm M$,&nbsp; $\rm N$,&nbsp; $\rm S$ $\}$;
-* <i>Move&ndash;to&ndash;Front</i>&nbsp; $\rm (MTF)$, ein Sortieralgorithmus, der eine gleich lange Zeichenfolge&nbsp; $($im Beispiel&nbsp; $N = 12)$, aber mit anderem Alphabet&nbsp; $\{$<b>0</b>,&nbsp; <b>1</b>,&nbsp; <b>2</b>,&nbsp; <b>3</b>,&nbsp; <b>4</b>,&nbsp; <b>5</b>$\}$&nbsp; ausgibt;
+* <i>Move&ndash;to&ndash;Front</i>&nbsp; $\rm (MTF)$, a sorting algorithm that outputs a string of the same length&nbsp; $($&nbsp; $N = 12 in the example)$, but with a different alphabet&nbsp; $\{$<b>0</b>,&nbsp; <b>1</b>,&nbsp; <b>2</b>,&nbsp; <b>3</b>,&nbsp; <b>4</b>,&nbsp; <b>5</b>$\}$&nbsp;;
-* $\rm RLC0$&nbsp; &ndash; eine Lauflängencodierung speziell für die nach&nbsp; $\rm BWT$&nbsp; und&nbsp; $\rm MTF$&nbsp; (möglichst) häufige Null;&nbsp; alle anderen Indizes werden durch&nbsp; $\rm RLC0$&nbsp; nicht verändert;
+* $\rm RLC0$&nbsp; &ndash; a run-length encoding specifically for zero, which is (as) frequent according to&nbsp; $\rm BWT$&nbsp; and&nbsp; $\rm MTF$&nbsp; ;&nbsp; all other indices are not changed by&nbsp; $\rm RLC0$&nbsp;;
-* $\rm Huffman$&nbsp; gemäß der Beschreibung im Kapitel&nbsp; [[Information_Theory/Entropiecodierung_nach_Huffman|Entropiecodierung nach Huffman]]; häufige Zeichen werden durch kurze Binärfolgen dargestellt und seltene durch lange.
+* $\rm Huffman$&nbsp; as described in the chapter&nbsp; [[Information_Theory/Entropiecodierung_nach_Huffman|Entropy coding according to Huffman]]; frequent characters are represented by short binary sequences and rare ones by long ones.
-Der&nbsp; $\rm MTF$&ndash;Algorithmus lässt sich bei&nbsp; $M = 6$&nbsp; Eingangssymbolen  wie folgt beschreiben:
+The&nbsp; $\rm MTF$&ndash;algorithm can be described as follows for&nbsp; $M = 6$&nbsp; input symbols:
-* Die Ausgangsfolge des&nbsp; $\rm MTF$&nbsp; ist eine Aneinanderreihung von Indizes aus der Menge
+* The output sequence of the&nbsp; $\rm MTF$&nbsp; is a string of indices from the set
 :&nbsp; &nbsp;  &nbsp; &nbsp;$ I = \{$<b>0</b>,&nbsp; <b>1</b>,&nbsp; <b>2</b>,&nbsp; <b>3</b>,&nbsp; <b>4</b>,&nbsp; <b>5</b>$\}$.
-* Vor Beginn des eigentlichen&nbsp; $\rm MTF$&ndash;Algorithmus werden die möglichen Eingangssymbole lexikografisch sortiert und den folgenden Indizes zugeordnet:
+* Before starting the actual&nbsp; $\rm MTF$ algorithm, the possible input symbols are sorted lexicographically and assigned to the following indices:
 : &nbsp; &nbsp;  &nbsp; &nbsp;   $\rm D$ &nbsp; &#8594; &nbsp; <b>0</b>, &nbsp; &nbsp; $\rm E$ &nbsp; &#8594; &nbsp; <b>1</b>, &nbsp; &nbsp;  $\rm I$ &nbsp; &#8594; &nbsp; <b>2</b>, &nbsp; &nbsp; $\rm M$ &nbsp; &#8594; &nbsp; <b>3</b>, &nbsp; &nbsp;  $\rm N$ &nbsp; &#8594; &nbsp; <b>4</b>,&nbsp; &nbsp;  $\rm S$ &nbsp; &#8594; &nbsp; <b>5</b>.
-* Der&nbsp; $\rm MTF$&ndash;Eingabestring sei hier&nbsp; $\rm N\hspace{0.05cm}M\hspace{0.05cm}S\hspace{0.05cm}D\hspace{0.05cm}E\hspace{0.05cm}E\hspace{0.05cm}E\hspace{0.05cm}N\hspace{0.05cm}I\hspace{0.05cm}I\hspace{0.05cm}I\hspace{0.05cm}N$.&nbsp; Dies war das&nbsp; $\rm BWT$&ndash;Ergebnis in der&nbsp; [[Aufgaben:2.13_Burrows-Wheeler-Rücktransformation|Aufgabe 2.13]].&nbsp; Das erste&nbsp; $\rm N$&nbsp; wird gemäß Voreinstellung mit&nbsp; $I = 4$&nbsp; dargestellt.
+* Let the&nbsp; $\rm MTF$ input string here be&nbsp; $\rm N\hspace{0.05cm}M\hspace{0.05cm}S\hspace{0.05cm}D\hspace{0.05cm}E\hspace{0.05cm}E\hspace{0.05cm}E\hspace{0.05cm}N\hspace{0.05cm}I\hspace{0.05cm}I\hspace{0.05cm}I\hspace{0.05cm}N$.&nbsp; This was the&nbsp; $\rm BWT$ result in&nbsp; [[Aufgaben:2.13_Burrows-Wheeler-Rücktransformation|Exercise 2.13]].&nbsp; The first&nbsp; $\rm N$&nbsp; is represented as&nbsp; $I = 4$&nbsp; according to the default setting.
-* Anschließend wird das&nbsp; $\rm N$&nbsp; in der Sortierung an den Anfang gestellt, so dass nach dem Codierschritt&nbsp; $i = 1$&nbsp; die Zuordnung gilt:
+* Then the&nbsp; $\rm N$&nbsp; is placed at the beginning in the sorting, so that after the coding step&nbsp; $i = 1$&nbsp; the assignment holds:
 : &nbsp; &nbsp;  &nbsp; &nbsp;   $\rm N$ &nbsp; &#8594; &nbsp; <b>0</b>, &nbsp; &nbsp; $\rm D$ &nbsp; &#8594; &nbsp; <b>1</b>, &nbsp; &nbsp;  $\rm E$ &nbsp; &#8594; &nbsp;  <b>2</b>, &nbsp; &nbsp; $\rm I$ &nbsp; &#8594; &nbsp; <b>3</b>, &nbsp; &nbsp;  $\rm M$ &nbsp; &#8594; &nbsp; <b>4</b>,&nbsp; &nbsp;  $\rm S$ &nbsp; &#8594; &nbsp; <b>5</b>.
-* In gleicher Weise fährt man fort, bis der gesamte Eingangstext abgearbeitet ist.&nbsp; Steht ein Zeichen bereits an Position&nbsp; <b>0</b>, so ist keine Neusortierung erforderlich.
+* Continue in the same way until the entire input text has been processed.&nbsp; If a character is already at position&nbsp; <b>0</b>, no reordering is necessary.
+Hints:
+*The task belongs to the chapter&nbsp;  [[Information_Theory/Weitere_Quellencodierverfahren|Other source coding methods]].
+*In particular, reference is made to the page&nbsp;  [[Information_Theory/Weitere_Quellencodierverfahren#Burrows.E2.80.93Wheeler_transformation|Burrows&ndash;Wheeler Transformation]].
+*Information on the Huffman code can be found in the chapter&nbsp;  [[Information_Theory/Entropiecodierung_nach_Huffman|Entropy Coding according to Huffman]].&nbsp;  This information is not necessary for the solution of the task.
-''Hinweise:''
+===Questions===
-*Die Aufgabe gehört zum  Kapitel&nbsp;  [[Information_Theory/Weitere_Quellencodierverfahren|Weitere Quellencodierverfahren]].
-*Insbesondere wird  Bezug genommen auf die Seite&nbsp;  [[Information_Theory/Weitere_Quellencodierverfahren#Burrows.E2.80.93Wheeler.E2.80.93Transformation|Burrows&ndash;Wheeler&ndash;Transformation]].
-*Informationen zum Huffman&ndash;Code finden Sie im Kapitel&nbsp;  [[Information_Theory/Entropiecodierung_nach_Huffman|Entropiecodierung nach Huffman]].&nbsp;  Für die Lösung der Aufgabe sind diese Informationen nicht erforderlich.
-===Fragebogen===
 <quiz display=simple>

	Die Eingangszeichenmenge ist $\{ \hspace{0.05cm}\rm D,\hspace{0.05cm} E,\hspace{0.05cm} I,\hspace{0.05cm} M,\hspace{0.05cm} N , \hspace{0.05cm} S \}$.
	Die Ausgangszeichenmenge ist $\{ \hspace{0.05cm}\rm D,\hspace{0.05cm} E,\hspace{0.05cm} I,\hspace{0.05cm} M,\hspace{0.05cm} N , \hspace{0.05cm} S \}$.
	In der Ausgangsfolge treten alle $M = 6$ Zeichen gruppiert auf.

	Die Ausgangszeichenmenge ist $\{ \hspace{0.05cm}\rm D,\hspace{0.05cm} E,\hspace{0.05cm} I,\hspace{0.05cm} M,\hspace{0.05cm} N , \hspace{0.05cm} S \}$.
	Die Ausgangszeichenmenge ist $\{ \hspace{0.05cm}\rm 0,\hspace{0.05cm} 1,\hspace{0.05cm} 2,\hspace{0.05cm} 3,\hspace{0.05cm} 4 , \hspace{0.05cm} 5 \}$.
	Die MTF–Ausgangsfolge hat die Länge $N = 12$.

	Der Eingangswert $0$ erfährt eine Sonderbehandlung.
	Je häufiger eine $0$ auftritt, um so effektiver ist dieser Block.
	Am besten wäre $\rm Pr(0) ≈ Pr(1) ≈ \text{...} ≈ Pr(5)$.

	Die Ausgangsfolge ist binär.
	Er bewirkt eine möglichst kleine mittlere Codewortlänge.
	Die Dimensionierung richtet sich nach den anderen Blöcken.

	$\rm 230000100405$,
	$\rm 445340045001$,
	$\rm 543120345123$.