Exercise 2.6: About the Huffman Coding

From LNTwww
 
===Solution===

{{ML-Kopf}}
'''(1)'''&nbsp; Before the Huffman algorithm, the symbols must be sorted according to their probabilities of occurrence.
*Since the two least likely symbols are already combined in the first step, the probabilities of occurrence decrease from top to bottom <br>(from left to right in the diagram for this sample solution).
*By comparison with the exercise sheet, one obtains:
  
:Symbol $\rm A$:&nbsp;  <u>Input 7</u>,  &nbsp; &nbsp; Symbol $\rm B$:&nbsp;  <u>Input 6</u>, &nbsp; &nbsp; Symbol $\rm C$:&nbsp;  <u>Input 3</u>, &nbsp; &nbsp; Symbol $\rm D$:&nbsp;  <u>Input 1</u>.
  
[[File:EN_Inf_Z_2_6b.png|right|frame|Tree diagram adapted to the exercise]]
  
  
  
'''(2)'''&nbsp; The node&nbsp; $\rm R$&nbsp; is the tree root&nbsp; (<i>root</i>).&nbsp; It is always assigned&nbsp; $\underline{R=1}$, regardless of the occurrence probabilities.

The following applies to the other values:
  
:Step 1:&nbsp;&nbsp;&nbsp; $p_{\rm U} = p_{\rm A} + p_{\rm H} = 0.04 + 0.03 \hspace{0.15cm}\underline{ =0.07}$,

:Step 2:&nbsp;&nbsp;&nbsp; $p_{\rm V} = p_{\rm U} + p_{\rm B} = 0.07 + 0.08 \hspace{0.15cm}\underline{ =0.15}$,

:Step 3:&nbsp;&nbsp;&nbsp; $p_{\rm W} = p_{\rm F} + p_{\rm G} = 0.12 + 0.10 \hspace{0.15cm}\underline{ =0.22}$,

:Step 4:&nbsp;&nbsp;&nbsp; $p_{\rm X} = p_{\rm V} + p_{\rm C} = 0.15 + 0.14  \hspace{0.15cm}\underline{ =0.29}$,

:Step 5:&nbsp;&nbsp;&nbsp; $p_{\rm Y} = p_{\rm W} + p_{\rm E} = 0.22 + 0.24  \hspace{0.15cm}\underline{ =0.46}$,

:Step 6:&nbsp;&nbsp;&nbsp; $p_{\rm Z} = p_{\rm X} + p_{\rm D} = 0.29 + 0.25  \hspace{0.15cm}\underline{ =0.54}$.
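The six merging steps can be reproduced with a few lines of Python (a minimal check sketch, not part of the original solution; the node names follow the tree diagram):

```python
# Occurrence probabilities from the exercise
p = {'A': 0.04, 'B': 0.08, 'C': 0.14, 'D': 0.25,
     'E': 0.24, 'F': 0.12, 'G': 0.10, 'H': 0.03}

# Each Huffman step merges the two least probable remaining nodes
p['U'] = p['A'] + p['H']   # step 1: 0.07
p['V'] = p['U'] + p['B']   # step 2: 0.15
p['W'] = p['F'] + p['G']   # step 3: 0.22
p['X'] = p['V'] + p['C']   # step 4: 0.29
p['Y'] = p['W'] + p['E']   # step 5: 0.46
p['Z'] = p['X'] + p['D']   # step 6: 0.54
p['R'] = p['Y'] + p['Z']   # root:   1.00

for node in 'UVWXYZR':
    print(node, round(p[node], 2))
```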
 
<br clear=all>
'''(3)'''&nbsp; The Huffman code for symbol&nbsp; $\rm A$&nbsp; is obtained by tracing the path from the&nbsp; $\rm root$&nbsp; (yellow dot) to symbol&nbsp; $\rm A$&nbsp; and assigning a&nbsp; <b>1</b>&nbsp; to each red connecting line and a&nbsp; <b>0</b>&nbsp; to each blue one.
  
* Symbol $\rm A$:&nbsp;&nbsp;&nbsp; red&ndash;red&ndash;red&ndash;blue&ndash;red &#8594; <b><u>11101</u></b>,

* Symbol $\rm B$:&nbsp;&nbsp;&nbsp; red&ndash;red&ndash;red&ndash;red &#8594; <b><u>1111</u></b>,

* Symbol $\rm C$:&nbsp;&nbsp;&nbsp; red&ndash;red&ndash;blue &#8594; <b><u>110</u></b>,

* Symbol $\rm D$:&nbsp;&nbsp;&nbsp; red&ndash;blue &#8594; <b><u>10</u></b>,

* Symbol $\rm E$:&nbsp;&nbsp;&nbsp; blue&ndash;red &#8594; <b>01</b>,

* Symbol $\rm F$:&nbsp;&nbsp;&nbsp; blue&ndash;blue&ndash;red &#8594; <b>001</b>,

* Symbol $\rm G$:&nbsp;&nbsp;&nbsp; blue&ndash;blue&ndash;blue &#8594; <b>000</b>,

* Symbol $\rm H$:&nbsp;&nbsp;&nbsp; red&ndash;red&ndash;red&ndash;blue&ndash;blue &#8594; <b>11100</b>.
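Since no codeword is the prefix of another, a receiver can decode a bit stream greedily, bit by bit. A small sketch (the code table is taken from the list above; the example bit sequence is chosen arbitrarily for illustration):

```python
# Huffman code table from the solution above
codes = {'A': '11101', 'B': '1111', 'C': '110', 'D': '10',
         'E': '01', 'F': '001', 'G': '000', 'H': '11100'}

def decode(bits, codes):
    """Greedy prefix-free decoding: emit a symbol whenever the
    collected bits form a complete codeword."""
    inverse = {c: s for s, c in codes.items()}
    symbols, buffer = [], ''
    for b in bits:
        buffer += b
        if buffer in inverse:
            symbols.append(inverse[buffer])
            buffer = ''
    return ''.join(symbols)

print(decode('10' + '01' + '11101', codes))   # DEA
```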
  
  
'''(4)'''&nbsp; The coding in part&nbsp; '''(3)'''&nbsp; has shown that

* the symbols&nbsp; $\rm D$&nbsp; and&nbsp; $\rm E$&nbsp; with two bits,

* the symbols&nbsp; $\rm C$,&nbsp; $\rm F$&nbsp; and&nbsp; $\rm G$&nbsp; with three bits,

* the symbol&nbsp; $\rm B$&nbsp;  with four bits,&nbsp; and

* the symbols&nbsp; $\rm A$&nbsp; and&nbsp; $\rm H$&nbsp; each with five bits


are represented.&nbsp; Thus, one obtains for the mean codeword length (in "bit/source symbol"):
 
:$$L_{\rm M} \hspace{0.2cm} =  \hspace{0.2cm}  (p_{\rm D} + p_{\rm E}) \cdot 2 + (p_{\rm C} + p_{\rm F} + p_{\rm G}) \cdot 3  + p_{\rm B} \cdot 4 +(p_{\rm A} + p_{\rm H}) \cdot 5 =  0.49 \cdot 2 + 0.36 \cdot 3 +0.08 \cdot 4 +0.07 \cdot 5 \hspace{0.15cm}\underline{= 2.73}\hspace{0.05cm}.$$
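This weighted sum can be reproduced directly from the individual codeword lengths (a quick numerical check in Python, not part of the original solution):

```python
# Occurrence probabilities and codeword lengths from parts (1) and (3)
probs   = {'A': 0.04, 'B': 0.08, 'C': 0.14, 'D': 0.25,
           'E': 0.24, 'F': 0.12, 'G': 0.10, 'H': 0.03}
lengths = {'A': 5, 'B': 4, 'C': 3, 'D': 2,
           'E': 2, 'F': 3, 'G': 3, 'H': 5}

# Mean codeword length: L_M = sum over all symbols of p_mu * L_mu
L_M = sum(probs[s] * lengths[s] for s in probs)
print(f"L_M = {L_M:.2f} bit/source symbol")   # L_M = 2.73 bit/source symbol
```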
  
  
'''(5)'''&nbsp; Only <u>answer 1</u> is correct:
*The mean codeword length&nbsp; $L_{\rm M}$&nbsp; cannot be smaller than the source entropy&nbsp; $H$.&nbsp; This eliminates answers 2 and 3.
*From the given occurrence probabilities, the source entropy can indeed be calculated as&nbsp; $H = 2.71$ bit/source symbol.
*It can be seen that this Huffman coding almost reaches the given limit&nbsp; (source coding theorem)&nbsp; $L_{\rm M, \ min } \ge H = 2.71$&nbsp; bit/source symbol.
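The entropy can be checked numerically (a minimal sketch, not part of the original solution):

```python
import math

# Occurrence probabilities from the exercise
p = [0.04, 0.08, 0.14, 0.25, 0.24, 0.12, 0.10, 0.03]

# Source entropy: H = -sum over all symbols of p_mu * log2(p_mu)
H = -sum(x * math.log2(x) for x in p)
print(f"H = {H:.2f} bit/source symbol")   # just below L_M = 2.73
```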
 
{{ML-Fuß}}
  

Revision as of 22:39, 2 August 2021

Tree diagram of the Huffman method

We consider here a source symbol sequence  $\langle q_\nu \rangle$  with symbol range  $M = 8$:

$$q_{\nu} = \{ \hspace{0.05cm}q_{\mu} \} = \{ \boldsymbol{\rm A} \hspace{0.05cm},\hspace{0.05cm} \boldsymbol{\rm B}\hspace{0.05cm},\hspace{0.05cm} \boldsymbol{\rm C}\hspace{0.05cm},\hspace{0.05cm} \boldsymbol{\rm D}\hspace{0.05cm},\hspace{0.05cm} \boldsymbol{\rm E}\hspace{0.05cm},\hspace{0.05cm} \boldsymbol{\rm F}\hspace{0.05cm},\hspace{0.05cm} \boldsymbol{\rm G}\hspace{0.05cm},\hspace{0.05cm} \boldsymbol{\rm H}\hspace{0.05cm} \}\hspace{0.05cm}.$$

If the symbols are equally probable, i.e.  $p_{\rm A} = p_{\rm B} =$ ... $ = p_{\rm H} = 1/M$, then source coding makes no sense.  Already with the dual code  $\rm A$  →  000,  $\rm B$  →  001, ... ,  $\rm H$  →  111, the mean codeword length  $L_{\rm M}$  reaches its lower bound  $H$  according to the source coding theorem  $(H$  here denotes the source entropy$)$:

$$L_{\rm M,\hspace{0.08cm}min} = H = 3 \hspace{0.15cm}{\rm bit/source symbol} \hspace{0.05cm}.$$

However, let the symbol probabilities in this task be given as follows:

$$p_{\rm A} \hspace{-0.05cm}= \hspace{-0.05cm} 0.04 \hspace{0.05cm},\hspace{0.1cm}p_{\rm B} \hspace{-0.05cm}= \hspace{-0.05cm} 0.08 \hspace{0.05cm},\hspace{0.1cm}p_{\rm C} \hspace{-0.05cm}= \hspace{-0.05cm} 0.14 \hspace{0.05cm},\hspace{0.1cm} p_{\rm D} \hspace{-0.05cm}= \hspace{-0.05cm} 0.25 \hspace{0.05cm},\hspace{0.1cm} p_{\rm E} \hspace{-0.05cm}= \hspace{-0.05cm} 0.24 \hspace{0.05cm},\hspace{0.1cm}p_{\rm F} \hspace{-0.05cm}= \hspace{-0.05cm} 0.12 \hspace{0.05cm},\hspace{0.1cm}p_{\rm G} \hspace{-0.05cm}= \hspace{-0.05cm} 0.10 \hspace{0.05cm},\hspace{0.1cm} p_{\rm H} \hspace{-0.05cm}= \hspace{-0.05cm} 0.03 \hspace{0.05cm}.$$
  • So here we have a redundant message source that can be compressed by Huffman coding.
  • This algorithm was published by  David Albert Huffman  – shortly after Shannon's groundbreaking work on information theory – and allows the construction of optimal prefix-free codes.


The algorithm is given here without derivation and proof, whereby we restrict ourselves to binary codes   ⇒   the code symbol sequence consists only of zeros and ones:

(1)   Order the symbols according to decreasing probabilities of occurrence.
(2)   Combine the two least likely symbols into a new symbol.
(3)   Repeat steps  (1)  and  (2), until only two (combined) symbols remain.
(4)   Assign the binary value  1  to the more probable group of symbols and  0  to the other group.
(5)   Step by step (from bottom to top), prepend  1  or  0  to the partial codes of the split groups.


This algorithm is often illustrated by a tree diagram. The diagram above shows this for the case at hand.
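The five steps above can also be sketched in Python (a minimal illustration, not part of the original exercise; a heap replaces the explicit re-sorting, and, as in step (4), the more probable of the two merged groups receives a 1):

```python
import heapq

# Symbol probabilities from the exercise
probs = {'A': 0.04, 'B': 0.08, 'C': 0.14, 'D': 0.25,
         'E': 0.24, 'F': 0.12, 'G': 0.10, 'H': 0.03}

def huffman_codes(probs):
    """Build a binary Huffman code; returns {symbol: bit string}."""
    # Heap entries: (probability, tie-breaker, {symbol: partial code})
    heap = [(p, i, {s: ''}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # least probable group
        p2, _, c2 = heapq.heappop(heap)   # second least probable group
        # Prepend 0 to the less probable group, 1 to the more probable one
        merged = {s: '0' + code for s, code in c1.items()}
        merged.update({s: '1' + code for s, code in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes(probs)
for s in sorted(codes):
    print(s, codes[s])
```

For these probabilities no ties occur during the merges, so the codeword lengths are uniquely determined; with a different tie-breaking rule only the bit patterns, not the lengths, could change.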

You have the following tasks:

(a)   Assign the symbols  $\rm A$, ... , $\rm H$  to the inputs labelled  [1], ... , [8].
(b)   Determine the sum probabilities  $U$, ... ,  $Z$  and  $R$  (root).
(c)   Assign the symbols  $\rm A$, ... ,  $\rm H$  to the corresponding Huffman binary sequences. A red connection corresponds to a  1, a blue connection to a  0.

You will notice that the mean codeword length

$$L_{\rm M} = \sum_{\mu = 1}^{M}\hspace{0.05cm} p_{\mu} \cdot L_{\mu} $$

for Huffman coding is only slightly larger than the source entropy  $H$.  In this equation, the following values apply to the present case:

  • $M = 8$,  $p_1 = p_{\rm A}$,  ... ,  $p_8 = p_{\rm H}$.
  • The respective number of bits of the code symbols for  $\rm A$, ... ,  $\rm H$  is denoted by  $L_1$, ... ,  $L_8$ .






Questions

1   Which inputs in the tree diagram stand for

    Input number $ \ = \ $   ⇒   symbol $\rm A$
    Input number $ \ = \ $   ⇒   symbol $\rm B$
    Input number $ \ = \ $   ⇒   symbol $\rm C$
    Input number $ \ = \ $   ⇒   symbol $\rm D$

2   What numerical values (probabilities) should be at the nodes in the tree diagram?

    Probability $ \ = \ $   at node $\rm U$
    Probability $ \ = \ $   at node $\rm V$
    Probability $ \ = \ $   at node $\rm W$
    Probability $ \ = \ $   at node $\rm X$
    Probability $ \ = \ $   at node $\rm Y$
    Probability $ \ = \ $   at node $\rm Z$
    Probability $ \ = \ $   at the $\rm root$

3   Which binary codes (represented with zeros and ones) result for

    Binary code $ \ = \ $   ⇒   symbol $\rm A$
    Binary code $ \ = \ $   ⇒   symbol $\rm B$
    Binary code $ \ = \ $   ⇒   symbol $\rm C$
    Binary code $ \ = \ $   ⇒   symbol $\rm D$

4   What is the mean codeword length?

    $L_{\rm M} \ = \ $   $\ \rm bit/source symbol$

5   What is the source entropy  $H$?

    $H = 2.71\ \rm bit/source symbol.$
    $H = 2.75\ \rm bit/source symbol.$
    $H = 3.00\ \rm bit/source symbol.$

