Difference between revisions of "Information Theory/Some Preliminary Remarks on Two-Dimensional Random Variables"

Latest revision as of 16:58, 16 February 2023

1 # OVERVIEW OF THE THIRD MAIN CHAPTER #
2 Introductory example on the statistical dependence of random variables
3 Prerequisites and nomenclature
4 Probability mass function and probability density function
5 Probability mass function and entropy
6 Informational divergence - Kullback-Leibler distance
7 Joint probability and joint entropy
8 Exercises for the chapter
9 References

# OVERVIEW OF THE THIRD MAIN CHAPTER #

The focus of this third main chapter is the »mutual information« $I(X; Y)$ between two random variables $X$ and $Y$. With statistical dependence, $I(X; Y)$ is smaller than the individual entropies $H(X)$ or $H(Y)$.

For example, the uncertainty regarding the random variable $X$ ⇒ entropy $H(X)$ is reduced by the knowledge of $Y$, by the magnitude $H(X\hspace{0.03cm}|\hspace{0.03cm}Y)$ ⇒ conditional entropy of $X$, if $Y$ is known. The remaining residue is the mutual information

$$I(X; Y)= H(X) - H(X\hspace{0.03cm}|\hspace{0.03cm}Y).$$

At the same time, however:

$$I(X; Y) = H(Y) - H(Y\hspace{0.03cm}|\hspace{0.03cm}X).$$

The semicolon indicates that the two random variables $X$ and $Y$ under consideration are on an equal footing.

In detail, the third main chapter deals with

the relationship between probability and entropy for »2D random variables«,
the calculation of the »informational divergence«, also known as the »Kullback–Leibler distance«,
the definition of the »joint entropy« $H(XY)$ and the »conditional entropies« $H(X\hspace{0.03cm}|\hspace{0.03cm}Y)$ and $H(Y\hspace{0.03cm}|\hspace{0.03cm}X)$,
the »mutual information« $I(X; Y)$ between two random variables,
the »information theory of digital signal transmission« and the corresponding model,
the definition and meaning of the »channel capacity« and its connection with the mutual information,
the capacity calculation for »digital memoryless channels« such as BSC, BEC and BSEC,
the »Channel Coding Theorem«, one of the highlights of Shannon's information theory.

Introductory example on the statistical dependence of random variables

Result protocol of our random experiment "Rolling with two dice"

$\text{Example 1:}$ We start from the experiment "Rolling with two dice", where both dice are distinguishable by colour. The table shows the results of the first $N = 18$ pairs of throws of this exemplary random experiment.
According to the nomenclature explained in the "following section" $R_ν$, $B_ν$ and $S_ν$ are here to be understood as random variables:

For example, the random variable $R_3 \in \{1, \ 2, \ 3, \ 4, \ 5, \ 6\}$ indicates the number of points of the red cube on the third throw as a probability event. The specification $R_3 = 6$ states that in the documented realization the red cube showed a "6" in the third throw.

In line 2, the results of the red cube $(R)$ are indicated. The mean value of this limited sequence $〈R_1$, ... , $R_{18}〉$ is with $3.39$ smaller than the expected value ${\rm E}\big[R\big] = 3.5$.
Line 3 shows the results of the blue cube $(B)$. The sequence $〈B_1$, ... , $B_{18}〉$ has a slightly larger mean value of $3.61$ than the unlimited sequence ⇒ expected value ${\rm E}\big[B\big] = 3.5$.
Line 4 contains the sum $S_ν = R_ν + B_ν$. The mean value of the sequence $〈S_1$, ... , $S_{18}〉$ is $3.39 + 3.61 = 7$. This is here (only by chance) equal to the expected value $\text{E}\big[S\big] = \text{E}\big[R\big] + \text{E}\big[B\big]$.

Now the question arises between which random variables there are statistical dependencies:

If one assumes fair dice, there are no statistical dependencies between the sequences $〈 R\hspace{0.05cm} 〉$ and $〈B \hspace{0.05cm}〉$ – whether bounded or unbounded: Even if one knows $R_ν$ for $B_ν$ all possible results $(1$, ... , $6)$ are equally probable.
If one knows $S_ν$, however, statements about $R_ν$ as well as about $B_ν$ are possible. From $S_{11} = 12$ follows directly $R_{11} = B_{11} = 6$ and the sum $S_{15} = 2$ of two dice is only possible with $R_{15} = B_{15} = 1$. Such dependencies are called »deterministic«.
From $S_7 = 10$, at least ranges for $R_7$ and $B_7$ can be given: $R_7 ≥ 4, \ B_7 ≥ 4$. Only three pairs are possible: $(R_7 = 4) ∩ (B_7 = 6)$, $(R_7 = 5) ∩ (B_7 = 5)$, $(R_7 = 6) ∩ (B_7 = 4)$. Here there is no deterministic relationship between the variables $S_ν$ and $R_ν$ $($or $B_ν)$, but rather a so-called »statistical dependence«.
Such statistical dependencies exist for $S_ν ∈ \{3, \ 4, \ 5, \ 6, \ 8, \ 9, \ 10, \ 11\}$. On the other hand, if the sum $S_ν = 7$, one cannot infer $R_ν$ and $B_ν$ from this. For both dice, all possible numbers $1$, ... , $6$ are equally probable. In this case, there are also no statistical dependencies between $S_ν$ and $R_ν$ or between $S_ν$ and $B_ν$.

Prerequisites and nomenclature

Throughout this chapter, we consider discrete random variables of the form $X = \{ x_1, \ x_2, \hspace{0.05cm}$ ... $\hspace{0.05cm},\ x_{\mu},\hspace{0.05cm}$ ... $\hspace{0.05cm},\ x_M \} \hspace{0.05cm},$ and use the following nomenclature:

The random variable itself is always denoted by a capital letter. The lower case letter $x$ indicates a possible realization of the random variable $X$.
All realizations $x_μ$ $($with $μ = 1$, ... , $M)$ are real-valued. $M$ indicates the "symbol set size" or "alphabet size" of $X$. Instead of $M$, we sometimes also use $|X|$.

Relationship between the probability space ${\it \Omega}$
and the random variable $X$

The random variable $X$ can, for example, be created by the transformation ${\it \Omega} → X$ , where ${\it \Omega}$ stands for the probability space of a random experiment.

The diagram illustrates such a transformation:

$${\it \Omega} = \{ \omega_1, \omega_2, \omega_3, ... \hspace{0.15cm} \} \hspace{0.25cm} \longmapsto \hspace{0.25cm} X = \{ x_1, \ x_2, \ x_3, \ x_4\} \subset \cal{R}\hspace{0.05cm}.$$

Each random event $ω_i ∈ Ω$ is uniquely assigned to a real numerical value $x_μ ∈ X ⊂ \cal{R}$.
In the example considered, the running variable is $1 ≤ μ ≤ 4$, i.e. the symbol set size is $M = |X| = 4$.
However, the figure is not one-to-one: The realization $x_3 ∈ X$ could have resulted from the elementary event $ω_4$ in the example, but also from $ω_6$ $($or from some other of the infinitely many elementary events $ω_i$ not drawn in the diagram).

$\text{Agreement:}$ Often one refrains from indexing both the elementary events $ω_i$ and the realizations $x_μ$. This results in the following shorthand notations, for example:

$$ \{ X = x \} \hspace{0.05cm} \equiv \hspace{0.05cm} \{ \omega \in {\it \Omega} : \hspace{0.4cm} X(\omega) = x \} \hspace{0.05cm},$$

$$ \{ X \le x \} \hspace{0.05cm} \equiv \hspace{0.05cm} \{ \omega \in {\it \Omega} : \hspace{0.4cm} X(\omega) \le x \} \hspace{0.05cm}.$$

With this agreement, the probabilities of the discrete random variable $X$ are:

$${\rm Pr}( X = x_{\mu}) = \hspace{-0.2cm} \sum_{\omega \hspace{0.1cm} \in \{ X = x_{\mu} \} } \hspace{-0.2cm}{\rm Pr} \left ( \{ \omega \} \right ) \hspace{0.05cm}.$$

Probability mass function and probability density function

$\text{Definition:}$ If the $M$ probabilities of a discrete random variable $X$ ⇒ ${\rm Pr}( X = x_{\mu})$ are combined in a similar way to a vector,
we arrive at the »probability mass function« $\rm (PMF)$:

$$P_X(X) = \big [ \hspace{0.02cm} P_X(x_1), P_X(x_2), \hspace{0.05cm}\text{...} \hspace{0.15cm}, P_X(x_{\mu}),\hspace{0.05cm} \text{...}\hspace{0.15cm}, P_X(x_M) \hspace{0.02cm} \big ] \hspace{0.05cm}.$$

The $μ$–th element of this "vector" indicates the probability $P_X(x_{\mu}) = {\rm Pr}( X = x_{\mu}) $.

In the book "Theory of Stochastic Signals", we defined a similar descriptive quantity with the $\text{probability density function}$ $(\rm PDF)$ and designated it as $f_X(x)$.

It should be noted, however:

The PDF is more suitable for characterizing continuous random variables, such as a $\text{Gaussian distribution}$ or a $\text{uniform distribution}$. Only through the use of $\text{Dirac delta functions}$ does the PDF also become applicable for discrete random variables.
The PMF provides less information about the random variable than the PDF and can also only be specified for discrete variables. However, for the discrete information theory considered in this chapter, the PMF is sufficient.

$\text{Example 2:}$ We consider a probability density function $\rm (PDF)$ without much practical relevance:

$$f_X(x) = 0.2 \cdot \delta(x+2) + 0.3 \cdot \delta(x - 1.5)+0.5 \cdot \delta(x - {\rm \pi}) \hspace{0.05cm}. $$

Thus, for the discrete random variable $x ∈ X = \{–2,\ +1.5,\ +\pi \} $ ⇒ symbol set size $M = \vert X \vert = 3$, the probability function $\rm (PMF)$ is:

$$P_X(X) = \big [ \hspace{0.1cm}0.2\hspace{0.05cm}, 0.3\hspace{0.05cm}, 0.5 \hspace{0.1cm} \big] \hspace{0.05cm}. $$

It can be seen:

The $\rm PMF$ only provides information about the probabilities $\text{Pr}(x_1)$, $\text{Pr}(x_2)$ and $\text{Pr}(x_3)$.
From the $\rm PDF$, on the other hand, the possible realizations $x_1$, $x_2$ and $x_3$ of the random variable $X$ can also be read.
The only requirement for the random variable is that it is real-valued.
The possible values $x_μ$ do not have to be positive, integer, equidistant or rational.

Probability mass function and entropy

In discrete information theory in contrast to transmission problems, knowledge of the probability mass function $P_X(X)$ is sufficient, e.g. to calculate the $\text{entropy}$.

$\text{Definition:}$ The $\rm entropy$ of a discrete random variable $X$ – i.e. its uncertainty for an observer - can be represented with the PMF $P_X(X)$ as follows:

$$H(X) = {\rm E} \big [ {\rm log} \hspace{0.1cm} \frac{1}{P_X(X)}\big ] \hspace{0.05cm}=\hspace{0.05cm} - {\rm E} \big [ {\rm log} \hspace{0.1cm} {P_X(X)}\big ] \hspace{0.05cm}=\hspace{0.05cm} \sum_{\mu = 1}^{M} P_X(x_{\mu}) \cdot {\rm log} \hspace{0.1cm} \frac{1}{P_X(x_{\mu})} \hspace{0.05cm}=\hspace{0.05cm} - \sum_{\mu = 1}^{M} P_X(x_{\mu}) \cdot {\rm log} \hspace{0.1cm} {P_X(x_{\mu})} \hspace{0.05cm}.$$

If you use the logarithm to base $2$, i.e. $\log_2$ (...) ⇒ "binary logarithm", the numerical value is provided with the pseudo-unit "bit". $\rm E\big[$...$\big]$ indicates the expected value.

For example, one obtains

with $P_X(X) = \big [\hspace{0.02cm}0.2, \ 0.3, \ 0.5 \hspace{0.02cm}\big ]$:

$$H(X) = 0.2 \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{0.2} + 0.3 \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{0.3} +0.5 \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{0.5} \approx 1.485\hspace{0.15cm}{\rm bit},$$

with $P_X(X) = \big [\hspace{0.02cm}1/3, \ 1/3, \ 1/3\hspace{0.02cm}\big ]$:

$$H(X) = 3 \cdot 1/3 \cdot {\rm log}_2 \hspace{0.1cm} (3) = {\rm log}_2 \hspace{0.1cm} (3) \approx 1.585\hspace{0.15cm}{\rm bit}.$$

The second example provides the maximum of the entropy function for the symbol set size $M = 3$.

$\text{Derivation:}$ For general $M$, this result can be derived e.g. as follows – see [Meck]^[1]:

$$H(X) = -{\rm E} \big [ {\rm log} \hspace{0.1cm} {P_X(X)}\big ] \hspace{0.2cm} \le \hspace{0.2cm}- {\rm log} \big [ {\rm E} \hspace{0.1cm} \left [{P_X(X)}\right ] \big ] \hspace{0.05cm}.$$

This estimation $($»Jensens's inequality«$)$ is admissible because the logarithm is a concave function. According to "Exercise 3.2" , the following holds:

$$- {\rm E} \big [ {P_X(X)}\big ] \hspace{0.1cm} \le \hspace{0.1cm} M \hspace{0.3cm} \Rightarrow \hspace{0.3cm} H(X) \le {\rm log} \hspace{0.1cm} (M) \hspace{0.05cm}.$$

The equal sign results according to the calculation above for equal probabilities, i.e. for $P_X(x_μ) = {1}/{M}$ for all $μ$. In "Exercise 3.3", the same situation is to be proved using the estimate "${\rm ln} \hspace{0.1cm} (x) \le x-1$". The equal sign applies here only for $x = 1$.

If one of the $M$ probabilities $P_X(x_μ)$ of the PMF is equal to zero, a tighter bound can be given for the entropy:

$$H(X) \le {\rm log} \hspace{0.1cm} (M-1) \hspace{0.05cm}.$$

$\text{Agreement:}$ In the following example and on the next sections we use the following »nomenclature«:

The entropy $H(X)$ always refers to the actual probability mass function $P_X(X)$ of the discrete random variable. Experimentally, these quantities are obtained only after $N → ∞$ trials.
If the PMF is determined from a finite random sequence, we denote this probability mass function by $Q_X(X)$ and add „$N =$ ...” to the resulting entropy.
This entropy approximation is not based on probabilities, but only on the $\text{relative frequencies}$. Only for $N → ∞$ does this approximation agree with $H(X)$ .

Probability mass functions of our dice experiment

$\text{Example 3:}$ We return to our "dice experiment".

The table shows the probability mass functions $P_R(R)$ and $P_B(B)$ for the red and blue dice as well as the approximations $Q_R(R)$ and $Q_B(B)$, in each case based on the random experiment with $N = 18$ throws.
The relative frequencies $Q_R(R)$ and $Q_B(B)$ result from the $\text{exemplary random sequences}$ of $\text{Example 1}$.

The following applies to the random variable $R$ with the "binary logarithm" $($to base $2)$:

$$H(R) = H(R) \big \vert_{N \hspace{0.05cm}\rightarrow \hspace{0.05cm}\infty} = \sum_{\mu = 1}^{6} 1/6 \cdot {\rm log}_2 \hspace{0.1cm} (6) = {\rm log}_2 \hspace{0.1cm} (6) = 2.585\hspace{0.1cm} {\rm bit} ,$$

$$H(R) \big \vert_{N \hspace{0.05cm} = \hspace{0.05cm}18} = 2 \cdot \frac{2}{18}\cdot {\rm log}_2 \hspace{0.1cm} \frac{18}{2} \hspace{0.1cm} +\hspace{0.1cm} 2 \cdot \frac{3}{18}\cdot {\rm log}_2 \hspace{0.1cm} \frac{18}{3} \hspace{0.1cm} +\hspace{0.1cm} 2 \cdot \frac{4}{18}\cdot {\rm log}_2 \hspace{0.1cm} \frac{18}{4} \hspace{0.1cm}= 2.530\hspace{0.1cm} {\rm bit}.$$

The blue cube of course has the same entropy: $H(B) = H(R) = 2.585\ \rm bit$. Here we get a slightly larger value for the approximation based on $N = 18$ , since according to the table above $Q_B(B)$ deviates less from the discrete uniform distribution $P_B(B)$ than als $Q_R(R)$ from $P_R(R)$.

$$H(B) \big \vert_{N \hspace{0.05cm} = \hspace{0.05cm}18} = 1 \cdot \frac{2}{18}\cdot {\rm log}_2 \hspace{0.1cm} \frac{18}{2} \hspace{0.1cm} +\hspace{0.1cm} 4 \cdot \frac{3}{18}\cdot {\rm log}_2 \hspace{0.1cm} \frac{18}{3} \hspace{0.1cm} +\hspace{0.1cm} 1 \cdot \frac{4}{18}\cdot {\rm log}_2 \hspace{0.1cm} \frac{18}{4} \hspace{0.1cm}= 2.558\hspace{0.1cm} {\rm bit} .$$

It can be seen from the given numerical values that despite the experimental parameter $N$, which is here much too small, the deviation with regard to entropy is not very large.

It should be mentioned again that with finite $N$ the following always applies:

$$ H(R) \big \vert_{N } < H(R) = {\rm log}_2 \hspace{0.1cm} (6) \hspace{0.05cm}, \hspace{0.5cm} H(B) \big \vert_{N } < H(B) = {\rm log}_2 \hspace{0.1cm} (6)\hspace{0.05cm}.$$

Informational divergence - Kullback-Leibler distance

We consider two probability mass functions $P_X(·)$ and $P_Y(·)$ over the same alphabet $X = \{ x_1, \ x_2$, ... , $x_M \}$, and now define the following quantity:

$\text{Definition:}$ The »informational divergence« between the random variables defined by $P_X(·)$ and $P_Y(·)$ is given as follows:

$$D(P_X \hspace{0.05cm} \vert \vert \hspace{0.05cm}P_Y) = {\rm E} \left [ {\rm log} \hspace{0.1cm} \frac{P_X(X)}{P_Y(X)}\right ] \hspace{0.2cm}=\hspace{0.2cm} \sum_{\mu = 1}^{M} P_X(x_{\mu}) \cdot {\rm log} \hspace{0.1cm} \frac{P_X(x_{\mu})}{P_Y(x_{\mu})} \hspace{0.05cm}.$$

$D(P_X \vert \vert P_Y)$ is also called »Kullback–Leibler distance« .

This provides a measure of the "similarity" between the two probability functions $P_X(·)$ and $P_Y(·)$.

When using the logarithm to base $2$ the pseudo-unit "bit" must be added again.

Similarly, a second variant of the Kullback-Leibler distance can be given:

$$D(P_Y \hspace{0.05cm} || \hspace{0.05cm}P_X) = {\rm E} \left [ {\rm log} \hspace{0.1cm} \frac{P_Y(X)}{P_X(X)}\right ] \hspace{0.2cm}=\hspace{0.2cm} \sum_{\mu = 1}^{M} P_Y(x_{\mu}) \cdot {\rm log} \hspace{0.1cm} \frac{P_Y(x_{\mu})}{P_X(x_{\mu})} \hspace{0.05cm}.$$

Compared to the first variant, each function $P_X(·)$ is now replaced by $P_Y(·)$ and vice versa. Since in general $D(P_X || P_Y)$ and $D(P_Y || P_X)$ differ, the term "distance" is actually misleading. However, we want to leave it at this naming.

If we evaluate the two equations above, we recognize the following properties:

If the same distribution is present ⇒ $P_Y(·) ≡ P_X(·)$, then $D(P_X || P_Y) = 0$. In all other cases $D(P_X || P_Y) > 0$. The same applies to the variant $D(P_Y || P_X)$.
If $P_X(x_μ) ≠ 0$ and $P_Y(x_μ) = 0$ $($a single and arbitrary $μ$ is sufficient for this$)$, the Kullback-Leibler distance $D(P_X || P_Y)$ has an infinitely large value. In this case,
$D(P_Y || P_X)$ is not necessarily infinite either.
This statement makes it clear once again that in general $D(P_X || P_Y)$ will be unequal to $D(P_Y || P_X)$ .

Subsequently, these two definitions are clarified with our standard example "dice experiment". At the same time we refer to the following exercises:

Probability mass functions of our dice experiment

$\text{Example 4:}$ For the dice experiment, we defined in $\text{Example 3}$ the probability mass functions $P_R(·)$ and $P_B(·)$ and their approximations $Q_R(·)$ and $Q_B(·)$ .

The random variable $R$ with the probability mass function $P_R(·)$ indicates the numbers of the red cube and $B$ the numbers of the blue cube ⇒ PMF $P_B(·)$.
The approximations $Q_R(·)$ and $Q_B(·)$ result from the former experiment with $N = 18$ double throws ⇒ $\text{Example 1}$ .

Then holds:

Since $P_R(·)$ and $P_B(·)$ are identical, we obtain zero for each of the Kullback-Leibler distances $D(P_R \vert \vert P_B)$ and $D(P_B \vert \vert P_R)$.
The comparison of $P_R(·)$ and $Q_R(·)$ yields for the first variant of the Kullback-Leibler distance:

$$\begin{align*}D(P_R \hspace{0.05cm} \vert \vert \hspace{0.05cm} Q_R) & = {\rm E} \left [ {\rm log}_2 \hspace{0.1cm} \frac{P_R(\cdot)}{Q_R(\cdot)}\right ] \hspace{0.1cm} = \sum_{\mu = 1}^{6} P_R(r_{\mu}) \cdot {\rm log} \hspace{0.1cm} \frac{P_R(r_{\mu})}{Q_R(r_{\mu})} = \\ & = {1}/{6} \cdot \left [ 2 \cdot {\rm log}_2 \hspace{0.1cm} \frac{1/6}{2/18} \hspace{0.1cm} + 2 \cdot {\rm log}_2 \hspace{0.1cm} \frac{1/6}{3/18} \hspace{0.1cm} + 2 \cdot {\rm log}_2 \hspace{0.1cm} \frac{1/6}{4/18} \hspace{0.1cm} \right ] = 1/6 \cdot \big [ 2 \cdot 0.585 + 2 \cdot 0 - 2 \cdot 0.415 \big ] \approx 0.0570\hspace{0.15cm} {\rm bit} .\end{align*}$$

In the calculation of the expected value, the fact that $P_R(r_1) = $ ... $ = P_R(r_6)$, the factor 1/6 can be excluded. Since the logarithm to base $ 2$ was used here, the pseudo-unit "bit” is added.

For the second variant of the Kullback-Leibler distance, a slightly different value results:

$$\begin{align*}D(Q_R \hspace{0.05cm}\vert \vert \hspace{0.05cm} P_R) & = {\rm E} \left [ {\rm log}_2 \hspace{0.1cm} \frac{Q_R(\cdot)}{P_R(\cdot)}\right ] \hspace{0.1cm} = \sum_{\mu = 1}^{6} Q_R(r_{\mu}) \cdot {\rm log} \hspace{0.1cm} \frac{Q_R(r_{\mu})}{P_R(r_{\mu})} \hspace{0.05cm} = \\ & = 2 \cdot \frac{2}{18} \cdot {\rm log}_2 \hspace{0.1cm} \frac{2/18}{1/6} \hspace{0.1cm} + 2 \cdot \frac{3}{18} \cdot {\rm log}_2 \hspace{0.1cm} \frac{3/18}{1/6} \hspace{0.1cm} + 2 \cdot \frac{4}{18} \cdot {\rm log}_2 \hspace{0.1cm} \frac{4/18}{1/6} \approx 0.0544\hspace{0.15cm} {\rm bit} .\end{align*}$$

For the blue cube, one obtains $D(P_B \vert \vert Q_B) ≈ 0.0283 \hspace{0.15cm} \rm bit$ and $D(Q_B \vert \vert P_B) ≈ 0.0271 \hspace{0.15cm} \rm bit$, i.e. slightly smaller Kullback-Leibler distances, since the approximation $Q_B(·)$ of $P_B(·)$ differs less than $Q_R(·)$ of $P_R(·)$.
Comparing the frequencies $Q_R(·)$ and $Q_B(·)$, we get $D(Q_R \vert \vert Q_B) ≈ 0.0597 \hspace{0.15cm} \rm bit$ and $D(Q_B \vert \vert Q_R) ≈ 0.0608 \hspace{0.15cm} \rm bit$. Here the distances are greatest, since the differences between $Q_B(·)$ and $Q_R(·)$ are greater than between $Q_R(·)$ and $P_R(·)$ or between $Q_B(·)$ and $P_B(·)$.

Joint probability and joint entropy

For the remainder of this third chapter, we always consider two discrete random variable $X = \{ x_1, \ x_2$, ... , $x_M \}$ and $Y = \{ y_1, \ y_2$, ... , $y_K \}$, whose value ranges do not necessarily have to coincide. This means: $K ≠ M$ $($in other notation: $|Y| ≠ |X|)$ is quite permissible.

The probability mass function thus has a $K×M$ matrix form with the elements

$$P_{XY}(X = x_{\mu}\hspace{0.05cm}, \ Y = y_{\kappa}) = {\rm Pr} \big [( X = x_{\mu})\hspace{0.05cm}\cap \hspace{0.05cm} (Y = y_{\kappa}) \big ] \hspace{0.05cm}.$$

We use $P_{XY}(X, Y)$ as a shorthand notation. The new random variable $XY$ contains both the properties of $X$ and those of $Y$.

$\text{Definition:}$ The »joint entropy« can be represented with the two-dimensional probability mass function $P_{XY}(X, Y)$ as an expected value as follows:

$$H(XY) = {\rm E} \left [ {\rm log} \hspace{0.1cm} \frac{1}{P_{XY}(X, Y)}\right ] = \sum_{\mu = 1}^{M} \hspace{0.1cm} \sum_{\kappa = 1}^{K} \hspace{0.1cm} P_{XY}(x_{\mu}\hspace{0.05cm}, y_{\kappa}) \cdot {\rm log} \hspace{0.1cm} \frac{1}{P_{XY}(x_{\mu}\hspace{0.05cm}, y_{\kappa})} \hspace{0.05cm}.$$

In the following we use throughout the logarithm to the base $2$ ⇒ $\log(x) → \log_2(x)$. The numerical value is thus to be assigned the pseudo-unit "bit".

In general, the following »upper bound« can be given for the joint entropy:

$$H(XY) \le H(X) + H(Y) \hspace{0.05cm}.$$

This inequality expresses the following fact:

The equal sign only applies to the special case of statistically independent random variables, as demonstrated in the following $\text{Example 5}$ using the random variables $R$ and $B$ . Here $R$ and $B$ denote the numbers of the red and blue dice, respectively:

$$H(RB) = H(R) + H(B).$$

If, on the other hand, there are statistical dependencies as in $\text{Example 6}$ between the random variables $R$ and $S = R + B$, the "<" sign applies in the above inequality:

$$H(RS) < H(R) + H(S).$$

These examples also show to what extent the joint entropies $H(RB)$ and $H(RS)$ change if one does not determine an infinite number of pairs of throws in the dice experiment, but only $N = 18$.

Two-dimensional probability mass function $P_{RB}$ and approximation $Q_{RB}$

$\text{Example 5:}$ We return to the experiment $\text{Rolling with two dice}$ :

The random variables are the points of the

red cube: ⇒ $R = \{1, \ 2,\ 3,\ 4,\ 5,\ 6\}$,
blue cube: ⇒ $B = \{1,\ 2,\ 3,\ 4,\ 5,\ 6\}$.

The left graph shows the probabilities

$$P_{RB}(r_\mu,\ b_\kappa ) ={\rm Pr}\big [(R=r_\mu) \hspace{0.05cm}\cap \hspace{0.05cm} (B=b_\kappa)\big],$$

which for all $μ = 1$, ... , $6$ and for all $κ = 1$, ... , $6$ equally yield the value $1/36$. Thus, one obtains for the joint entropy:

$$H(RB) = H(RB) \big \vert_{N \hspace{0.05cm}\rightarrow \hspace{0.05cm}\infty} = {\rm log}_2 \hspace{0.1cm} (36) = 5.170\hspace{0.05cm} {\rm bit} .$$

One can see from the left graph and the equation given here:

Since $R$ and $B$ are statistically independent of each other, the following applies.

$$P_{RB}(R, B) = P_R(R) · P_B(B).$$

The joint entropy is the sum of the two individual entropies:

$$H(RB) = H(R) + H(B).$$

The right graph shows the approximated two-dimensional probability mass function $Q_{RB}(·)$, based on the only $N = 18$ throws of our experiment. Here, no quadratic form of the joint probability $Q_{RB}(·)$ results, and the joint entropy derived from it is significantly smaller than $H(RB)$:

$$H(RB) \big \vert_{N \hspace{0.05cm} = \hspace{0.05cm}18} = 16 \cdot \frac{1}{18}\cdot {\rm log}_2 \hspace{0.1cm} \frac{18}{1} \hspace{0.1cm} +\hspace{0.1cm} 1 \cdot \frac{2}{18}\cdot {\rm log}_2 \hspace{0.1cm} \frac{18}{2} \hspace{0.1cm}= 4.059\hspace{0.15cm} {\rm bit} .$$

Two-dimensional probability mass function $P_{RS}$ and approximation $Q_{RS}$

$\text{Example 6:}$ In the dice experiment, in addition to the random variables $R$ (red cube) and $B$ (blue cube) also the sum $S = R + B$ is considered. The graph on the left shows that the two-dimensional probability mass function $P_{RS}(·)$ cannot be written as a product of $P_R(·)$ and $P_S(·)$.

With the probability functions

$$P_R(R) = 1/6 \cdot \big [ 1,\ 1,\ 1,\ 1,\ 1,\ 1 \big ],$$

$$P_S(S)=1/36 \cdot \big [ 1,\ 2,\ 3,\ 4,\ 5,\ 6,\ 5,\ 4,\ 3,\ 2,\ 1 \big ] $$

one obtains for the entropies:

$$H(RS) = {\rm log}_2 \hspace{0.1cm} (36) \approx 5.170\hspace{0.15cm} {\rm bit} ,$$

$$H(R) = {\rm log}_2 \hspace{0.1cm} (6) \approx 2.585\hspace{0.15cm} {\rm bit},$$

$$H(S) = 2 \hspace{-0.05cm}\cdot \hspace{-0.05cm}\frac{1}{36} \hspace{-0.05cm}\cdot \hspace{-0.05cm} {\rm log}_2 \hspace{0.05cm} \frac{36}{1} \hspace{0.05cm} + 2 \hspace{-0.05cm}\cdot \hspace{-0.05cm} \frac{2}{36} \hspace{-0.05cm}\cdot \hspace{-0.05cm} {\rm log}_2 \hspace{0.05cm} \frac{36}{2} \hspace{0.05cm} + 2 \hspace{-0.05cm}\cdot \hspace{-0.05cm} \frac{3}{36} \cdot {\rm log}_2 \hspace{0.05cm} \frac{36}{3} \hspace{0.05cm} + $$

$$+ 2 \hspace{-0.05cm}\cdot \hspace{-0.05cm} \frac{4}{36} \cdot {\rm log}_2 \hspace{0.05cm} \frac{36}{4} \hspace{0.05cm} +2 \hspace{-0.05cm}\cdot \hspace{-0.05cm} \frac{5}{36} \cdot {\rm log}_2 \hspace{0.05cm} \frac{36}{5} + 1 \hspace{-0.05cm}\cdot \hspace{-0.05cm} \frac{6}{36} \cdot {\rm log}_2 \hspace{0.05cm} \frac{36}{6} $$

$$\Rightarrow \hspace{0.3cm} H(S) \approx 3.274\hspace{0.15cm} {\rm bit} . $$

From these numerical values one can see:

The comparison with the $\text{Example 5}$ shows that $H(RS) =H(RB)$. The reason for this is that, knowing $R$ the random variables $B$ and $S$ give exactly the same information.

Due to the statistical dependence between the red cube and the sum, $H(RS) ≈ 5.170 \hspace{0.15cm} \rm bit$ is smaller than the sum $H(R) + H(S) ≈ 5.877 \hspace{0.15cm} \rm bit.$

Shown on the right is the case where the two-dimensional probability mass function $Q_{RS}(·)$ was determined empirically $(N = 18)$. Although a completely different figure emerges due to the very small $N$ value, the approximation for $H(RS)$ provides exactly the same value as the approximation for $H(RB)$ in $\text{Example 5}$:

$$H(RS) \big \vert_{N \hspace{0.05cm} = \hspace{0.05cm}18} = H(RB) \big \vert_{N \hspace{0.05cm} = \hspace{0.05cm}18} = 4.059\,{\rm bit} \hspace{0.05cm}.$$

Exercises for the chapter

Exercise 3.1: Probabilities when Rolling Dice

Exercise 3.1Z: Drawing Cards

Exercise 3.2: Expected Value Calculations

Exercise 3.2Z: Two-dimensional Probability Mass Function

Exercise 3.3: Entropy of Ternary Quantities

Exercise 3.4: Entropy for Different PMF

Exercise 3.5: Kullback-Leibler Distance and Binomial Distribution

Exercise 3.5Z: Kullback-Leibler Distance again

Exercise 3.6: Partitioning Inequality

References

↑ Mecking, M.: Information Theory. Lecture manuscript, Chair of Communications Engineering, Technische Universität München, 2009.

[1] Mecking, M.: Information Theory. Lecture manuscript, Chair of Communications Engineering, Technische Universität München, 2009.

[1]

@@ Line 1: / Line 1: @@
 {{Header
-|Untermenü=Information zwischen zwei wertdiskreten Zufallsgrößen
+|Untermenü=Mutual Information Between Two Discrete Random Variables
 |Vorherige Seite=Weitere Quellencodierverfahren
 |Nächste Seite=Verschiedene Entropien zweidimensionaler Zufallsgrößen
 }}
-== # ÜBERBLICK ZUM DRITTEN HAUPTKAPITEL # ==
+== # OVERVIEW OF THE THIRD MAIN CHAPTER # ==
+<br>
+The focus of this third main chapter is the&nbsp; &raquo;'''mutual information'''&laquo;&nbsp; $I(X; Y)$&nbsp; between two random variables&nbsp; $X$&nbsp; and $Y$.&nbsp; With statistical dependence,&nbsp; $I(X; Y)$&nbsp;  is smaller than the individual entropies&nbsp; $H(X)$&nbsp; or&nbsp; $H(Y)$.&nbsp;
-Im Mittelpunkt dieses dritten Kapitels steht die ''Transinformation'' $I(X; Y)$ zwischen zwei Zufallsgrößen $X$ und $Y$, wofür auch andere Begriffe wie ''Mutual Information'' oder ''gegenseitige Entropie'' üblich sind. Bei statistischer Abhängigkeit ist $I(X; Y)$ kleiner als die Einzelentropien $H(X)$ bzw. $H(Y)$. Beispielsweise wird die Unsicherheit hinsichtlich der Zufallsgröße $X$  ⇒  Entropie $H(X)$ durch die Kenntnis von $Y$ vermindert, und zwar um den Betrag $H(X\hspace{0.03cm}|\hspace{0.03cm}Y)$  ⇒  bedingte Entropie von $X$, falls $Y$ bekannt ist. Der verbleibende Rest ist die Transinformation $I(X; Y)$. Gleichzeitig gilt aber auch $I(X; Y) = H(Y) - H(Y\hspace{0.03cm}|\hspace{0.03cm}X)$. Das Semikolon weist auf die Gleichberechtigung der beiden betrachteten Zufallsgrößen $X$ und $Y$ hin.
+For example, the uncertainty regarding the random variable&nbsp; $X$&nbsp;  &nbsp; ⇒  &nbsp; entropy&nbsp; $H(X)$&nbsp; is reduced by the knowledge of&nbsp; $Y$,&nbsp; by the magnitude&nbsp; $H(X\hspace{0.03cm}|\hspace{0.03cm}Y)$  &nbsp; ⇒  &nbsp;   conditional entropy of&nbsp; $X$,&nbsp; if&nbsp; $Y$&nbsp; is known.&nbsp; The remaining residue is the mutual information&nbsp;
+:$$I(X; Y)= H(X) - H(X\hspace{0.03cm}|\hspace{0.03cm}Y).$$
+At the same time, however:
+:$$I(X; Y) = H(Y) - H(Y\hspace{0.03cm}|\hspace{0.03cm}X).$$&nbsp;
-Weitere Informationen zum Thema sowie Aufgaben, Simulationen und Programmierübungen finden Sie im Versuch &bdquo;Wertdiskrete Informationstheorie&rdquo; des Praktikums „Simulation Digitaler Übertragungssysteme ”. Diese (ehemalige) LNT-Lehrveranstaltung an der TU München basiert auf
+The semicolon indicates that the two random variables&nbsp; $X$&nbsp; and&nbsp; $Y$&nbsp; under consideration are on an equal footing.
-*dem Windows-Programm [http://en.lntwww.de/downloads/Sonstiges/Programme/WDIT.zip WDIT] &nbsp;&rArr;&nbsp; Link verweist auf die ZIP-Version des Programms; und
+In detail, the third main chapter deals with
-*der zugehörigen [http://en.lntwww.de/downloads/Sonstiges/Texte/Wertdiskrete_Informationstheorie.pdf Praktikumsanleitung]  &nbsp;&rArr;&nbsp; Link verweist auf die PDF-Version.
+*the relationship between probability and entropy for&nbsp; &raquo;2D random variables&laquo;,
+*the calculation of the&nbsp; &raquo;informational divergence&laquo;,&nbsp; also known as the&nbsp; &raquo;Kullback–Leibler distance&laquo;,
+*the definition of the&nbsp; &raquo;joint entropy&laquo;&nbsp; $H(XY)$&nbsp; and the&nbsp; &raquo;conditional entropies&laquo;&nbsp; $H(X\hspace{0.03cm}|\hspace{0.03cm}Y)$&nbsp; and&nbsp; $H(Y\hspace{0.03cm}|\hspace{0.03cm}X)$,
+*the&nbsp; &raquo;mutual information&laquo;&nbsp; $I(X; Y)$&nbsp; between two random variables,
+*the&nbsp; &raquo;information theory of digital signal transmission&laquo;&nbsp;  and the corresponding model,
+*the definition and meaning of the&nbsp; &raquo;channel capacity&laquo;&nbsp; and its connection with the mutual information,
+*the capacity calculation for&nbsp; &raquo;digital memoryless channels&laquo;&nbsp; such as BSC, BEC and BSEC,
+*the&nbsp; &raquo;Channel Coding Theorem&laquo;,&nbsp; one of the highlights of Shannon's information theory.
-Der erste Abschnitt &bdquo;Allgemeine Beschreibung&rdquo; ist wie folgt gegliedert:
-==Einführungsbeispiel zur statistischen Abhängigkeit von Zufallsgrößen ==
+==Introductory example on the statistical dependence of random variables ==
-Wir gehen vom Experiment „Würfeln mit zwei Würfeln” aus, wobei beide Würfel (an der Farbe) unterscheidbar sind. Die untere Tabelle zeigt als Ergebnis die ersten $N = 18$ Wurfpaare dieses exemplarischen Zufallsexperiments.
+<br>
+[[File:EN_Inf_T_3_1_S1.png|right|frame|Result protocol of our random experiment&nbsp; "Rolling with two dice"]]
-[[File:P_ID2741__Inf_T_3_1_S1_neu.png|frame|Ergebnisprotokoll unseres Zufallsexperiments „Würfeln mit zwei Würfeln”]]
+{{GraueBox|TEXT=
+$\text{Example 1:}$&nbsp; We start from the experiment&nbsp; "Rolling with two dice", where both dice are distinguishable by colour.&nbsp; The table shows the results of the first&nbsp; $N = 18$&nbsp; pairs of throws of this exemplary random experiment.
+<br clear=all>
+According to the nomenclature explained in the&nbsp; [[Information_Theory/Some_Preliminary_Remarks_on_Two-Dimensional_Random_Variables#Prerequisites_and_nomenclature|"following section"]]&nbsp; $R_ν$,&nbsp; $B_ν$&nbsp; and&nbsp; $S_ν$&nbsp; are here to be understood as random variables:
+*For example, the random variable&nbsp; $R_3 \in  \{1, \ 2, \ 3, \ 4, \ 5, \ 6\}$&nbsp; indicates the number of points of the red cube on the third throw as a probability event.&nbsp; The specification&nbsp; $R_3 = 6$&nbsp; states that in the documented realization the red cube showed a&nbsp; "6"&nbsp; in the third throw.
-''Hinweis'': Entsprechend der im [[Informationstheorie/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Voraussetzungen_und_Nomenklatur|nachfolgendem Abschnitt]]  erklärten Nomenklatur sind hier $R_ν$, $B_ν$ und $S_ν$ als Zufallsgrößen zu verstehen. Die Zufallsgröße $R_3 = \{1, 2, 3, 4, 5, 6\}$ gibt beispielsweise die Augenzahl des roten Würfels beim dritten Wurf als Wahrscheinlichkeitsereignis an. Die Angabe $R_3 = 6$ sagt aus, dass bei der dokumentierten Realisierung der rote Würfel im dritten Wurf eine „6” gezeigt hat.
+*In line 2, the results of the red cube&nbsp; $(R)$&nbsp; are indicated.&nbsp; The mean value of this limited sequence&nbsp; $〈R_1$, ... , $R_{18}〉$&nbsp; is with&nbsp; $3.39$&nbsp; smaller than the expected value&nbsp; ${\rm E}\big[R\big] = 3.5$.&nbsp;
+*Line 3 shows the results of the blue cube&nbsp; $(B)$.&nbsp; The sequence&nbsp; $〈B_1$, ... , $B_{18}〉$&nbsp; has a slightly larger mean value of&nbsp; $3.61$&nbsp; than  the unlimited sequence  &nbsp; ⇒ &nbsp; expected value ${\rm E}\big[B\big] = 3.5$.&nbsp;
+*Line 4 contains the sum&nbsp; $S_ν = R_ν + B_ν$.&nbsp; The mean value of the sequence&nbsp; $〈S_1$, ... , $S_{18}〉$&nbsp; is&nbsp; $3.39 + 3.61 = 7$.&nbsp; This is here (only by chance) equal to the expected value&nbsp; $\text{E}\big[S\big] = \text{E}\big[R\big] + \text{E}\big[B\big]$.
-*In Zeile 2 sind die Augenzahlen des roten Würfels ($R$) angegeben. Der Mittelwert dieser begrenzten Folge $〈R_1, ... , R_{18}〉$ ist mit 3.39 etwas kleiner als der Erwartungswert ${\rm E}[R] = 3.5$.
-*Die Zeile 3 zeigt die Augenzahlen des blauen Würfels ($B$). Die Folge $〈B_1, ... , B_{18}〉$ hat mit $3.61$ einen etwas größeren Mittelwert als die unbegrenzte Folge   &nbsp; ⇒ &nbsp; ${\rm E}[B] = 3.5$.
-*Zeile 4 beinhaltet die Summe $S_ν = R_ν + B_ν$. Der Mittelwert der Folge $〈S_1, ... , S_{18}〉$ ist $3.39 + 3.61 = 7$. Dieser ist hier (zufällig) gleich dem Erwartungswert $\text{E}[S] = \text{E}[R] + \text{E}[B]$.
+Now the question arises between which random variables there are statistical dependencies:
+*If one assumes fair dice, there are no statistical dependencies between the sequences&nbsp; $〈 R\hspace{0.05cm} 〉$&nbsp; and&nbsp; $〈B \hspace{0.05cm}〉$&nbsp; – whether bounded or unbounded: &nbsp; Even if one knows&nbsp; $R_ν$&nbsp; for&nbsp; $B_ν$&nbsp; all possible results&nbsp; $(1$, ... , $6)$&nbsp; are equally probable.
+*If one knows&nbsp; $S_ν$,&nbsp; however,&nbsp; statements about&nbsp; $R_ν$&nbsp; as well as about&nbsp; $B_ν$&nbsp; are possible.&nbsp; From&nbsp; $S_{11} = 12$&nbsp; follows directly&nbsp; $R_{11} = B_{11} = 6$&nbsp; and the sum&nbsp; $S_{15} = 2$&nbsp; of two dice is only possible with&nbsp; $R_{15} = B_{15} = 1$.&nbsp; Such dependencies are called&nbsp; &raquo;deterministic&laquo;.
+*From&nbsp; $S_7 = 10$,&nbsp; at least ranges for&nbsp; $R_7$&nbsp; and&nbsp; $B_7$&nbsp; can be given: &nbsp; $R_7 ≥ 4, \ B_7 ≥ 4$.&nbsp; Only three pairs are possible:&nbsp; $(R_7 = 4) ∩ (B_7 = 6)$,&nbsp; $(R_7 = 5) ∩ (B_7 = 5)$,&nbsp; $(R_7 = 6) ∩ (B_7 = 4)$.&nbsp; Here there is no deterministic relationship between the variables&nbsp; $S_ν$&nbsp; and&nbsp; $R_ν$&nbsp; $($or&nbsp; $B_ν)$, but rather a so-called&nbsp; [[Theory_of_Stochastic_Signals/Statistical_Dependence_and_Independence#General_definition_of_statistical_dependence|&raquo;statistical dependence&laquo;]].
+*Such statistical dependencies exist for&nbsp; $S_ν ∈ \{3, \ 4, \ 5, \ 6, \ 8, \ 9, \ 10, \ 11\}$.&nbsp; On the other hand, if the sum&nbsp; $S_ν = 7$, one cannot infer&nbsp; $R_ν$&nbsp; and&nbsp; $B_ν$&nbsp; from this.&nbsp; For both dice, all possible numbers&nbsp; $1$, ... , $6$&nbsp; are equally probable.&nbsp; In this case, there are also no statistical dependencies between&nbsp; $S_ν$&nbsp; and&nbsp; $R_ν$&nbsp; or between&nbsp; $S_ν$&nbsp; and&nbsp; $B_ν$.}}
-Nun stellt sich die Frage, zwischen welchen Zufallsgrößen es statistische Abhängigkeiten gibt:
+== Prerequisites and nomenclature ==
-*Setzt man faire Würfel voraus, so bestehen zwischen den Folgen $〈R〉$ und $〈B〉$ – ob begrenzt oder unbegrenzt – keine statistischen Bindungen: Auch wenn man $R_ν$ kennt, sind für $B_ν$ weiterhin alle möglichen Augenzahlen $1$, ... , $6$ gleichwahrscheinlich.
+<br>
-*Kennt man aber $S_ν$, so sind sowohl Aussagen über $R_ν$ als auch über $B_ν$ möglich. Aus $S_{11} = 12$ (siehe obige Tabelle) folgt direkt $R_{11} = B_{11} = 6$ und die Summe $S_{15} = 2$ zweier Würfel ist nur mit zwei Einsen möglich. Solche Abhängigkeiten bezeichnet man als ''deterministisch''.
+Throughout this chapter, we consider discrete random variables of the form&nbsp; $X = \{ x_1, \ x_2, \hspace{0.05cm}$ ... $\hspace{0.05cm},\ x_{\mu},\hspace{0.05cm}$ ... $\hspace{0.05cm},\ x_M \} \hspace{0.05cm},$&nbsp; and use the following nomenclature:
-*Aus $S_7 = 10$ lassen sich zumindest Bereiche für $R_7$ und $B_7$ angeben: $R_7 ≥ 4, \ B_7 ≥ 4$. Möglich sind dann nur die drei Wertepaare $(R_7 = 4) ∩ (B_7 = 6)$, $(R_7 = 5) ∩ (B_7 = 5)$ sowie $(R_7 = 6) ∩ (B_7 = 4)$. Hier besteht zwar kein deterministischer Zusammenhang zwischen den Zufallsgrößen $S_ν$ und $R_ν$ (bzw. $B_ν$), aber eine so genannte [[Stochastische_Signaltheorie/Statistische_Abhängigkeit_und_Unabhängigkeit#Allgemeine_Definition_von_statistischer_Abh.C3.A4ngigkeit|statistische Abhängigkeit]].
+*The random variable itself is always denoted by a capital letter.&nbsp; The lower case letter&nbsp; $x$&nbsp; indicates a possible realization of the random variable&nbsp; $X$.
-*Solche statistische Abhängigkeiten gibt es für alle $S_ν ∈ \{3, 4, 5, 6, 8, 9, 10, 11\}$. Ist dagegen die Summe $S_ν = 7$, so kann daraus nicht auf $R_ν$ und $B_ν$ zurückgeschlossen werden. Für beide Würfel sind dann alle möglichen Augenzahlen $1$, ... , $6$ gleichwahrscheinlich. In diesem Fall bestehen auch keine statistischen Bindungen zwischen $S_ν$ und $R_ν$ bzw. zwischen $S_ν$ und $B_ν$.
+*All realizations&nbsp; $x_μ$&nbsp; $($with&nbsp; $μ = 1$, ... , $M)$&nbsp; are real-valued.&nbsp; $M$&nbsp; indicates the&nbsp; "symbol set size"&nbsp; or&nbsp; "alphabet size"&nbsp; of&nbsp; $X$.&nbsp; Instead of&nbsp; $M$,&nbsp; we sometimes also use&nbsp; $|X|$.
-== Voraussetzungen und Nomenklatur ==
+[[File:P_ID2743__Inf_T_3_1_S2.png|right|frame|Relationship between the probability space&nbsp; ${\it \Omega}$&nbsp; <br>and the random variable&nbsp; $X$]]
-Wir betrachten im gesamten Kapitel  wertdiskrete Zufallsgrößen der Form $X = \{ x_1, x_2, \hspace{0.05cm}... \hspace{0.15cm}, x_{\mu},\hspace{0.05cm}$ ... $\hspace{0.15cm}, x_M \} \hspace{0.05cm},$ und verwenden folgende Nomenklatur:
+<br>The random variable&nbsp; $X$&nbsp; can, for example, be created by the transformation&nbsp; ${\it \Omega} → X$&nbsp;, where&nbsp; ${\it \Omega}$&nbsp; stands for the probability space of a random experiment.&nbsp;
-*Die Zufallsgröße selbst wird stets mit einem Großbuchstaben bezeichnet, und der Kleinbuchstabe $x$ weist auf eine mögliche Realisierung der Zufallsgröße $X$  hin.
-*Alle Realisierungen $x_μ$ (mit $μ = 1$, ... , $M$) sind reellwertig. $M$ gibt den Symbolumfang (englisch: ''Symbol Set Size'') von $X$ an. Anstelle von $M$ verwenden wir manchmal auch $|X|$.
-Die Zufallsgröße $X$ kann zum Beispiel durch die Transformation $\Omega → X$ entstanden sein, wobei $\Omega$ für den Wahrscheinlichkeitsraum eines Zufallsexperiments steht. Die nachfolgende Grafik verdeutlicht eine solche Transformation:
+The diagram illustrates such a transformation:
 :$${\it \Omega} = \{ \omega_1, \omega_2, \omega_3, ... \hspace{0.15cm} \}
 \hspace{0.25cm} \longmapsto \hspace{0.25cm}
-X = \{ x_1, x_2, x_3, x_4\}
+X = \{ x_1, \ x_2, \ x_3, \ x_4\}
 \subset \cal{R}\hspace{0.05cm}.$$
-Jedes Zufallsereignis $ω_i ∈ Ω$ wird eindeutig einem reellen Zahlenwert $x_μ ∈ X ⊂ ℝ$ zugeordnet. Im betrachteten Beispiel gilt für die Laufvariable $1 ≤ μ ≤ 4$, das heißt, der Symbolumfang beträgt $M = |X| = 4$.
+*Each random event&nbsp; $ω_i ∈ Ω$&nbsp; is uniquely assigned to a real numerical value&nbsp; $x_μ ∈ X ⊂ \cal{R}$.
+*In the example considered, the running variable is&nbsp; $1 ≤ μ ≤ 4$, i.e. the symbol set size is&nbsp; $M = |X| = 4$.
-:[[File:P_ID2743__Inf_T_3_1_S2.png|frame|Zusammenhang zwischen Wahrscheinlichkeitsraum und Zufallsgröße]]
+*However, the figure is not one-to-one: &nbsp;  The realization&nbsp; $x_3 ∈ X$&nbsp; could have resulted from the elementary event&nbsp; $ω_4$&nbsp; in the example, but also from&nbsp; $ω_6$&nbsp; $($or from some other of the infinitely many elementary events&nbsp; $ω_i$ not drawn in the diagram).
+<br clear=all>
-Die Abbildung ist aber nicht eineindeutig: Die Realisierung $x_3 ∈ X$ könnte sich im Beispiel aus dem Elementarereignis $ω_4$ ergeben haben, aber auch aus $ω_6$ (oder aus einem anderen der unendlich vielen, in der Grafik nicht eingezeichneten Elementarereignisse $ω_i$).
+{{BlaueBox|TEXT=
+$\text{Agreement:}$&nbsp; Often one refrains from indexing both the elementary events&nbsp; $ω_i$&nbsp; and the realizations&nbsp; $x_μ$.&nbsp; This results in the following shorthand notations, for example:
-Oft verzichtet man auf die Indizierung sowohl der Elementarereignisse $ω_i$ als auch der Realisierungen $x_μ$. Damit ergeben sich beispielsweise folgende Kurzschreibweisen:
 :$$ \{ X = x  \}
@@ Line 68: / Line 85: @@
 \hspace{0.05cm}.$$
-Mit dieser Vereinbarung gilt für die Wahrscheinlichkeiten der diskreten Zufallsgröße $X$:
+With this agreement, the probabilities of the discrete random variable&nbsp; $X$ are:
-:$${\rm Pr}( X = x_{\mu})  = \hspace{-0.2cm} \sum_{\omega \hspace{0.1cm} \in   \{ X = x_{\mu} \}}
+:$${\rm Pr}( X = x_{\mu})  = \hspace{-0.2cm} \sum_{\omega \hspace{0.1cm} \in   \{ X = x_{\mu} \} }
   \hspace{-0.2cm}{\rm Pr} \left ( \{ \omega \} \right )
-\hspace{0.05cm}.$$
+\hspace{0.05cm}.$$}}
-==Wahrscheinlichkeitsfunktion und Wahrscheinlichkeitsdichtefunktion==
+==Probability mass function and probability density function==
-{{BlaueBox|
+<br>
-TEXT='''Definition''':&nbsp; Fasst man die $M$ Wahrscheinlichkeiten einer diskreten Zufallsgröße $X$ &nbsp; ⇒ &nbsp; ${\rm Pr}( X = x_{\mu})$ ähnlich wie bei einem Vektor zusammen, so kommt man zur '''Wahrscheinlichkeitsfunktion''' (englisch: ''Probability Mass Function'', PMF):
+{{BlaueBox|TEXT=
+$\text{Definition:}$&nbsp; If the&nbsp; $M$&nbsp;  probabilities of a discrete random variable&nbsp; $X$ &nbsp; ⇒ &nbsp; ${\rm Pr}( X = x_{\mu})$&nbsp; are combined in a similar way to a vector, <br>we arrive at the &nbsp; &raquo;'''probability mass function'''&laquo;&nbsp; $\rm (PMF)$:
-:$$P_X(X) = \big [ \hspace{0.02cm} P_X(x_1), P_X(x_2), \hspace{0.05cm}... \hspace{0.15cm}, P_X(x_{\mu}),\hspace{0.05cm} ...\hspace{0.15cm}, P_X(x_M) \hspace{0.02cm} \big ] \hspace{0.05cm}.$$
+:$$P_X(X) = \big [ \hspace{0.02cm} P_X(x_1), P_X(x_2), \hspace{0.05cm}\text{...} \hspace{0.15cm}, P_X(x_{\mu}),\hspace{0.05cm} \text{...}\hspace{0.15cm}, P_X(x_M) \hspace{0.02cm} \big ] \hspace{0.05cm}.$$
+The&nbsp; $μ$&ndash;th element of this&nbsp; "vector"&nbsp; indicates the probability &nbsp; $P_X(x_{\mu}) =  {\rm Pr}( X = x_{\mu}) $.}}
-Das $μ$–te Element dieses „Vektors” gibt dabei die folgende Wahrscheinlichkeit $P_X(x_{\mu}) =  {\rm Pr}( X = x_{\mu}) $ an.}}
+In the book&nbsp; "Theory of Stochastic Signals",&nbsp; we defined a similar descriptive quantity with the&nbsp; [[Theory_of_Stochastic_Signals/Probability_Density_Function_(PDF)#Definition_of_the_probability_density_function|$\text{probability density function}$]]&nbsp; $(\rm PDF)$&nbsp; and designated it as&nbsp; $f_X(x)$.
-Im Buch&bdquo;Stochastische Signaltheorie&rdquo; haben wir mit der [[Stochastische_Signaltheorie/Wahrscheinlichkeitsdichtefunktion_(WDF)#Definition_der_Wahrscheinlichkeitsdichtefunktion|Wahrscheinlichkeitsdichtefunktion]] (WDF, englisch: ''Probability Density Function'', PDF) eine ähnliche Beschreibungsgröße definiert und diese mit $f_X(x)$ bezeichnet.
+It should be noted, however:
-Zu beachten ist aber:
+*The PDF is more suitable for characterizing continuous random variables, such as a&nbsp; [[Theory_of_Stochastic_Signals/Gaußverteilte_Zufallsgrößen|$\text{Gaussian distribution}$]]&nbsp; or a [[Theory_of_Stochastic_Signals/Gleichverteilte_Zufallsgrößen|$\text{uniform distribution}$]].&nbsp; Only through the use of&nbsp; [[Theory_of_Stochastic_Signals/Probability_Density_Function#PDF_definition_for_discrete_random_variables|$\text{Dirac delta functions}$]]&nbsp; does the PDF also become applicable for discrete random variables.
-*Die PDF eignet sich eher zur Charakterisierung kontinuierlicher Zufallsgrößen, wie zum Beispiel bei einer [[Stochastische_Signaltheorie/Gaußverteilte_Zufallsgrößen|Gaußverteilung]] oder einer [[Stochastische_Signaltheorie/Gleichverteilte_Zufallsgrößen|Gleichverteilung]]. Erst durch die [[Stochastische_Signaltheorie/Wahrscheinlichkeitsdichtefunktion#WDF-Definition_f.C3.BCr_diskrete_Zufallsgr.C3.B6.C3.9Fen|Verwendung von Diracfunktionen]] wird die PDF auch für diskrete Zufallsgrößen anwendbar.
+*The PMF provides less information about the random variable than the PDF and can also only be specified for discrete variables. &nbsp;However, for the discrete  information theory considered in this chapter, the PMF is sufficient.
-*Die PMF liefert weniger Information über die Zufallsgröße als die PDF und kann zudem nur für diskrete Größen angegeben werden. Für die wertdiskrete Informationstheorie ist die PMF allerdings ausreichend.
-{{GraueBox|
-TEXT='''Beispiel 1''':&nbsp; Wir betrachten eine Wahrscheinlichkeitsdichtefunktion (abgekürzt WDF bzw. PDF) ohne großen Praxisbezug:
+{{GraueBox|TEXT=
+$\text{Example 2:}$&nbsp; We consider a probability density function&nbsp; $\rm (PDF)$&nbsp; without much practical relevance:
-:$$f_X(x) = 0.2 \cdot (x+2) + 0.3 \cdot (x-1.5)+0.5 \cdot (x-{\rm \pi}) \hspace{0.05cm}. $$
+:$$f_X(x) = 0.2 \cdot \delta(x+2) + 0.3 \cdot \delta(x - 1.5)+0.5 \cdot \delta(x - {\rm \pi}) \hspace{0.05cm}. $$
-Für die diskrete Zufallsgröße gilt somit $x ∈ X = \{–2, +1.5, +\pi \} $ ⇒ Symbolumfang $M = \vert X \vert = 3$, und die Wahrscheinlichkeitsfunktion (PMF) lautet:
+Thus, for the discrete random variable&nbsp; $x ∈ X = \{–2,\ +1.5,\ +\pi \} $ &nbsp; ⇒ &nbsp; symbol set size&nbsp; $M = \vert X \vert = 3$, the probability function $\rm (PMF)$ is:
 :$$P_X(X) = \big [ \hspace{0.1cm}0.2\hspace{0.05cm}, 0.3\hspace{0.05cm}, 0.5 \hspace{0.1cm} \big] \hspace{0.05cm}. $$
-Man erkennt:
+It can be seen:
-*Die PMF liefert nur Informationen über die Wahrscheinlichkeiten $\text{Pr}(x_1)$, $\text{Pr}(x_2)$, $\text{Pr}(x_3)$. Aus der PDF sind dagegen auch die möglichen Realisierungen $x_1$, $x_2$, $x_3$ der Zufallsgröße $X$ ablesbar.
+*The&nbsp; $\rm PMF$&nbsp; only provides information about the probabilities&nbsp; $\text{Pr}(x_1)$,&nbsp; $\text{Pr}(x_2)$&nbsp; and&nbsp; $\text{Pr}(x_3)$.
-*Die einzige Voraussetzung für die Zufallsgröße ist, dass sie reellwertig ist. Die möglichen Werte $x_μ$ müssen weder positiv, ganzzahlig, äquidistant noch rational sein. 	}}
+*From the&nbsp; $\rm PDF$,&nbsp; on the other hand,&nbsp; the possible realizations&nbsp; $x_1$,&nbsp; $x_2$&nbsp; and&nbsp; $x_3$&nbsp; of the random variable&nbsp; $X$&nbsp; can also be read.
+*The only requirement for the random variable is that it is real-valued.
+*The possible values&nbsp; $x_μ$&nbsp; do not have to be positive, integer, equidistant or rational.	}}
-==Wahrscheinlichkeitsfunktion und Entropie==
+==Probability mass function and entropy==
+<br>
-In der wertdiskreten Informationstheorie genügt im Gegensatz zu übertragungstechnischen Problemen schon die Kenntnis der Wahrscheinlichkeitsfunktion $P_X(X)$, zum Beispiel zur Berechnung der [[Informationstheorie/Gedächtnislose_Nachrichtenquellen#Informationsgehalt_und_Entropie|Entropie]].
+In discrete  information theory in contrast to transmission problems, knowledge of the probability mass function&nbsp; $P_X(X)$ is sufficient, e.g. to calculate the&nbsp; [[Information_Theory/Gedächtnislose_Nachrichtenquellen#Information_content_and_entropy|$\text{entropy}$]].
-{{BlaueBox|
+{{BlaueBox|TEXT=
-TEXT='''Definition''':&nbsp; Die '''Entropie''' einer diskreten Zufallsgröße $X$ – also deren Unsicherheit für einen Beobachter – kann man mit der Wahrscheinlichkeitsfunktion $P_X(X)$ wie folgt darstellen:
+$\text{Definition:}$&nbsp; The&nbsp; $\rm entropy$&nbsp; of a discrete random variable&nbsp; $X$&nbsp; – i.e. its uncertainty for an observer - can be represented with the PMF&nbsp; $P_X(X)$&nbsp; as follows:
-:$$H(X) = {\rm E} \left [ {\rm log} \hspace{0.1cm} \frac{1}{P_X(X)}\right ] \hspace{0.05cm}=\hspace{0.05cm}
+:$$H(X) = {\rm E} \big [ {\rm log} \hspace{0.1cm} \frac{1}{P_X(X)}\big ] \hspace{0.05cm}=\hspace{0.05cm}
- -{\rm E} \left [ {\rm log} \hspace{0.1cm} {P_X(X)}\right ] \hspace{0.05cm}=\hspace{0.05cm}  \sum_{\mu = 1}^{M}
+  - {\rm E} \big [ {\rm log} \hspace{0.1cm} {P_X(X)}\big ] \hspace{0.05cm}=\hspace{0.05cm}  \sum_{\mu = 1}^{M}
   P_X(x_{\mu}) \cdot {\rm log} \hspace{0.1cm} \frac{1}{P_X(x_{\mu})} \hspace{0.05cm}=\hspace{0.05cm}  - \sum_{\mu = 1}^{M}
   P_X(x_{\mu}) \cdot {\rm log} \hspace{0.1cm} {P_X(x_{\mu})} \hspace{0.05cm}.$$
-Verwendet man den Logarithmus zur Basis 2, also $\log_2$ (...) &nbsp;  ⇒ &nbsp; ''Logarithmus dualis'', so wird der Zahlenwert mit der Pseudo–Einheit „bit” versehen. $\rm E[$...$]$ gibt den ''Erwartungswert'' an. }}
+If you use the logarithm to base&nbsp; $2$, i.e.&nbsp; $\log_2$ (...) &nbsp;  ⇒ &nbsp; "binary logarithm", the numerical value is provided with the pseudo-unit&nbsp; "bit".&nbsp; $\rm E\big[$...$\big]$ indicates the expected value. }}
-Beispielsweise erhält man
-*für $P_X(X) = \big [\hspace{0.02cm}0.2, 0.3, 0.5 \hspace{0.02cm}\big ]$:
-:$$H(X) = 0.2 \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{0.2} +
+For example, one obtains
+*with&nbsp; $P_X(X) = \big [\hspace{0.02cm}0.2, \ 0.3, \ 0.5 \hspace{0.02cm}\big ]$:
+::$$H(X) = 0.2 \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{0.2} +
 .3 \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{0.3}
 +0.5 \cdot {\rm log}_2 \hspace{0.1cm} \frac{1}{0.5}
-\approx 1.485\,{\rm bit}\hspace{0.05cm},$$
+\approx 1.485\hspace{0.15cm}{\rm bit},$$
-*für $P_X(X) = \big [\hspace{0.02cm}1/3, 1/3, 1/3\hspace{0.02cm}\big ]$:
+*with&nbsp; $P_X(X) = \big [\hspace{0.02cm}1/3, \ 1/3, \ 1/3\hspace{0.02cm}\big ]$:
-:$$H(X) = 3 \cdot 1/3 \cdot {\rm log}_2 \hspace{0.1cm} (3) = {\rm log}_2 \hspace{0.1cm} (3)
+::$$H(X) = 3 \cdot 1/3 \cdot {\rm log}_2 \hspace{0.1cm} (3) = {\rm log}_2 \hspace{0.1cm} (3)
-\approx 1.585\,{\rm bit}\hspace{0.05cm}.$$
+\approx 1.585\hspace{0.15cm}{\rm bit}.$$
-Das zweite Beispiel liefert das Maximum der Entropiefunktion für den Symbolumfang $M = 3$. Für ein allgemeines $M$ lässt sich dieses Ergebnis beispielsweise wie folgt herleiten – siehe [Meck]<ref>Mecking, M.: Information Theory. Vorlesungsmanuskript, Lehrstuhl für Nachrichtentechnik, Technische Universität München, 2009.</ref>:
+The second example provides the maximum of the entropy function for the symbol set size&nbsp; $M = 3$.
+{{BlaueBox|TEXT=
+$\text{Derivation:}$&nbsp;
+For general&nbsp; $M$,&nbsp; this result can be derived e.g. as follows – see&nbsp;  [Meck]<ref>Mecking, M.: Information Theory. Lecture manuscript, Chair of Communications Engineering, Technische Universität München, 2009.</ref>:
-:$$H(X) = -{\rm E} \left [ {\rm log} \hspace{0.1cm} {P_X(X)}\right ] \hspace{0.2cm} \le \hspace{0.2cm}- {\rm log} \left [ {\rm E} \hspace{0.1cm} \left [{P_X(X)}\right ] \right ] \hspace{0.05cm}.$$
+:$$H(X) = -{\rm E} \big [ {\rm log} \hspace{0.1cm} {P_X(X)}\big ] \hspace{0.2cm} \le \hspace{0.2cm}- {\rm log} \big [ {\rm E} \hspace{0.1cm} \left [{P_X(X)}\right ] \big ] \hspace{0.05cm}.$$
-Diese Abschätzung (Jensens's Ungleichung) ist zulässig, da der Logarithmus eine konkave Funktion ist. Entsprechend [[Aufgaben:3.2_Erwartungswertberechnungen|Aufgabe 3.2]] gilt:
+This estimation&nbsp;  $($&raquo;'''Jensens's inequality'''&laquo;$)$&nbsp; is admissible because the logarithm is a concave function.&nbsp; According to&nbsp; [[Aufgaben:Exercise_3.2:_Expected_Value_Calculations|"Exercise 3.2"]]&nbsp;, the following holds:
-$$-{\rm E} \left [  {P_X(X)}\right ] \hspace{0.1cm} \le \hspace{0.1cm} M \hspace{0.3cm} \Rightarrow \hspace{0.3cm}
+:$$- {\rm E} \big [  {P_X(X)}\big ] \hspace{0.1cm} \le \hspace{0.1cm} M \hspace{0.3cm} \Rightarrow \hspace{0.3cm}
 H(X) \le {\rm log} \hspace{0.1cm} (M)  \hspace{0.05cm}.$$
-Das Gleichheitszeichen ergibt sich nach der oberen Rechnung für gleiche Wahrscheinlichkeiten, also für $P_X(x_μ) = {1}/{M}$ für alle $μ$. In der [[Aufgaben:3.3_Entropie_von_Ternärgrößen|Aufgabe 3.3]] soll der gleiche Sachverhalt unter Verwendung der Abschätzung
+The equal sign results according to the calculation above for equal probabilities, i.e. for&nbsp; $P_X(x_μ) = {1}/{M}$&nbsp; for all&nbsp; $μ$.&nbsp; In &nbsp; [[Aufgaben:Exercise_3.3:_Entropy_of_Ternary_Quantities|"Exercise 3.3"]],&nbsp; the same situation is to be proved using the estimate&nbsp; &nbsp; "${\rm ln} \hspace{0.1cm} (x)   \le x-1$".&nbsp; &nbsp; The equal sign applies here only for&nbsp; $x = 1$.}}
-$${\rm ln} \hspace{0.1cm} (x)   \le x-1$$
-nachgewiesen werden. Das Gleichheitszeichen gilt hier nur für $x = 1$.
-Ist eine der $M$ Wahrscheinlichkeiten $P_X(x_μ)$ der Wahrscheinlichkeitsfunktion gleich 0, so lässt sich für die Entropie eine engere Schranke angeben:
+If one of the&nbsp; $M$&nbsp; probabilities&nbsp; $P_X(x_μ)$&nbsp; of the PMF is equal to zero, a tighter bound can be given for the entropy:
 :$$H(X) \le {\rm log} \hspace{0.1cm} (M-1)  \hspace{0.05cm}.$$
+{{BlaueBox|TEXT=
+$\text{Agreement:}$&nbsp; In the following example and on the next sections we use the following&nbsp; &raquo;'''nomenclature'''&laquo;:
+*The entropy&nbsp; $H(X)$&nbsp; always refers to the actual probability mass function&nbsp; $P_X(X)$&nbsp; of the discrete random variable.&nbsp;  Experimentally, these quantities are obtained only after&nbsp; $N → ∞$&nbsp; trials.
+*If the PMF is determined from a finite random sequence, we denote this probability mass function by&nbsp; $Q_X(X)$&nbsp; and add&nbsp; „$N =$ ...” to the resulting entropy.
+*This entropy approximation is not based on probabilities, but only on the&nbsp; [[Theory_of_Stochastic_Signals/From_Random_Experiment_to_Random_Variable#Bernoulli.27s_law_of_large_numbers|$\text{relative frequencies}$]].&nbsp; Only for&nbsp; $N → ∞$&nbsp; does this approximation agree with&nbsp; $H(X)$&nbsp;.}}
-Für das folgende Beispiel und die nächsten Seiten vereinbaren wir die folgende Nomenklatur:
-*Die Entropie $H(X)$ bezieht sich stets auf die tatsächliche Wahrscheinlichkeitsfunktion $P_X(X)$ der diskreten Zufallsgröße. Experimentell erhält man diese Größen erst nach $N → ∞$ Versuchen.
-*Ermittelt man die Wahrscheinlichkeitsfunktion aus einer endlichen Zufallsfolge, so bezeichnen wir diese mit $Q_X(X)$ und die daraus resultierende Entropie versehen wir mit dem Zusatz „$N =$ ...”.
-*Diese Entropie–Näherung basiert nicht auf Wahrscheinlichkeiten, sondern nur auf den [[Stochastische_Signaltheorie/Wahrscheinlichkeit_und_relative_Häufigkeit#Bernoullisches_Gesetz_der_gro.C3.9Fen_Zahlen|relativen Häufigkeiten]]. Erst für $N → ∞$ stimmt diese Näherung mit $H(X)$ überein.
+[[File:EN_Inf_T_3_1_S3.png|right|frame|Probability mass functions of our dice experiment]]
+{{GraueBox|TEXT=
+$\text{Example 3:}$&nbsp; We return to our&nbsp; "dice experiment".&nbsp;
+*The table shows the probability mass functions&nbsp; $P_R(R)$&nbsp; and&nbsp; $P_B(B)$&nbsp; for the red and blue dice as well as the approximations&nbsp; $Q_R(R)$&nbsp; and&nbsp; $Q_B(B)$,&nbsp; in each case based on the random experiment with&nbsp; $N = 18$&nbsp; throws.&nbsp;
+*The relative frequencies&nbsp; $Q_R(R)$&nbsp; and&nbsp; $Q_B(B)$&nbsp; result from the&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Introductory_example_on_the_statistical_dependence_of_random_variables|$\text{exemplary random sequences}$]]&nbsp; of&nbsp; $\text{Example 1}$.
-{{GraueBox|
-TEXT='''Beispiel 2''':&nbsp; Wir kommen auf unser ''Würfel–Experiment'' zurück. Die folgende Tabelle zeigt die Wahrscheinlichkeitsfunktionen $P_R(R)$ und $P_B(B)$ für den roten und den blauen Würfel sowie die Näherungen $Q_R(R)$ und $Q_B(B)$, jeweils basierend auf dem Zufallsexperiment mit $N = 18$ Würfen. Die relativen Häufigkeiten $Q_R(R)$ und $Q_B(B)$ ergeben sich aus den [[Informationstheorie/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Einf.C3.BChrungsbeispiel_zur_statistischen_Abh.C3.A4ngigkeit_von_Zufallsgr.C3.B6.C3.9Fen|beispielhaften Zufallsfolgen]] vom Beginn dieses Kapitels.
-[[File:P_ID2744__Inf_T_3_1_S3_neu.png|frame|Wahrscheinlichkeitsfunktionen unseres Würfelexperiments]]
+The following applies to the random variable&nbsp; $R$&nbsp; with the&nbsp; "binary logarithm"&nbsp; $($to base&nbsp; $2)$:
-Für die Zufallsgröße $R$ gilt mit dem ''Logarithmus dualis'' (zur Basis 2):
-:$$H(R) = H(R) \big \vert_{N \hspace{0.05cm}\rightarrow \hspace{0.05cm}\infty} = \sum_{\mu = 1}^{6} 1/6 \cdot {\rm log}_2 \hspace{0.1cm} (6) = {\rm log}_2 \hspace{0.1cm} (6) = 2.585\,{\rm bit} \hspace{0.05cm},$$
+:$$H(R) = H(R) \big \vert_{N \hspace{0.05cm}\rightarrow \hspace{0.05cm}\infty} = \sum_{\mu = 1}^{6} 1/6 \cdot {\rm log}_2 \hspace{0.1cm} (6) = {\rm log}_2 \hspace{0.1cm} (6) = 2.585\hspace{0.1cm} {\rm bit} ,$$
-:$$H(R) \big \vert_{N \hspace{0.05cm} =  \hspace{0.05cm}18} = 2 \cdot \frac{2}{18}\cdot {\rm log}_2 \hspace{0.1cm}  \frac{18}{2} \hspace{0.1cm} +\hspace{0.1cm} 2 \cdot \frac{3}{18}\cdot {\rm log}_2 \hspace{0.1cm}  \frac{18}{3} \hspace{0.1cm} +\hspace{0.1cm}  2 \cdot \frac{4}{18}\cdot {\rm log}_2 \hspace{0.1cm}  \frac{18}{4} \hspace{0.1cm}= 2.530\,{\rm bit} \hspace{0.05cm}.$$
+:$$H(R) \big \vert_{N \hspace{0.05cm} =  \hspace{0.05cm}18} = 2 \cdot \frac{2}{18}\cdot {\rm log}_2 \hspace{0.1cm}  \frac{18}{2} \hspace{0.1cm} +\hspace{0.1cm} 2 \cdot \frac{3}{18}\cdot {\rm log}_2 \hspace{0.1cm}  \frac{18}{3} \hspace{0.1cm} +\hspace{0.1cm}  2 \cdot \frac{4}{18}\cdot {\rm log}_2 \hspace{0.1cm}  \frac{18}{4} \hspace{0.1cm}= 2.530\hspace{0.1cm} {\rm bit}.$$
-Der blaue Würfel hat natürlich die gleiche Entropie: $H(B) = H(R) = 2.585\, \rm bit$. Hier erhält man für die auf $N = 18$ basierende Näherung einen etwas größeren Wert, da nach obiger Tabelle $Q_B(B)$ von der diskreten Gleichverteilung $P_B(B)$ weniger abweicht als $Q_R(R)$ von $P_R(R)$.
+The blue cube of course has the same entropy:&nbsp; $H(B) = H(R) = 2.585\ \rm bit$.&nbsp; Here we get a slightly larger value for the approximation based on&nbsp; $N = 18$&nbsp;, since according to the table above&nbsp; $Q_B(B)$&nbsp; deviates less from the discrete uniform distribution&nbsp; $P_B(B)$&nbsp; than&nbsp; als $Q_R(R)$&nbsp; from&nbsp; $P_R(R)$.
-:$$H(B) \big \vert_{N \hspace{0.05cm} =  \hspace{0.05cm}18} = 1 \cdot \frac{2}{18}\cdot {\rm log}_2 \hspace{0.1cm}  \frac{18}{2} \hspace{0.1cm} +\hspace{0.1cm} 4 \cdot \frac{3}{18}\cdot {\rm log}_2 \hspace{0.1cm}  \frac{18}{3} \hspace{0.1cm} +\hspace{0.1cm}  1 \cdot \frac{4}{18}\cdot {\rm log}_2 \hspace{0.1cm}  \frac{18}{4} \hspace{0.1cm}= 2.558\,{\rm bit} \hspace{0.05cm}.$$
+:$$H(B) \big \vert_{N \hspace{0.05cm} =  \hspace{0.05cm}18} = 1 \cdot \frac{2}{18}\cdot {\rm log}_2 \hspace{0.1cm}  \frac{18}{2} \hspace{0.1cm} +\hspace{0.1cm} 4 \cdot \frac{3}{18}\cdot {\rm log}_2 \hspace{0.1cm}  \frac{18}{3} \hspace{0.1cm} +\hspace{0.1cm}  1 \cdot \frac{4}{18}\cdot {\rm log}_2 \hspace{0.1cm}  \frac{18}{4} \hspace{0.1cm}= 2.558\hspace{0.1cm} {\rm bit} .$$
-Man erkennt aus den angegebenen Zahlenwerten, dass trotz des eigentlich viel zu kleinen Experimentenparameters $N$ die Verfälschungen hinsichtlich der Entropie nicht sehr groß sind.
+:It can be seen from the given numerical values that despite the experimental parameter&nbsp; $N$,&nbsp; which is here much too small, the deviation with regard to entropy is not very large.
-Es soll nochmals erwähnt werden, dass bei endlichem $N$ stets gilt:
+It should be mentioned again that with finite&nbsp; $N$&nbsp; the following always applies:
 :$$ H(R) \big \vert_{N } < H(R) =  {\rm log}_2 \hspace{0.1cm} (6) \hspace{0.05cm}, \hspace{0.5cm}
@@ Line 181: / Line 205: @@
-==Relative Entropie – Kullback–Leibler–Distanz ==
+==Informational divergence - Kullback-Leibler distance ==
-Wir betrachten die beiden Wahrscheinlichkeitsfunktionen $P_X(·)$ und $P_Y(·)$ über dem gleichen Alphabet $X = \{ x_1, x_2$, ... , $x_M \}$, und definieren nun die folgende informationstheoretische Größe:
+<br>
+We consider two probability mass functions&nbsp; $P_X(·)$&nbsp; and&nbsp; $P_Y(·)$&nbsp; over the same alphabet&nbsp; $X = \{ x_1, \ x_2$, ... ,&nbsp; $x_M \}$,&nbsp; and now define the following quantity:
-{{BlaueBox|TEXT='''Definition''':&nbsp; Die '''relative Entropie''' (englisch: ''Informational Divergence'') zwischen den durch $P_X(·)$ und $P_Y(·)$ definierten Zufallsgrößen ist durch folgende Gleichung gegeben:
+{{BlaueBox|TEXT=
+$\text{Definition:}$&nbsp; The&nbsp; &raquo;'''informational divergence'''&laquo;&nbsp; between the random variables defined by &nbsp; $P_X(·)$&nbsp; and&nbsp; $P_Y(·)$&nbsp; is given as follows:
 :$$D(P_X \hspace{0.05cm} \vert \vert  \hspace{0.05cm}P_Y) = {\rm E} \left [ {\rm log} \hspace{0.1cm} \frac{P_X(X)}{P_Y(X)}\right ] \hspace{0.2cm}=\hspace{0.2cm} \sum_{\mu = 1}^{M}
   P_X(x_{\mu}) \cdot {\rm log} \hspace{0.1cm} \frac{P_X(x_{\mu})}{P_Y(x_{\mu})} \hspace{0.05cm}.$$
-Man bezeichnet $D(P_X \vert \vert P_Y)$ auch als die '''Kullback–Leibler–Distanz''' (oder kurz ''KL–Distanz''). Diese liefert ein Maß für die „Ähnlichkeit” zwischen den beiden Wahrscheinlichkeitsfunktionen $P_X(·)$ und $P_Y(·)$.
+&nbsp; $D(P_X \vert \vert P_Y)$&nbsp; is also called&nbsp; &raquo;'''Kullback–Leibler distance'''&laquo;&nbsp;.
+*This provides a measure of the&nbsp; "similarity"&nbsp; between the two probability functions&nbsp; $P_X(·)$&nbsp; and&nbsp; $P_Y(·)$.
-Bei Verwendung des Logarithmus zur Basis 2 ist wieder die Pseudo–Einheit „bit” hinzuzufügen. }}
+*When using the logarithm to base&nbsp; $2$&nbsp; the pseudo-unit&nbsp; "bit"&nbsp; must be added again. }}
-In ähnlicher Weise lässt sich auch eine zweite Variante der Kullback–Leibler–Distanz angeben:
+Similarly, a second variant of the Kullback-Leibler distance can be given:
 :$$D(P_Y \hspace{0.05cm} ||  \hspace{0.05cm}P_X) = {\rm E} \left [ {\rm log} \hspace{0.1cm} \frac{P_Y(X)}{P_X(X)}\right ] \hspace{0.2cm}=\hspace{0.2cm} \sum_{\mu = 1}^{M}
   P_Y(x_{\mu}) \cdot {\rm log} \hspace{0.1cm} \frac{P_Y(x_{\mu})}{P_X(x_{\mu})} \hspace{0.05cm}.$$
-Gegenüber der ersten Variante wird nun jede Funktion $P_X(·)$ durch $P_Y(·)$ ersetzt und umgekehrt. Da sich im allgemeinen $D(P_X || P_Y)$ und $D(P_Y || P_X)$ unterscheiden, ist der Begriff „Distanz” eigentlich irreführend. Wir wollen es aber bei dieser Namensgebung belassen.
+Compared to the first variant, each function&nbsp; $P_X(·)$&nbsp; is now replaced by&nbsp; $P_Y(·)$&nbsp; and vice versa.&nbsp;  Since in general&nbsp; $D(P_X || P_Y)$&nbsp; and&nbsp; $D(P_Y || P_X)$&nbsp; differ, the term&nbsp; "distance"&nbsp;  is actually misleading.&nbsp; However, we want to leave it at this naming.
+If we evaluate the two equations above, we recognize the following properties:
+*If the same distribution is present  &nbsp; ⇒  &nbsp; $P_Y(·) ≡ P_X(·)$,&nbsp; then &nbsp; $D(P_X || P_Y) = 0$.&nbsp; In all other cases&nbsp; $D(P_X || P_Y) > 0$.&nbsp; The same applies to the variant&nbsp; $D(P_Y || P_X)$.
+*If&nbsp; $P_X(x_μ) ≠ 0$&nbsp; and&nbsp; $P_Y(x_μ) = 0$&nbsp; $($a single and arbitrary&nbsp; $μ$&nbsp; is sufficient for this$)$,&nbsp; the Kullback-Leibler distance&nbsp; $D(P_X || P_Y)$&nbsp; has an infinitely large value.&nbsp; In this case, &nbsp; <br>$D(P_Y || P_X)$&nbsp; is not necessarily infinite either.
+*This statement makes it clear once again that in general&nbsp; $D(P_X || P_Y)$&nbsp; will be unequal to&nbsp; $D(P_Y || P_X)$&nbsp;.
-Wertet man die beiden obigen Gleichungen aus, so erkennt man folgende Eigenschaften:
-*Liegt genau die gleiche Verteilung vor  ⇒  $P_Y(·) ≡ P_X(·)$, so ist $D(P_X || P_Y) = 0$. In allen anderen Fällen ist $D(P_X || P_Y) > 0$. Gleiches gilt für die zweite Variante $D(P_Y || P_X)$.
-*Gilt $P_X(x_μ) ≠ 0$ und $P_Y(x_μ) = 0$ (hierfür genügt ein einziges und beliebiges $μ$), so ergibt sich für die Kullback–Leibler–Distanz $D(P_X || P_Y)$ ein unendlich großer Wert.
-*In diesem Fall ist $D(P_Y || P_X)$ nicht notwendigerweise ebenfalls unendlich. Diese Aussage macht nochmals deutlich, dass im allgemeinen $D(P_X || P_Y)$ ungleich $D(P_Y || P_X)$ sein wird.
-Anschließend Seite werden diese beiden Definitionen an unserem Standardbeispiel ''Würfel–Experiment'' verdeutlicht. Gleichzeitig verweisen wir auf folgende Aufgaben:
+Subsequently, these two definitions are clarified with our standard example&nbsp; "dice experiment".&nbsp; At the same time we refer to the following exercises:
-*[[Aufgaben:3.5_Kullback-Leibler-Distanz_%26_Binominalverteilung|Aufgabe 3.5: Kullback–Leibler–Distanz & Binomialverteilung]]
+*[[Exercise_3.5:_Kullback-Leibler_Distance_and_Binomial_Distribution|"Exercise 3.5: Kullback-Leibler Distance and Binomial Distribution"]]
-*[[Aufgaben:3.5Z_Nochmals_Kullback-Leibler-Distanz|Aufgabe 3.5Z: Nochmals Kullback–Leibler–Distanz]]
+*[[Aufgaben:Exercise_3.5Z:_Kullback-Leibler_Distance_again|"Exercise 3.5Z: Kullback-Leibler Distance again"]]
-*[[Aufgaben:3.6_Partitionierungsungleichung|A3.6: Partitionierungsungleichung]]
+*[[Aufgaben:Exercise_3.6:_Partitioning_Inequality|"Exercise 3.6: Partitioning Inequality"]]
-{{GraueBox|
+[[File:EN_Inf_T_3_1_S3.png|right|frame|Probability mass functions of our dice experiment]]
-TEXT='''Beispiel 3''':&nbsp; Für das Würfel–Experiment haben wir die Wahrscheinlichkeitsfunktionen $P_R(·)$, $P_B(·)$ und deren Näherungen $Q_R(·)$, $Q_B(·)$ definiert.
+{{GraueBox|TEXT=
-*Die Zufallsgröße $R$ mit PMF  $P_R(·)$ gibt die Augenzahl des roten Würfels an und $B$  mit PMF  $P_B(·)$ die Augenzahl des blauen Würfels.
+$\text{Example 4:}$&nbsp; For the dice experiment, we defined in&nbsp; $\text{Example 3}$&nbsp; the probability mass functions&nbsp; $P_R(·)$&nbsp; and&nbsp; $P_B(·)$&nbsp; and their approximations &nbsp; $Q_R(·)$&nbsp; and&nbsp; $Q_B(·)$&nbsp;.
-*Die Näherungen $Q_R(·)$ und $Q_B(·)$ ergeben sich aus dem früher beschriebenen Experiment mit lediglich $N = 18$ Doppelwürfen.
+*The random variable&nbsp; $R$&nbsp; with the probability mass function&nbsp;  $P_R(·)$&nbsp; indicates the numbers of the red cube and&nbsp; $B$&nbsp;  the numbers of the blue cube &nbsp; &rArr; &nbsp;  PMF&nbsp;  $P_B(·)$.
+*The approximations&nbsp; $Q_R(·)$&nbsp; and&nbsp; $Q_B(·)$&nbsp; result from the former experiment with&nbsp;  $N = 18$&nbsp; double throws&nbsp; &rArr; &nbsp; [[Information_Theory/Some_Preliminary_Remarks_on_Two-Dimensional_Random_Variables#Introductory_example_on_the_statistical_dependence_of_random_variables|$\text{Example 1}$]] .
-[[File:P_ID2745__Inf_T_3_1_S3_neu.png|frame|Wahrscheinlichkeitsfunktionen unseres Würfelexperiments]]
-Dann gilt:
+Then holds:
-*Da die Wahrscheinlichkeitsfunktionen $P_R(·)$ und $P_B(·)$ identisch sind, erhält man für die oben definierten Kullback–Leibler–Distanzen $D(P_R \vert \vert P_B)$ und $D(P_B \vert \vert P_R)$ jeweils den Wert $0$.
+*Since&nbsp; $P_R(·)$&nbsp; and&nbsp; $P_B(·)$&nbsp; are identical, we obtain zero for each of the Kullback-Leibler distances&nbsp; $D(P_R \vert \vert P_B)$&nbsp; and&nbsp; $D(P_B \vert \vert P_R)$.
-*Der Vergleich von $P_R(·)$, $Q_R(·)$ ergibt für die erste Variante der Kullback–Leibler–Distanz:
+*The comparison of&nbsp; $P_R(·)$&nbsp; and&nbsp;  $Q_R(·)$&nbsp; yields for the first variant of the Kullback-Leibler distance:
 :$$\begin{align*}D(P_R \hspace{0.05cm} \vert \vert \hspace{0.05cm} Q_R) & =
@@ Line 232: / Line 258: @@
 \cdot  {\rm log}_2 \hspace{0.1cm}  \frac{1/6}{4/18} \hspace{0.1cm}
   \right  ] =  1/6 \cdot \big [
-\cdot  0.585 + 2 \cdot  0 - 2 \cdot  0.415 \big ] \approx 0.0570\,{\rm bit} \hspace{0.05cm}.\end{align*}$$
+\cdot  0.585 + 2 \cdot  0 - 2 \cdot  0.415 \big ] \approx 0.0570\hspace{0.15cm} {\rm bit} .\end{align*}$$
-Hierbei wurde bei der vorzunehmenden Erwartungswertbildung die Tatsache ausgenutzt, dass wegen $P_R(r_1)$ = ... = $P_R(r_6)$ der Faktor $1/6$ ausgeklammert werden kann. Da hier der Logarithmus zur Basis 2 verwendet wurde, ist die Pseudo–Einheit „bit” angefügt.
+:In the calculation of the expected value, the fact that&nbsp; $P_R(r_1) = $ &nbsp;...&nbsp; $ = P_R(r_6)$,&nbsp; the factor 1/6 can be excluded.&nbsp; Since the logarithm to base&nbsp;$ 2$&nbsp; was used here, the pseudo-unit&nbsp; "bit”&nbsp; is added.
-*Für die zweite Variante der Kullback–Leibler–Distanz ergibt sich ein etwas anderer Wert:
+*For the second variant of the Kullback-Leibler distance, a slightly different value results:
 :$$\begin{align*}D(Q_R \hspace{0.05cm}\vert \vert \hspace{0.05cm} P_R)  & =
@@ Line 244: / Line 270: @@
 \cdot \frac{3}{18} \cdot {\rm log}_2 \hspace{0.1cm}  \frac{3/18}{1/6} \hspace{0.1cm} +
 \cdot \frac{4}{18} \cdot {\rm log}_2 \hspace{0.1cm}  \frac{4/18}{1/6}
- \text{...}  \approx 0.0544\,{\rm bit} \hspace{0.05cm}.\end{align*}$$
+  \approx 0.0544\hspace{0.15cm} {\rm bit} .\end{align*}$$
-*Für den blauen Würfel erhält man $D(P_B \vert \vert Q_B) ≈ 0.0283 \ \rm bit$ und $D(Q_B \vert \vert P_B) ≈ 0.0271 \ \rm bit$, also etwas kleinere Kullback–Leibler–Distanzen, da sich die Approximation $Q_B(·)$ von $P_B(·)$ weniger unterscheidet als $Q_R(·)$ von $P_R(·)$.
+*For the blue cube,&nbsp; one obtains&nbsp; $D(P_B \vert \vert Q_B) ≈ 0.0283 \hspace{0.15cm} \rm bit$&nbsp; and&nbsp; $D(Q_B \vert \vert P_B) ≈ 0.0271 \hspace{0.15cm} \rm bit$, i.e. slightly smaller Kullback-Leibler distances, since the approximation&nbsp; $Q_B(·)$&nbsp; of&nbsp; $P_B(·)$&nbsp; differs less than&nbsp; $Q_R(·)$&nbsp; of&nbsp; $P_R(·)$.
-*Vergleicht man die Häufigkeiten $Q_R(·)$ und $Q_B(·)$, so erhält man $D(Q_R \vert \vert Q_B) ≈ 0.0597 \ \rm bit$, $D(Q_B \vert \vert Q_R) ≈ 0.0608 \ \rm bit$. Hier sind die Distanzen am größten, da die Unterschiede zwischen $Q_B(·)$ und $Q_R(·)$ größer sind als zwischen $Q_R(·)$, $P_R(·)$ oder zwischen $Q_B(·)$, $P_B(·)$.}}
+*Comparing the frequencies&nbsp; $Q_R(·)$&nbsp; and&nbsp; $Q_B(·)$, we get&nbsp; $D(Q_R \vert \vert Q_B) ≈ 0.0597 \hspace{0.15cm} \rm bit$&nbsp; and&nbsp; $D(Q_B \vert \vert Q_R) ≈ 0.0608 \hspace{0.15cm} \rm bit$.&nbsp; Here the distances are greatest, since the differences between &nbsp; $Q_B(·)$&nbsp; and&nbsp; $Q_R(·)$&nbsp; are greater than between&nbsp; $Q_R(·)$&nbsp; and&nbsp; $P_R(·)$&nbsp; or between&nbsp; $Q_B(·)$&nbsp; and&nbsp; $P_B(·)$.}}
-==Verbundwahrscheinlichkeit und Verbundentropie  ==
+==Joint probability and joint entropy  ==
-Für den Rest diese dritten Kapitels betrachten wir stets zwei diskrete Zufallsgrößen $X = \{ x_1, x_2$, ... , $x_M \}$ und $Y = \{ y_1, y_2$, ... , $y_K \}$, deren Wertebereiche nicht notwendigerweise übereinstimmen müssen. Das heißt: $K ≠ M$ (in anderer Notation: $|Y| ≠ |X|$) ist durchaus erlaubt.
+<br>
+For the remainder of this third chapter, we always consider two discrete random variable&nbsp; $X = \{ x_1, \ x_2$, ... ,&nbsp; $x_M \}$&nbsp; and&nbsp; $Y = \{ y_1, \ y_2$, ... ,&nbsp; $y_K \}$, whose value ranges do not necessarily have to coincide.&nbsp; This means: &nbsp; $K ≠ M$ $($in other notation:&nbsp; $|Y| ≠ |X|)$&nbsp; is quite permissible.
-Die Wahrscheinlichkeitsfunktion hat somit eine $K × M$–Matrixform mit den Elementen
+The probability mass function thus has a&nbsp; $K×M$ matrix form with the elements
-:$$P_{XY}(X = x_{\mu}\hspace{0.05cm}, Y = y_{\kappa}) = {\rm Pr} \big [( X = x_{\mu})\hspace{0.05cm}\cap \hspace{0.05cm} (Y = y_{\kappa}) \big ] \hspace{0.05cm}.$$
+:$$P_{XY}(X = x_{\mu}\hspace{0.05cm}, \ Y = y_{\kappa}) = {\rm Pr} \big [( X = x_{\mu})\hspace{0.05cm}\cap \hspace{0.05cm} (Y = y_{\kappa}) \big ] \hspace{0.05cm}.$$
-Als Kurzschreibweise verwenden wir $P_{XY}(X, Y)$. Die neue Zufallsgröße $XY$ beinhaltet sowohl die Eigenschaften von $X$ als auch diejenigen von $Y$.
+We use&nbsp; $P_{XY}(X, Y)$ as a shorthand notation.&nbsp; The new random variable&nbsp; $XY$&nbsp; contains both the properties of&nbsp; $X$&nbsp; and those of&nbsp; $Y$.
-{{BlaueBox|
+{{BlaueBox|TEXT=
-TEXT='''Definition''':&nbsp; Die '''Verbundentropie''' (englisch: ''Joint Entropy'') lässt sich als ein Erwartungswert mit der 2D–Wahrscheinlichkeitsfunktion $P_{XY}(X, Y)$ wie folgt darstellen:
+$\text{Definition:}$&nbsp; The&nbsp; &raquo;'''joint entropy'''&laquo; can be represented with the two-dimensional probability mass function&nbsp; $P_{XY}(X, Y)$&nbsp; as an expected value as follows:
 :$$H(XY) = {\rm E} \left [ {\rm log} \hspace{0.1cm} \frac{1}{P_{XY}(X, Y)}\right ] = \sum_{\mu = 1}^{M}  \hspace{0.1cm} \sum_{\kappa = 1}^{K} \hspace{0.1cm}
-  P_{XY}(x_{\mu}\hspace{0.05cm}, y_{\kappa}) \cdot {\rm log} \hspace{0.1cm} \frac{1}{P_{XY}(x_{\mu}\hspace{0.05cm}, y_{\kappa})} \hspace{0.05cm}.$$ }}
+  P_{XY}(x_{\mu}\hspace{0.05cm}, y_{\kappa}) \cdot {\rm log} \hspace{0.1cm} \frac{1}{P_{XY}(x_{\mu}\hspace{0.05cm}, y_{\kappa})} \hspace{0.05cm}.$$
+In the following we use  throughout the logarithm to the base&nbsp; $2$  &nbsp; ⇒  &nbsp; $\log(x) → \log_2(x)$.&nbsp; The numerical value is thus to be assigned the pseudo-unit&nbsp; "bit".
-Im Folgenden verwenden wir durchgehend den Logarithmus zur Basis 2  ⇒  $\log(x) → \log_2(x)$ = $\text{ld}(x)$  ⇒ '' Logarithmus dualis''. Der Zahlenwert ist somit mit der Pseudo–Einheit „bit” zu versehen.
+In general, the following &nbsp; &raquo;'''upper bound'''&laquo;&nbsp; can be given for the joint entropy:
+:$$H(XY) \le H(X) + H(Y)  \hspace{0.05cm}.$$}}
-Allgemein kann für die Verbundentropie die folgende '''obere Schranke''' angegegeben werden:
-:$$H(XY) \le H(X) + H(Y)  \hspace{0.05cm}.$$
-Diese Ungleichung drückt folgenden Sachverhalt aus:
+This inequality expresses the following fact:
-*Das Gleichheitszeichen gilt nur für den Sonderfall statistisch unabhängiger Zufallsgrößen, wie im folgenden '''Beispiel 4''' anhand der Zufallsgrößen $R$ und $B$ demonstriert wird. Hierbei bezeichnen $R$ und $B$ die Augenzahlen eines roten und eines blauen Würfels.
+*The equal sign only applies to the special case of statistically independent random variables, as demonstrated in the following&nbsp; $\text{Example 5}$&nbsp; using the random variables&nbsp; $R$&nbsp; and&nbsp; $B$&nbsp;.&nbsp; Here&nbsp; $R$&nbsp; and&nbsp; $B$&nbsp; denote the numbers of the red and blue dice, respectively:
-*Gibt es dagegen statistische Abhängigkeiten wie im '''Beispiel 5''' zwischen den Zufallsgrößen $R$ und $S = R + B$, so gilt in obiger Gleichung das „<”–Zeichen: $H(RS) < H(R) + H(S)$.
+:$$H(RB) = H(R) + H(B).$$
-*In den Beispielen wird auch gezeigt, in wie weit sich die Verbundentropien $H(RB)$ und $H(RS)$ ändern, wenn man beim Würfel–Experiment nicht unendlich viele Wurfpaare ermittelt, sondern lediglich $N = 18$.
+*If, on the other hand, there are statistical dependencies as in&nbsp; $\text{Example 6}$&nbsp; between the random variables&nbsp; $R$&nbsp; and&nbsp; $S = R + B$, the "<" sign applies in the above inequality:
+:$$H(RS) < H(R) + H(S).$$
+These examples also show to what extent the joint entropies&nbsp; $H(RB)$&nbsp; and&nbsp; $H(RS)$&nbsp; change if one does not determine an infinite number of pairs of throws in the dice experiment, but only&nbsp; $N = 18$.
+[[File:EN_Inf_T_3_1_S5a.png|right|frame|Two-dimensional probability mass function&nbsp; $P_{RB}$&nbsp; and approximation&nbsp; $Q_{RB}$]]
+{{GraueBox|TEXT=
+$\text{Example 5:}$&nbsp; We return to the experiment&nbsp; [[Information_Theory/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Introductory_example_on_the_statistical_dependence_of_random_variables|$\text{Rolling with two dice}$]]&nbsp;:
-{{GraueBox|TEXT='''Beispiel 4''':&nbsp; Wir kommen wieder auf das [[Informationstheorie/Einige_Vorbemerkungen_zu_zweidimensionalen_Zufallsgrößen#Einf.C3.BChrungsbeispiel_zur_statistischen_Abh.C3.A4ngigkeit_von_Zufallsgr.C3.B6.C3.9Fen|Würfel–Experiment]] zurück: Die Zufallsgrößen sind die Augenzahlen des roten Würfels &nbsp; &rArr; &nbsp; $R = \{1, 2, 3, 4, 5, 6\}$ und des blauen Würfels:&nbsp; &rArr; &nbsp; $B = \{1, 2, 3, 4, 5, 6\}$.
+The random variables are the points of the
+*red cube: &nbsp; &rArr; &nbsp; $R = \{1, \ 2,\ 3,\ 4,\ 5,\ 6\}$,
+*blue cube:&nbsp; &rArr; &nbsp; $B = \{1,\ 2,\ 3,\ 4,\ 5,\ 6\}$.
-[[File:P_ID2747__Inf_T_3_1_S5a.png|frame|2D–PMF <i>P<sub>RB</sub></i> und Näherung <i>Q<sub>RB</sub></i>]]
-Die linke Grafik zeigt die Wahrscheinlichkeiten $P_{RB}(·)$, die sich für alle $μ = 1$, ... , $6$ und für alle $κ = 1$, ... , $6$ gleichermaßen zu $1/36$ ergeben. Damit erhält man für die Verbundentropie:
+The left graph shows the probabilities&nbsp;
+:$$P_{RB}(r_\mu,\ b_\kappa ) ={\rm Pr}\big [(R=r_\mu) \hspace{0.05cm}\cap \hspace{0.05cm} (B=b_\kappa)\big],$$
+which for all&nbsp; $μ = 1$, ... , $6$&nbsp; and for all&nbsp; $κ = 1$, ... , $6$&nbsp; equally yield the value&nbsp; $1/36$.&nbsp; Thus, one obtains for the joint entropy:
-:$$H(RB) = H(RB) \big \vert_{N \hspace{0.05cm}\rightarrow \hspace{0.05cm}\infty} =  {\rm log}_2 \hspace{0.1cm} (36) = 5.170\,{\rm bit} \hspace{0.05cm}.$$
+:$$H(RB) = H(RB) \big \vert_{N \hspace{0.05cm}\rightarrow \hspace{0.05cm}\infty} =  {\rm log}_2 \hspace{0.1cm} (36) = 5.170\hspace{0.05cm} {\rm bit} .$$
-Man erkennt aus der linken Grafik und der hier angegebenen Gleichung:
+One can see from the left graph and the equation given here:
-*Da $R$ und $B$ statistisch voneinander unabhängig sind, gilt $P_{RB}(R, B) = P_R(R) · P_B(B)$.
+*Since&nbsp; $R$&nbsp; and&nbsp; $B$&nbsp; are statistically independent of each other, the following applies.
-*Die Verbundentropie ist die Summe der beiden Einzelentropien: $H(RB) = H(R) + H(B)$.
+:$$P_{RB}(R, B) = P_R(R) · P_B(B).$$
+*The joint entropy is the sum of the two individual entropies:   &nbsp;
+:$$H(RB) = H(R) + H(B).$$
-Die rechte Grafik zeigt die angenäherte 2D&ndash;PMF $Q_{RB}(·)$, basierend auf den nur $N = 18$ Wurfpaaren unseres Experiments:
+The right graph shows the approximated two-dimensional probability mass function&nbsp; $Q_{RB}(·)$, based on the only&nbsp; $N = 18$&nbsp; throws of our experiment.&nbsp; Here, no quadratic form of the joint probability&nbsp; $Q_{RB}(·)$&nbsp; results, and the joint entropy derived from it is significantly smaller than&nbsp; $H(RB)$:
-*Hier ergibt sich keine quadratische Form der Verbundwahrscheinlichkeit $Q_{RB}(·)$, und die daraus abgeleitete Verbundentropie ist deutlich kleiner als $H(RB)$:
-:$$H(RB) \big \vert_{N \hspace{0.05cm} =  \hspace{0.05cm}18} = 16 \cdot \frac{1}{18}\cdot {\rm log}_2 \hspace{0.1cm}  \frac{18}{1} \hspace{0.1cm} +\hspace{0.1cm} 1 \cdot \frac{2}{18}\cdot {\rm log}_2 \hspace{0.1cm}  \frac{18}{2} \hspace{0.1cm}= 4.059\,{\rm bit} \hspace{0.05cm}.$$}}
+:$$H(RB) \big \vert_{N \hspace{0.05cm} =  \hspace{0.05cm}18} = 16 \cdot \frac{1}{18}\cdot {\rm log}_2 \hspace{0.1cm}  \frac{18}{1} \hspace{0.1cm} +\hspace{0.1cm} 1 \cdot \frac{2}{18}\cdot {\rm log}_2 \hspace{0.1cm}  \frac{18}{2} \hspace{0.1cm}= 4.059\hspace{0.15cm} {\rm bit} .$$}}
-{{GraueBox|TEXT='''Beispiel 5''':&nbsp; Beim Würfel–Experiment haben wir neben den Zufallsgrößen $R$ (roter Würfel) und $B$ (blauer Würfel) auch die Summe $S = R + B$ betrachtet. Die linke Grafik zeigt, dass $P_{RS}(·)$ nicht als Produkt von $P_R(·)$ und $P_S(·)$ geschrieben werden kann. Mit den Wahrscheinlichkeitsfunktionen
+[[File:EN_Inf_T_3_1_S5b.png|right|frame|Two-dimensional probability mass function&nbsp; $P_{RS}$&nbsp; and approximation&nbsp; $Q_{RS}$]]
+{{GraueBox|TEXT=
+$\text{Example 6:}$&nbsp; In the dice experiment, in addition to the random variables&nbsp; $R$&nbsp; (red cube) and&nbsp; $B$&nbsp; (blue cube) also the sum&nbsp; $S = R + B$&nbsp; is considered.&nbsp; The graph on the left shows that the two-dimensional probability mass function&nbsp; $P_{RS}(·)$&nbsp; cannot be written as a product of&nbsp; $P_R(·)$&nbsp; and&nbsp; $P_S(·)$.
+With the probability functions
-:$$P_R(R) = \left [ \hspace{0.02cm} 1/6\hspace{0.05cm}, 1/6\hspace{0.05cm}, 1/6\hspace{0.05cm}, 1/6\hspace{0.05cm}, 1/6\hspace{0.05cm}, 1/6 \hspace{0.02cm} \right ] \hspace{0.05cm},$$
+:$$P_R(R) = 1/6 \cdot \big [ 1,\ 1,\ 1,\ 1,\ 1,\ 1 \big ],$$
-:$$P_S(S)=\left [ \hspace{0.02cm}  1/36\hspace{0.05cm},2/36\hspace{0.05cm}, 3/36\hspace{0.05cm}, 4/36\hspace{0.05cm}, 5/36\hspace{0.05cm}, 6/36\hspace{0.05cm}, 5/36\hspace{0.05cm}, 4/36\hspace{0.05cm}, 3/36\hspace{0.05cm}, 2/36\hspace{0.05cm}, 1/36\hspace{0.02cm} \right ] \hspace{0.05cm}$$
+:$$P_S(S)=1/36 \cdot \big [ 1,\ 2,\ 3,\ 4,\ 5,\ 6,\ 5,\ 4,\ 3,\ 2,\ 1 \big ] $$
-erhält man für die Entropien:
+one obtains for the entropies:
-:$$H(S) = 2 \hspace{-0.05cm}\cdot \hspace{-0.05cm}\frac{1}{36} \hspace{-0.05cm}\cdot \hspace{-0.05cm} {\rm log}_2 \hspace{0.05cm}  \frac{36}{1} \hspace{0.05cm} + 2 \hspace{-0.05cm}\cdot \hspace{-0.05cm} \frac{2}{36} \hspace{-0.05cm}\cdot \hspace{-0.05cm} {\rm log}_2 \hspace{0.05cm}  \frac{36}{2} \hspace{0.05cm} + 2 \hspace{-0.05cm}\cdot \hspace{-0.05cm} \frac{3}{36} \cdot {\rm log}_2 \hspace{0.05cm}  \frac{36}{3} \hspace{0.05cm} + 2 \hspace{-0.05cm}\cdot \hspace{-0.05cm} \frac{4}{36} \cdot {\rm log}_2 \hspace{0.05cm}  \frac{36}{4} \hspace{0.05cm} +2 \hspace{-0.05cm}\cdot \hspace{-0.05cm} \frac{5}{36} \cdot {\rm log}_2 \hspace{0.05cm}  \frac{36}{5}
+:$$H(RS) = {\rm log}_2 \hspace{0.1cm} (36) \approx 5.170\hspace{0.15cm} {\rm bit} ,$$
-+ 1 \hspace{-0.05cm}\cdot \hspace{-0.05cm} \frac{6}{36} \cdot {\rm log}_2 \hspace{0.05cm}  \frac{36}{6} \approx 3.274\,{\rm bit} \hspace{0.05cm}, $$
+:$$H(R) = {\rm log}_2 \hspace{0.1cm} (6) \approx 2.585\hspace{0.15cm} {\rm bit},$$
-[[File:P_ID2748__Inf_T_3_1_S5b_neu.png|right|frame|2D–PMF <i>P<sub>RS</sub></i> und Näherung <i>Q<sub>RS</sub></i>]]
+$$H(S) = 2 \hspace{-0.05cm}\cdot \hspace{-0.05cm}\frac{1}{36} \hspace{-0.05cm}\cdot \hspace{-0.05cm} {\rm log}_2 \hspace{0.05cm}  \frac{36}{1} \hspace{0.05cm} + 2 \hspace{-0.05cm}\cdot \hspace{-0.05cm} \frac{2}{36} \hspace{-0.05cm}\cdot \hspace{-0.05cm} {\rm log}_2 \hspace{0.05cm}  \frac{36}{2} \hspace{0.05cm} + 2 \hspace{-0.05cm}\cdot \hspace{-0.05cm} \frac{3}{36} \cdot {\rm log}_2 \hspace{0.05cm}  \frac{36}{3} \hspace{0.05cm} + $$
-:$$H(R) = {\rm log}_2 \hspace{0.1cm} (6) \approx 2.585\,{\rm bit} \hspace{0.05cm},$$
-:$$H(RS) = {\rm log}_2 \hspace{0.1cm} (36) \approx 5.170\,{\rm bit} \hspace{0.05cm}.$$
+::$$+ 2 \hspace{-0.05cm}\cdot \hspace{-0.05cm} \frac{4}{36} \cdot {\rm log}_2 \hspace{0.05cm}  \frac{36}{4} \hspace{0.05cm} +2 \hspace{-0.05cm}\cdot \hspace{-0.05cm} \frac{5}{36} \cdot {\rm log}_2 \hspace{0.05cm}  \frac{36}{5}
++ 1 \hspace{-0.05cm}\cdot \hspace{-0.05cm} \frac{6}{36} \cdot {\rm log}_2 \hspace{0.05cm}  \frac{36}{6}  $$
-Aus diesen Zahlenwerten erkennt man:
+:$$\Rightarrow \hspace{0.3cm} H(S) \approx 3.274\hspace{0.15cm} {\rm bit} . $$
-*Aufgrund der statistischen Abhängigkeit zwischen dem roten Würfel und der Summe ist die Verbundentropie $H(RS) ≈ 5.170 \ \rm bit$ kleiner als $H(R) + H(S) ≈ 5.877 \ \rm bit$.
+<br clear=all>
-*Der Vergleich mit dem''' Beispiel 4''' zeigt, dass $H(RS) =H(RB)$ ist. Der Grund ist, dass bei Kenntnis von $R$ die Zufallsgrößen $B$ und $S$ genau die gleichen Informationen liefern.
+From these numerical values one can see:
+*The comparison with the&nbsp; $\text{Example 5}$&nbsp; shows that&nbsp; $H(RS) =H(RB)$.&nbsp; The reason for this is that, knowing&nbsp; $R$&nbsp; the random variables&nbsp; $B$&nbsp; and&nbsp; $S$&nbsp; give exactly the same information.
+*Due to the statistical dependence between the red cube and the sum, &nbsp; $H(RS) ≈ 5.170 \hspace{0.15cm} \rm bit$&nbsp; is smaller than the sum&nbsp; $H(R) + H(S) ≈ 5.877 \hspace{0.15cm} \rm bit.$
-Rechts dargestellt ist der Fall, dass die 2D–PMF $Q_{RS}(·)$ empirisch ermittelt wurde ($N = 18$). Obwohl sich aufgrund des sehr kleinen $N$–Wertes ein völlig anderes Bild ergibt, liefert die Näherung für $H(RS)$ den exakt gleichen Wert wie die Näherung für $H(RB)$ im Beispiel 4:
+<br clear=all>
+Shown on the right is the case where the two-dimensional probability mass function&nbsp; $Q_{RS}(·)$&nbsp; was determined empirically&nbsp; $(N = 18)$.&nbsp; Although a completely different figure emerges due to the very small&nbsp; $N$&nbsp; value, the approximation for&nbsp; $H(RS)$&nbsp; provides exactly the same value as the approximation for&nbsp; $H(RB)$&nbsp; in&nbsp; $\text{Example 5}$:
 :$$H(RS) \big \vert_{N \hspace{0.05cm} =  \hspace{0.05cm}18} = H(RB) \big \vert_{N \hspace{0.05cm} =  \hspace{0.05cm}18} = 4.059\,{\rm bit} \hspace{0.05cm}.$$}}
@@ Line 320: / Line 364: @@
-==Aufgaben zum Kapitel==
+==Exercises for the chapter==
+<br>
-[[Aufgaben:3.1 Wahrscheinlichkeiten beim Würfeln|Aufgabe 3.1: &nbsp; Wahrscheinlichkeiten beim Würfeln]]
+[[Aufgaben:Exercise_3.1:_Probabilities_when_Rolling_Dice|Exercise 3.1: Probabilities when Rolling Dice]]
-[[Aufgaben:3.1Z Karten ziehen|Zusatzaufgabe 3.1Z: &nbsp; Karten ziehen]]
+[[Aufgaben:Exercise_3.1Z:_Drawing_Cards|Exercise 3.1Z: Drawing Cards]]
-[[Aufgaben:3.2 Erwartungswertberechnungen|Aufgabe 3.2: &nbsp; Erwartungswertberechnungen]]
+[[Aufgaben:Exercise_3.2:_Expected_Value_Calculations|Exercise 3.2: Expected Value Calculations]]
-[[Aufgaben:3.2Z 2D–Wahrscheinlichkeitsfunktion|Zusatzaufgabe 3.2Z: &nbsp; 2D–Wahrscheinlichkeitsfunktion]]
+[[Aufgaben:Exercise_3.2Z:_Two-dimensional_Probability_Mass_Function|Exercise 3.2Z: Two-dimensional Probability Mass Function]]
-[[Aufgaben:3.3 Entropie von Ternärgrößen|Aufgabe 3.3: &nbsp; Entropie von Ternärgrößen]]
+[[Aufgaben:Exercise_3.3:_Entropy_of_Ternary_Quantities|Exercise 3.3: Entropy of Ternary Quantities]]
-[[Aufgaben:3.4 Entropie für verschiedene Wahrscheinlichkeiten|Aufgabe 3.4: &nbsp; Entropie für verschiedene Wahrscheinlichkeiten]]
+[[Aufgaben:Exercise_3.4:_Entropy_for_Different_PMF|Exercise 3.4: Entropy for Different PMF]]
-[[Aufgaben:3.5 Kullback-Leibler-Distanz & Binominalverteilung|Aufgabe 3.5: &nbsp; Kullback-Leibler-Distanz & Binominalverteilung]]
+[[Exercise_3.5:_Kullback-Leibler_Distance_and_Binomial_Distribution|Exercise 3.5: Kullback-Leibler Distance and Binomial Distribution]]
-[[Aufgaben:3.5Z Nochmals Kullback-Leibler-Distanz|Zusatzaufgabe 3.5Z: &nbsp; Nochmals Kullback-Leibler-Distanz]]
+[[Aufgaben:Exercise_3.5Z:_Kullback-Leibler_Distance_again|Exercise 3.5Z: Kullback-Leibler Distance again]]
-[[Aufgaben:3.6 Partitionierungsungleichung|Aufgabe 3.6: &nbsp; Partitionierungsungleichung]]
+[[Aufgaben:Exercise_3.6:_Partitioning_Inequality|Exercise 3.6: Partitioning Inequality]]
-==Quellenverzeichnis==
+==References==
 <references />