Exercise 3.5: GSM Full-Rate Vocoder
The "GSM Full-Rate Vocoder", standardized for the GSM system in 1991, is a codec, i.e. a joint realization of coder and decoder. It combines three methods for the compression of speech signals:
- Linear Predictive Coding $\rm (LPC)$,
- Long Term Prediction $\rm (LTP)$, and
- Regular Pulse Excitation $\rm (RPE)$.
The numbers shown in the graphic indicate how many bits each of the three units of this Full-Rate speech codec generates per frame of $20$ milliseconds duration.
It should be noted that LTP and RPE, unlike LPC, do not work frame by frame but on sub-blocks of $5$ milliseconds. However, this has no influence on solving the exercise.
The input signal in the above graphic is the digitized speech signal $s_{\rm R}(n)$.
This results from the analog speech signal $s(t)$ by
- a suitable limitation to the bandwidth $B$,
- sampling at the sampling rate $f_{\rm A} = 8 \ \rm kHz$,
- quantization with $13 \ \rm bit$,
- subsequent segmentation into blocks of $20 \ \rm ms$ each.
The remaining preprocessing steps will not be discussed in detail here.
The questions and solutions can also be taken (almost completely) from "Exercise 3.4Z".
Hint:
- This exercise belongs to the chapter "Voice Coding".
Questions
Solution
(1) To satisfy the sampling theorem, the bandwidth must not exceed $f_{\rm A}/2 \hspace{0.15cm} \underline{= 4 \ \rm kHz}$.
(2) The given sampling rate $f_{\rm A} = 8 \ \rm kHz$ yields a spacing between successive samples of $T_{\rm A} = 1/f_{\rm A} = 0.125 \ \rm ms$.
- Thus, a speech frame $(20 \ \rm ms)$ consists of $N_{\rm R} = 20/0.125\hspace{0.15cm} \underline{= 160 \ \rm samples}$, each quantized with $13 \ \rm bits$.
- The data rate is thus
- $$R_{\rm In} = \frac{160 \cdot 13}{20 \,{\rm ms}} \hspace{0.15cm} \underline {= 104\,{\rm kbit/s}}\hspace{0.05cm}.$$
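The arithmetic of subtask (2) can be checked with a short sketch. The function name and its parameter names are chosen here for illustration only; the numerical values are those given in the exercise.

```python
def frame_parameters(sampling_rate_hz=8000, bits_per_sample=13, frame_ms=20):
    """Return (sample spacing in ms, samples per frame, input rate in kbit/s)."""
    spacing_ms = 1000 / sampling_rate_hz                     # T_A = 1 / f_A
    samples_per_frame = sampling_rate_hz * frame_ms // 1000  # N_R
    # Bits per millisecond are numerically equal to kbit/s.
    rate_kbit_s = samples_per_frame * bits_per_sample / frame_ms
    return spacing_ms, samples_per_frame, rate_kbit_s


spacing_ms, n_r, r_in = frame_parameters()
print(spacing_ms, n_r, r_in)  # 0.125 160 104.0
```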
(3) From the graph, it can be seen that $36$ (LPC) $+ 36$ (LTP) $+ 188$ (RPE) $= 260 \ \rm bits$ are output per speech frame.
- From this, the output data rate is calculated to be
- $$R_{\rm Out} = \frac{260}{20 \,{\rm ms}} \hspace{0.15cm} \underline {= 13\,{\rm kbit/s}}\hspace{0.05cm}.$$
- The compression factor achieved by the Full-Rate speech codec is thus $R_{\rm In}/R_{\rm Out} = 104/13 = 8$.
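As a minimal sketch of subtask (3), the bit budget per frame can be summed and converted into a data rate. The dictionary and function names are illustrative assumptions; the bit counts are the ones read from the graphic, and the input rate of $104 \ \rm kbit/s$ is taken from subtask (2).

```python
# Bits per 20 ms speech frame, as read from the graphic of the exercise.
BITS_PER_FRAME = {"LPC": 36, "LTP": 36, "RPE": 188}
FRAME_MS = 20


def output_rate_kbit_s(bits_per_frame, frame_ms=FRAME_MS):
    """Output data rate in kbit/s (bits per millisecond)."""
    return sum(bits_per_frame.values()) / frame_ms


def compression_factor(input_rate, output_rate):
    """Ratio of input to output data rate."""
    return input_rate / output_rate


r_out = output_rate_kbit_s(BITS_PER_FRAME)
print(r_out, compression_factor(104.0, r_out))  # 13.0 8.0
```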
(4) Statements 1 and 2 are correct:
- The $36$ LPC bits describe a total of eight filter coefficients of a non-recursive filter. Eight values of the autocorrelation function (ACF) are determined from the short-time analysis and converted into reflection coefficients $r_{k}$ by the so-called Schur recursion.
- From these, the eight log area ratio (LAR) coefficients are calculated according to the function ${\rm ln}[(1 - r_{k})/(1 + r_{k})]$, quantized with different numbers of bits and passed on to the receiver.
- Compared to its input $s_{\rm R}(n)$, the LPC output signal has a significantly smaller amplitude, a significantly reduced dynamic range and a flatter spectrum.
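The mapping from reflection coefficients to LAR coefficients can be sketched as follows, using the function ${\rm ln}[(1 - r_{k})/(1 + r_{k})]$ stated above. The function names and the example reflection coefficients are hypothetical. The bit allocation per LAR coefficient is not given in this exercise; the values below are the allocation commonly cited for the GSM Full-Rate standard and should be read as an assumption. They do sum to the $36$ LPC bits.

```python
import math

# Assumed bit allocation for LAR 1..8 (not stated in the exercise itself).
LAR_BITS = [6, 6, 5, 5, 4, 4, 3, 3]


def reflection_to_lar(r):
    """LAR value ln[(1 - r)/(1 + r)] for a reflection coefficient |r| < 1."""
    if not -1.0 < r < 1.0:
        raise ValueError("reflection coefficient must lie in (-1, 1)")
    return math.log((1.0 - r) / (1.0 + r))


def lar_to_reflection(lar):
    """Inverse mapping: r = (1 - e^LAR)/(1 + e^LAR) = -tanh(LAR/2)."""
    return -math.tanh(lar / 2.0)


# Hypothetical reflection coefficients, for illustration only.
example_r = [0.9, -0.6, 0.4, -0.3, 0.2, -0.1, 0.05, 0.0]
for r in example_r:
    lar = reflection_to_lar(r)
    print(f"r = {r:+.2f}  LAR = {lar:+.4f}  back = {lar_to_reflection(lar):+.2f}")
```

The logarithmic mapping expands the region near $|r_{k}| = 1$, where small errors in $r_{k}$ affect the filter most, which is one motivation for quantizing LAR values rather than the reflection coefficients directly.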
(5) Statements 1 and 3 are correct, but not statement 2:
- The LTP analysis and filtering are carried out in sub-blocks of $5 \ \rm ms$ ($40$ samples), i.e. four times per speech frame.
- To do this, the cross-correlation function (CCF) is formed between the current sub-block and the three preceding sub-blocks.
- For each sub-block, an LTP delay and an LTP gain are determined that best fit the sub-block.
- A correction signal from the subsequent RPE unit is also taken into account.
- In the case of long-term prediction, as with LPC, the output is redundancy-reduced compared to the input.
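The search for LTP delay and LTP gain described in subtask (5) can be illustrated with a simplified sketch. It is not the standardized procedure: the function name, the lag range of $40$ to $120$ samples (the span of three preceding sub-blocks) and the unquantized gain are assumptions made here, and the RPE correction signal is ignored.

```python
def ltp_search(current, history, min_lag=40, max_lag=120):
    """Return (delay, gain) maximizing the cross-correlation.

    current: samples of the current sub-block (40 samples here).
    history: at least max_lag preceding samples, oldest first.
    """
    n = len(current)
    best_lag, best_ccf = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        start = len(history) - lag
        segment = history[start:start + n]
        # Cross-correlation between current sub-block and delayed segment.
        ccf = sum(c * p for c, p in zip(current, segment))
        if ccf > best_ccf:
            best_lag, best_ccf = lag, ccf
    start = len(history) - best_lag
    segment = history[start:start + n]
    energy = sum(p * p for p in segment)
    gain = best_ccf / energy if energy > 0 else 0.0
    return best_lag, gain


# Toy example: one pulse of height 2 in the past, repeated 57 samples
# later with height 1 in the current sub-block.
history = [0.0] * 120
history[70] = 2.0
current = [0.0] * 40
current[7] = 1.0
print(ltp_search(current, history))  # (57, 0.5)
```

In the toy example the pulse in the current sub-block lies $57$ samples after the pulse in the history and has half its height, so the sketch returns the delay $57$ and the gain $0.5$.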
(6) Statements 2 and 3 are correct:
- That statement 1 is false can already be seen from the graphic on the problem page, since $188$ of the $260$ output bits come from the RPE.
- Regarding the last statement: the RPE searches for the subsequence with the maximum energy.
- The parameter "RPE pulses" alone occupies $156$ of the $260$ output bits.
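The selection of the subsequence with maximum energy can be sketched as below. The exercise does not specify how the candidate subsequences are formed; the sketch assumes, as commonly described for the GSM Full-Rate codec, that each sub-block of $40$ samples is decimated by a factor of $3$, giving four candidate grids of $13$ pulses each. The function name and the figure of $3$ bits per pulse are likewise assumptions; with them, $4 \cdot 13 \cdot 3 = 156$ bits per frame result, which matches the number stated above.

```python
def rpe_grid_selection(residual, decimation=3, pulses=13):
    """Return (grid offset, pulse subsequence) with maximum energy."""
    # For 40 samples: 40 - 3 * 12 = 4 candidate grid offsets (0..3).
    candidates = len(residual) - decimation * (pulses - 1)
    best_offset, best_energy = 0, -1.0
    for offset in range(candidates):
        sub = residual[offset::decimation][:pulses]
        energy = sum(x * x for x in sub)
        if energy > best_energy:
            best_offset, best_energy = offset, energy
    return best_offset, residual[best_offset::decimation][:pulses]


# Toy sub-block whose energy sits entirely on grid offset 1.
residual = [0.0] * 40
for k in range(13):
    residual[1 + 3 * k] = 1.0
offset, selected = rpe_grid_selection(residual)
print(offset, len(selected))  # 1 13

# Assumed bit count of the "RPE pulses" parameter per 20 ms frame.
SUB_BLOCKS, PULSES, BITS_PER_PULSE = 4, 13, 3
print(SUB_BLOCKS * PULSES * BITS_PER_PULSE)  # 156
```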