Slide1
A Full Frequency Masking Vocoder for Legal Eavesdropping Conversation Recording
R. F. B. Sotero Filho, H. M. de Oliveira (qPGOM), R. Campello de Souza
Signal Processing Group, Federal University of Pernambuco – UFPE
E-mail: rsotero@hotmail.com.br, {hmo,ricardo}@ufpe.br
Slide2
Abstract:
• New approach for a vocoder
• Based on: full frequency masking by octaves
• Useful to save bandwidth (applications requiring intelligibility)
• Recommended for: legal eavesdropping of long conversations.
Slide3
Introduction
Vocoder = contraction of "voice encoder":
• does not recreate the original waveform in appearance (but the result should be perceptually similar to it)
• first described by Homer Dudley at Bell Telephone Laboratories in 1939
• parameters are extracted from the spectrum and updated every 10-25 ms
Properties of voice:
• limitation of the human auditory system
• physiology of the voice generation process
Slide4
Psycho-Acoustics of the Human Auditory System
• Frequency Masking: masking in frequency, or "reduced audibility of a sound due to the presence of another"
• Insensitivity to the phase: the human ear has little sensitivity to the phase of signals
Slide5
Simplification of the spectrum via frequency masking
For each voice segment:
• FFT of blocklength 160 (frame of 20 ms)
• The spectrum is segmented into regions of influence (octaves): 64-128 Hz, 128-256 Hz, and so on. The range 32-64 Hz is removed.
• Each spectral sample corresponds to a multiple of 50 Hz (8 kHz / 160)
Slide6
Table 1. Number of spectral lines per octave (DFT of length N=160, sample rate 8 kHz)
Octave (Hz)     # spectral samples/octave
32-64           1
64-128          1
128-256         3
256-512         5
512-1024        10
1024-2048       20
2048-4096       39
A total of 79 frequencies (DFT with N=160) is reduced to 4 survivors (about 5% of the spectral components)!
Slide7
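The counts in Table 1 can be checked with a few lines of Python. This is only a sketch: the names `FS`, `N`, `EDGES` and `lines_per_octave` are illustrative, and it assumes that the 79 frequencies are the DFT lines k = 1..79 (DC and the 4 kHz Nyquist line excluded), which is what the total in the table suggests.

```python
# Count DFT lines per octave for N = 160 at 8 kHz (compare with Table 1).
FS = 8000          # sample rate in Hz
N = 160            # DFT length (one 20 ms frame)
STEP = FS / N      # spacing between spectral lines: 50 Hz

# Octave edges in Hz: 32, 64, 128, ..., 4096
EDGES = [32 * 2 ** i for i in range(8)]


def lines_per_octave():
    """Return {(lo, hi): count} for the lines k = 1 .. N/2 - 1."""
    counts = {}
    for lo, hi in zip(EDGES[:-1], EDGES[1:]):
        counts[(lo, hi)] = sum(
            1 for k in range(1, N // 2) if lo < k * STEP <= hi
        )
    return counts


if __name__ == "__main__":
    for (lo, hi), count in lines_per_octave().items():
        print(f"{lo}-{hi} Hz: {count}")
```

No multiple of 50 Hz coincides with an octave edge (a power of two), so the choice of open or closed interval ends does not change the counts.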
Figure 1. The spectrum of a voice frame computed by the FFT: original spectrum and simplified full-masking spectrum.
This technique is called full frequency masking.
Slide8
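A minimal sketch of full frequency masking for one frame is given below. It rests on assumptions that the slides do not spell out: the survivor of each octave is taken to be its largest-magnitude line, and only the four octaves from 256 Hz upward are kept (consistent with the four survivors mentioned above). The function names and the naive DFT are illustrative only, not the authors' routine.

```python
import cmath
import math

FS = 8000   # sample rate in Hz
N = 160     # frame length (20 ms)
# Assumed set of retained octaves (Hz); one survivor is kept in each.
OCTAVES = [(256, 512), (512, 1024), (1024, 2048), (2048, 4096)]


def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0 .. n/2 - 1 of a real frame (naive O(n^2) DFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(frame)
        )
        mags.append(abs(acc))
    return mags


def full_masking(frame):
    """Return [(bin_index, magnitude)], one survivor per octave.

    Assumption: the survivor is the strongest line of the octave;
    every other line in that octave is discarded (fully masked).
    """
    mags = dft_magnitudes(frame)
    step = FS / len(frame)
    survivors = []
    for lo, hi in OCTAVES:
        bins = [k for k in range(1, len(mags)) if lo < k * step <= hi]
        best = max(bins, key=lambda k: mags[k])
        survivors.append((best, mags[best]))
    return survivors


if __name__ == "__main__":
    tones = [(300, 1.0), (400, 0.3), (700, 1.0), (1500, 1.0), (3000, 1.0)]
    frame = [
        sum(a * math.sin(2 * math.pi * f * i / FS) for f, a in tones)
        for i in range(N)
    ]
    for k, m in full_masking(frame):
        print(f"bin {k} ({k * FS / N:.0f} Hz): magnitude {m:.1f}")
```

In the usage example the weaker 400 Hz tone shares an octave with the 300 Hz tone, so only the latter survives.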
Signal synthesis via spectral filling
The beta distribution is a probability distribution defined over $0 \le x \le 1$, characterized by a pair of parameters $\alpha$ and $\beta$:
$$P(x) = \frac{1}{B(\alpha,\beta)}\, x^{\alpha-1} (1-x)^{\beta-1}, \qquad 1 < \alpha, \beta < +\infty,$$
whose normalizing factor is
$$B(\alpha,\beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)},$$
where $\Gamma(\cdot)$ is the generalized Euler factorial function and $B(\cdot,\cdot)$ is the Beta function.
Figure 2. Envelope shape of the survivor tone for different parameters $\alpha$ and $\beta$.
Slide9
By making the fitting:
$$\text{newmode} = \frac{\alpha-1}{\alpha+\beta-2}\,(f_M - f_m) + f_m.$$
The upper limit of the support is equivalent to the difference between the upper ($f_M$) and lower ($f_m$) normalized cutoff frequencies of each octave, i.e., $f_M - f_m$.
To fill the spectrum of each frame:
$$P(x) = \frac{1}{(f_M - f_m)^{\alpha+\beta-2}}\,(x - f_m)^{\alpha-1}\,(f_M - x)^{\beta-1}.$$
Slide10
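The two expressions above can be evaluated directly, as in the sketch below. The function names are hypothetical, the example values of α and β are arbitrary, and frequencies are given in Hz rather than normalized units for readability. How α and β are chosen for a given survivor is not stated in the slides and is not modelled here.

```python
def beta_mode(alpha, beta, f_lo, f_hi):
    """Location of the envelope peak ('newmode') inside the octave [f_lo, f_hi]."""
    return (alpha - 1) / (alpha + beta - 2) * (f_hi - f_lo) + f_lo


def beta_envelope(x, alpha, beta, f_lo, f_hi):
    """Beta-shaped envelope on [f_lo, f_hi], without the B(alpha, beta) factor.

    Implements (x - f_lo)^(alpha-1) * (f_hi - x)^(beta-1)
               / (f_hi - f_lo)^(alpha+beta-2), and 0 outside the octave.
    """
    if not f_lo <= x <= f_hi:
        return 0.0
    width = f_hi - f_lo
    shape = (x - f_lo) ** (alpha - 1) * (f_hi - x) ** (beta - 1)
    return shape / width ** (alpha + beta - 2)


if __name__ == "__main__":
    f_lo, f_hi = 512.0, 1024.0      # one octave
    alpha, beta = 2.0, 3.0          # arbitrary example parameters (> 1)
    print("mode (Hz):", beta_mode(alpha, beta, f_lo, f_hi))
    # Envelope sampled on the 50 Hz grid of spectral lines in this octave.
    for f in range(550, 1001, 50):
        print(f, round(beta_envelope(f, alpha, beta, f_lo, f_hi), 4))
```

The envelope vanishes at both octave edges and peaks at the mode, so lines near the survivor are filled with larger amplitudes than lines near the edges.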
A few audio files generated by this vocoder (piece of speech from radio) are available at the URL http://www2.ee.ufpe.br/codec/vocoder.html
(vocoder with Hamming windowing)
Figure 3. Full masking and spectral filling
Slide11
Quantization and Coding of Speech Signals
The maximum excursion of the full spectrum was divided into 256 intervals of equal length, each represented by one byte.
There are no negative samples to be quantized => the quantizer cannot be bipolar.
Table 2. Bit allocation in a voice frame (20 ms). The required number of bits is expressed as A + P, where A is the number of bits for the spectral line amplitude and P the number of bits to express the relative position within the octave.
Relevant octave        # possible survivor components    Bits A + P
#1 (256-512 Hz)        5                                 8 + 3
#2 (512-1024 Hz)       10                                8 + 4
#3 (1024-2048 Hz)      20                                8 + 5
#4 (2048-4096 Hz)      39                                8 + 6
Slide12
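The bit budget of Table 2 follows from simple arithmetic, sketched below together with a unipolar 256-level quantizer. The names are illustrative, and the rounding rule of `quantize` (truncation, with the top value clamped to 255) is an assumption; the slides only state that the range is split into 256 equal intervals.

```python
SURVIVOR_CANDIDATES = [5, 10, 20, 39]   # lines per relevant octave (Table 2)
AMPLITUDE_BITS = 8                      # 256 quantization intervals
FRAME_MS = 20                           # frame duration in milliseconds


def position_bits(n_candidates):
    """Smallest number of bits able to index n_candidates positions."""
    return (n_candidates - 1).bit_length()


def frame_budget():
    """Return (position bits per octave, total amplitude bits, total bits)."""
    p_bits = [position_bits(n) for n in SURVIVOR_CANDIDATES]
    a_bits = AMPLITUDE_BITS * len(SURVIVOR_CANDIDATES)
    return p_bits, a_bits, sum(p_bits) + a_bits


def bit_rate_bps():
    """Bits per second for one frame every FRAME_MS milliseconds."""
    return frame_budget()[2] * 1000 / FRAME_MS


def quantize(value, max_value, levels=256):
    """Unipolar uniform quantizer: map 0..max_value to an index 0..levels-1."""
    if max_value <= 0:
        return 0
    index = int(value / max_value * levels)
    return min(max(index, 0), levels - 1)


if __name__ == "__main__":
    print(frame_budget())
    print(bit_rate_bps(), "bit/s")
```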
Each voice frame needs 50 bits (18 for identifying positions and 32 for the amplitudes of the masking tones).
The vocoder rate is 50 bits / 20 ms = 2.5 kbps.
The binary format .voz
The representation of a voice frame in this format (extension .voz): the 50 bits are distributed into four sub-blocks, each giving the value of the spectral sample followed by its respective position in the spectrum. Voice files recorded in the .wav format are converted to this binary format by a Matlab routine.
Slide13
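One possible way to lay out such a 50-bit frame is sketched below. Only the ordering inside a sub-block (amplitude first, then position) comes from the description above; the most-significant-first bit order, the use of a single integer per frame, and the names `pack_frame` and `unpack_frame` are assumptions made for illustration, not the actual .voz specification.

```python
POSITION_BITS = [3, 4, 5, 6]   # position field width per octave
AMPLITUDE_BITS = 8             # amplitude field width


def pack_frame(survivors):
    """Pack four (amplitude, position) pairs into one 50-bit integer."""
    if len(survivors) != len(POSITION_BITS):
        raise ValueError("expected one survivor per octave")
    word = 0
    for (amp, pos), pbits in zip(survivors, POSITION_BITS):
        if not 0 <= amp < (1 << AMPLITUDE_BITS):
            raise ValueError("amplitude out of range")
        if not 0 <= pos < (1 << pbits):
            raise ValueError("position out of range")
        word = (word << AMPLITUDE_BITS) | amp   # amplitude first
        word = (word << pbits) | pos            # then position
    return word


def unpack_frame(word):
    """Inverse of pack_frame: recover the four (amplitude, position) pairs."""
    fields = []
    for pbits in reversed(POSITION_BITS):
        pos = word & ((1 << pbits) - 1)
        word >>= pbits
        amp = word & ((1 << AMPLITUDE_BITS) - 1)
        word >>= AMPLITUDE_BITS
        fields.append((amp, pos))
    return fields[::-1]


if __name__ == "__main__":
    frame = [(200, 4), (17, 9), (255, 19), (1, 38)]
    word = pack_frame(frame)
    print(f"{word:050b}")
    print(unpack_frame(word))
```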
Figure 4. Frame of files in the format .voz (20 ms).
Table 3. MOS scores for the voice signals synthesized by four different techniques
Vocoder technique                                                          MOS score
1. Synthesized signals with no spectral filling                            3.0
2. Vocoder signals reconstructed via beta spectral filling technique       2.5
3. Synthesized voice signals combining techniques 1 and 2 (linear)         2.8
4. Voice signals from item 2, but with an extra Hamming windowing          3.0
Intelligibility and voice quality versus bit rate
Voice quality is estimated using the "Mean Opinion Score (MOS)"
Slide14
Conclusions
• New vocoder: the voice signal is represented using fewer samples of the spectrum.
• Voice (acceptable quality) at a rate of a few kbits/s.
• A new technique of spectral filling: not helpful in improving the voice quality, but it improves naturalness.
APPLICATIONS
• maintenance voice channels in large plants
• speaker recognition system
• monitoring voice conversation from authorized eavesdropping
THAT’S ALL FOLKS! TKS.
Slide15
Pre-signal processing
• Sampling: Shannon sampling theorem (a signal band-limited to $f_m$ Hz is sampled at a rate of at least $2f_m$ equally spaced samples per second). LPF.
• Voice Segmentation and Windowing: partition of the speech signal into pieces (stationary frames) of about 10-40 ms. Hamming window chosen due to its softness at the edges.
• Pre-emphasis: the speech spectrum falls off at about -6 dB/octave when radiated from the lips during speech. This spectral distortion can be eliminated by applying a filter with a response of approximately +6 dB/octave:
$$y(n) = x(n) - a\,x(n-1), \qquad 1 \le n < M,$$
where M is the number of samples of x(n), y(n) is the emphasized signal, and the constant "a" is normally set to 0.95.
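The pre-emphasis filter and the Hamming windowing described above can be sketched as follows. Several details are assumptions not given in the slides: the first sample is passed through unchanged (the formula is only defined for n ≥ 1), frames do not overlap, and the frame length of 160 samples is taken from the 20 ms frames at 8 kHz used earlier. The function names are illustrative.

```python
import math


def pre_emphasis(x, a=0.95):
    """y(n) = x(n) - a*x(n-1) for n >= 1; the first sample is passed through."""
    if not x:
        return []
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]


def hamming(m):
    """Hamming window of length m: 0.54 - 0.46*cos(2*pi*n/(m-1))."""
    if m == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (m - 1)) for n in range(m)]


def windowed_frames(x, frame_len=160):
    """Split x into non-overlapping frames and apply the window to each."""
    w = hamming(frame_len)
    return [
        [s * wn for s, wn in zip(x[i:i + frame_len], w)]
        for i in range(0, len(x) - frame_len + 1, frame_len)
    ]


if __name__ == "__main__":
    signal = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(480)]
    frames = windowed_frames(pre_emphasis(signal))
    print(len(frames), "frames of", len(frames[0]), "samples")
```

Trailing samples that do not fill a whole frame are dropped in this sketch.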