Recognizer Issues
Problem: Our recognizer translates the audio into a possible string of text. How do we know the translation is correct?
Problem: How do we handle a string of text containing words that are not in the dictionary?
Problem: How do we handle strings of valid words that do not form sentences whose semantics make sense?
Correcting Recognizer Ambiguities
Problem: Resolving words not in the dictionary
Question: How different is a recognized word from those that are in the dictionary?
Solution: Count the single-step transformations necessary to convert one word into another.
Example: caat --> cat with the removal of one letter
Example: fplc --> fireplace requires adding the letters "ire" after the f, an "a" before the c, and an "e" at the end
Spelling Error Types
Levenshtein distance: the smallest number of insertion, deletion, or substitution operations that transforms one string into another
Examples: differences from the word "cat"
Insertion: catt
Deletion: ct
Substitution: car
Transposition: cta
Note: There are similar well-defined rules for pronunciation variations
Spell Check Algorithm
FOR each unrecognized word (or word out of context)
    Generate a list of candidates
        Those that differ by a one-step transformation
        Those that exist in the lexicon
    Order the candidates using language-based statistics
    RETURN the most likely word
Note: We could extend the algorithm to consider multiple transformations when we cannot find a single-step solution (a sketch in code follows)
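As a rough sketch of this idea (the lexicon, the frequency counts, and the helper names are hypothetical; transpositions are omitted, and the ranking here uses only P(candidate) rather than the full P(typo|candidate) * P(candidate) discussed on the following slides):

    import java.util.*;

    public class SpellCheck {
        // Candidates one edit away (deletion, substitution, insertion) that are in the lexicon
        static Set<String> candidates(String word, Set<String> lexicon) {
            Set<String> result = new HashSet<>();
            String letters = "abcdefghijklmnopqrstuvwxyz";
            for (int i = 0; i <= word.length(); i++) {
                if (i < word.length())                     // deletion of the letter at position i
                    result.add(word.substring(0, i) + word.substring(i + 1));
                for (char ch : letters.toCharArray()) {
                    result.add(word.substring(0, i) + ch + word.substring(i));      // insertion
                    if (i < word.length())                 // substitution
                        result.add(word.substring(0, i) + ch + word.substring(i + 1));
                }
            }
            result.retainAll(lexicon);                     // keep only words in the lexicon
            return result;
        }

        // Pick the candidate with the highest corpus frequency (unigram P(candidate))
        static String mostLikely(String word, Set<String> lexicon, Map<String, Integer> freq) {
            String best = word;
            int bestCount = -1;
            for (String c : candidates(word, lexicon)) {
                int count = freq.getOrDefault(c, 0);
                if (count > bestCount) { bestCount = count; best = c; }
            }
            return best;
        }

        public static void main(String[] args) {
            Set<String> lexicon = new HashSet<>(Arrays.asList("cat", "coat", "cart"));
            Map<String, Integer> freq = new HashMap<>();
            freq.put("cat", 900); freq.put("coat", 50); freq.put("cart", 40);
            System.out.println(mostLikely("caat", lexicon, freq));   // prints "cat"
        }
    }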
Example
Misspelled word: acress
Candidates are listed with their probability of use, P(c), their probability of use within the surrounding context, and the combined score Context * P(c)
Which correction is most likely?
Word frequency percentage is not enough
We need P(typo|candidate) * P(candidate)
How likely is the particular error? For the misspelled word acress, the candidates arise from:
Deletion of a t after a c and before an r (actress)
Insertion of an a at the beginning (cress)
Transposition of a c and an a (caress)
Substitution of a c for an r (access)
Substitution of an o for an e (across)
Insertion of an s before the last s, or after the last s (acres)
We also consider the context of the word within a sentence or paragraph
Additional Issues to Consider
What if there is more than one error per word?
Possible Solution: Compute the minimum edit distance and allow more than one transformation
Some words in the lexicon may not appear in the frequency corpus that we use
Possible Solution: Smoothing algorithms
Is P(typo|candidate) accurate?
Possible Solution: Train the algorithm
Choice often depends on context
Possible Solution: Context-sensitive algorithms
Names and places likely do not exist in the lexicon
This is a difficult issue
Dynamic Programming
Definition: Nesting small decision problems inside larger decisions (Richard Bellman)
Two approaches:
1. Top Down: Start out with a method call to compute a final solution. The method proceeds to solve sub-problems of the same kind. Eventually the algorithm reaches and solves the "base case." The algorithm can avoid repetitive calculations by storing the answers to the sub-problems for later access.
2. Bottom Up: Start with the base cases, and work upwards, using previous results to reach the final answer.
The examples on the slides that follow use the bottom-up approach
Dynamic Programming Algorithms
Recursion is popular for solving 'divide and conquer' problems
Divide the problem into smaller problems of the same kind
Define the base case
Works from the original problem down to the base case
Dynamic programming is another approach that uses arrays
Initialize a variable or table
Use loops to fill in the entries of the table; the algorithm always fills in entries before they are needed
Works top-down or bottom-up towards the final solution
Avoids repetitive calculations because the technique stores results in a table for later use
Eliminates recursion activation record overhead
Implements recursive algorithms using arrays
Dynamic programming is a widely used algorithmic technique
Fibonacci Sequence
Bottom Up
    int fibo(int n) {
        if (n <= 1) return n;
        int fMinusTwo = 0;                        // F(i-2)
        int fMinusOne = 1;                        // F(i-1)
        int f = 1;
        for (int i = 2; i <= n; i++) {
            f = fMinusTwo + fMinusOne;            // F(i) = F(i-1) + F(i-2)
            fMinusTwo = fMinusOne;
            fMinusOne = f;
        }
        return f;
    }

Top Down
    int[] fibs = new int[MAX];                    // memo table
    fibs[0] = 0;
    fibs[1] = 1;
    int max = 1;                                  // highest index already computed

    int fibo(int n) {
        if (n <= max) return fibs[n];             // answer already stored in the table
        fibs[n] = fibo(n - 1) + fibo(n - 2);      // solve sub-problems and store the result
        max = n;
        return fibs[n];
    }

{0 1 1 2 3 5 8 13 21 …} where F(n) = F(n-1) + F(n-2)
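As a quick sanity check (a hypothetical driver; for the top-down version, MAX must be at least n+1 so the memo table can hold the result):

    System.out.println(fibo(10));    // expected: 55 (0 1 1 2 3 5 8 13 21 34 55)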
Example: Knapsack Problem
Values[v] contains the values of the N items
Weights[v] contains the weights of the N items
K = knapsack capacity
dynTable[v][w] contains the dynamic algorithm table
A thief enters a house with a knapsack of a particular size. Which items of different values does he choose to fill his knapsack?
KnapSack Algorithm
    int knapSack(int[] weights, int[] values, int K) {
        int N = values.length;
        int[][] dynTable = new int[N + 1][K + 1];
        for (int w = 0; w <= K; w++) dynTable[0][w] = 0;           // no items considered yet
        for (int v = 1; v <= N; v++) {
            for (int w = 0; w <= K; w++) {
                dynTable[v][w] = dynTable[v - 1][w];               // Copy up: skip item v
                if (weights[v - 1] <= w)                           // Try for a better choice: take item v
                    dynTable[v][w] = Math.max(dynTable[v][w],
                            values[v - 1] + dynTable[v - 1][w - weights[v - 1]]);
            }
        }
        return dynTable[N][K];
    }
Note: One row for each item to consider; one column for each integer weight from 0 to K (a sample call follows)
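As an illustration, calling the method with the weights and values from the example on the next slide should report a best value of 22 for a 15 kg knapsack (achieved with the items of weight 2, 5, and 8 kg):

    int[] weights = {2, 3, 4, 5, 8};
    int[] values  = {3, 2, 6, 10, 9};
    System.out.println(knapSack(weights, values, 15));    // expected: 22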
Knapsack Example
Weights: 2, 3, 4, 5, 8 (kg)    Values: 3, 2, 6, 10, 9
Goal: Maximum profit with a 15 kg knapsack?

Item (weight, value)        w: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
no items                       0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
item 1 (2 kg, value 3)         0  0  3  3  3  3  3  3  3  3  3  3  3  3  3  3
item 2 (3 kg, value 2)         0  0  3  3  3  5  5  5  5  5  5  5  5  5  5  5
item 3 (4 kg, value 6)         0  0  3  3  6  6  9  9  9 11 11 11 11 11 11 11
item 4 (5 kg, value 10)        0  0  3  3  6 10 10 13 13 16 16 19 19 19 21 21
item 5 (8 kg, value 9)         0  0  3  3  6 10 10 13 13 16 16 19 19 19 21 22

Each cell holds the best value that can be stuffed into a knapsack of that weight using the items considered so far; the answer, 22, is in the bottom-right cell
Example: Minimum Edit Distance
Problem: How can we measure how different one word is from another (e.g. for a spell checker)?
How many operations will transform one word into another?
Examples: caat --> cat, fplc --> fireplace
Definition: Levenshtein distance: the smallest number of insertion, deletion, or substitution operations to transform one string into another
Each insertion, deletion, or substitution is one operation, with a cost of 1
Requires a two-dimensional array
Rows: source word positions; Columns: spelled word positions
Cells: distance[r][c] is the distance (cost) up to that point
A useful dynamic programming algorithm
Pseudo Code (minDistance(target, source))
n = characters in source
m = characters in target
Create array, distance, with dimensions n+1, m+1
FOR r = 0 TO n: distance[r,0] = r
FOR c = 0 TO m: distance[0,c] = c
FOR each row r
    FOR each column c
        IF source[r] = target[c] cost = 0 ELSE cost = 1
        distance[r,c] = minimum of
            distance[r-1,c] + 1,      // insertion
            distance[r,c-1] + 1,      // deletion
            distance[r-1,c-1] + cost  // substitution
Result is in distance[n,m] (a Java version follows)
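A direct Java translation of this pseudocode might look like the following sketch (1-based table indexing, unit cost for every operation); minDistance("GUMBO", "GAMBOL") should return 2, matching the worked example that follows:

    static int minDistance(String target, String source) {
        int n = source.length();
        int m = target.length();
        int[][] distance = new int[n + 1][m + 1];
        for (int r = 0; r <= n; r++) distance[r][0] = r;           // deletions only
        for (int c = 0; c <= m; c++) distance[0][c] = c;           // insertions only
        for (int r = 1; r <= n; r++) {
            for (int c = 1; c <= m; c++) {
                int cost = (source.charAt(r - 1) == target.charAt(c - 1)) ? 0 : 1;
                distance[r][c] = Math.min(distance[r - 1][c] + 1,           // insertion
                                 Math.min(distance[r][c - 1] + 1,           // deletion
                                          distance[r - 1][c - 1] + cost));  // substitution
            }
        }
        return distance[n][m];
    }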
Example
Source: GAMBOL, Target: GUMBO
Algorithm Steps: after initializing row 0 and column 0, the table is filled in one column at a time (Columns 1 through 5). The completed table:

           G  U  M  B  O
        0  1  2  3  4  5
    G   1  0  1  2  3  4
    A   2  1  1  2  3  4
    M   3  2  2  1  2  3
    B   4  3  3  2  1  2
    O   5  4  4  3  2  1
    L   6  5  5  4  3  2

Result: Distance equals 2 (bottom-right cell)
Another Example
Source: INTENTION, Target: EXECUTION

           E  X  E  C  U  T  I  O  N
        0  1  2  3  4  5  6  7  8  9
    I   1  1  2  3  4  5  6  6  7  8
    N   2  2  2  3  4  5  6  7  7  7
    T   3  3  3  3  4  5  5  6  7  8
    E   4  3  4  3  4  5  6  6  7  8
    N   5  4  4  4  4  5  6  7  7  7
    T   6  5  5  5  5  5  5  6  7  8
    I   7  6  6  6  6  6  6  5  6  7
    O   8  7  7  7  7  7  7  6  5  6
    N   9  8  8  8  8  8  8  7  6  5

Result: Distance equals 5 (bottom-right cell)
Comparing Audio Frames
Patterns: Database of audio samples. Match to the ones that are closest
Templates: Database of features extracted from audio samples
Training: Use a training set to create a vector of patterns or templates
Distortion Measure: an algorithm to measure how far apart a pair of templates or patterns are
Will Minimum Edit Distance Work?
Maybe: Distances at higher array indices are functions of the costs at smaller indices, so a dynamic programming approach may work
Issues
The algorithm may be too slow
A binary equal-or-not-equal comparison does not work; a distance metric is needed
Speaking rates change, even with a single speaker
Do we compare the raw data or frame-based features?
Do we assign cost to adjacent cells or to those further away?
Other issues: phase, energy, or pitch misalignment; presence of noise; length of vowels; phoneme pronunciation variances; etc.
Incorrect comparisons occur when the algorithm isn't carefully designed
Dynamic Time Warping
Goal: Find the "best" alignment between pairs of audio frames from two utterances, A and B
The matrix, with the time (frame) index of A on one axis and the time (frame) index of B on the other, shows the optimal alignment path (warping) between the frames of utterance A and those of utterance B
Dynamic Time Warping (DTW) Overview
Computes the "distance" between two frames of speech
Measures frame-by-frame distances to compute dissimilarities between utterances
Allows the comparison to warp in time to account for differences in speaking rates
Requires a cost function to account for different paths through the frames of speech
Uses a dynamic programming algorithm to find the best warping
Computes a total "distortion score" for the best warped path
Assumptions
Constrain begin and end times to be (1,1) and (TA,TB)
Allow only monotonically increasing time
Don't allow too many frames to be skipped
Can express results in terms of "paths" with "slope weights"
Assumptions
Does not require that both patterns have the same length
One speech pattern is the "input" and the other speech pattern is the "template" to compare against
Divide the speech signal into equally-spaced frames (10-30 ms) with approximately 50% overlap and compute a frame-based feature set
The local distance measure (d) is the distance between features at a pair of frames (one from A, one from B)
The global distortion from the beginning of the utterance up to the current pair of frames is called G
Algorithm Efficiency
The algorithm complexity is O(m*n), where m and n are the respective number of frames in the two utterances. If m = n, the algorithm is O(n²). Why? Count the number of cells that need to be filled in.
O(n²) may be too slow, so alternate solutions have been devised:
Don't fill in all of the cells
Use a multi-level approach
Don’t Fill in all of the Cells
Disadvantage: The algorithm may miss the optimal path
The Multilevel Approach
Concept
1. Coarsen the array of features
2. Run the algorithm
3. Refine the array
4. Adjust the solution
5. Repeat steps 3-4 until the original array of features is restored
Notes
The multilevel approach is a common technique for reducing many algorithms' complexity from O(n²) to O(n lg n)
Example: partitioning a graph to balance work loads among threads or processors
Which Audio Features?
Cepstrals: They are statistically independent, and phase differences are removed
ΔCepstrals, or ΔΔCepstrals: Reflect how the signal is changing from one frame to the next
Energy: Distinguishes the frames that are voiced from those that are unvoiced
Normalized LPC Coefficients: Represent the shape of the vocal tract, normalized by vocal tract length for different speakers
These are some of the popular speech recognition features (a sketch of the Δcepstral computation follows)
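As a sketch of how the delta (Δ) cepstral features might be computed, a simple frame-to-frame difference is shown below; real systems often use a regression over several neighboring frames, and the cepstra array here is a hypothetical frames-by-coefficients matrix:

    // Delta cepstrals: change of each coefficient from the previous frame to the next
    static double[][] deltaCepstrals(double[][] cepstra) {
        int frames = cepstra.length;
        int coeffs = cepstra[0].length;
        double[][] delta = new double[frames][coeffs];
        for (int t = 0; t < frames; t++) {
            int prev = Math.max(t - 1, 0);             // clamp at the utterance boundaries
            int next = Math.min(t + 1, frames - 1);
            for (int i = 0; i < coeffs; i++)
                delta[t][i] = (cepstra[next][i] - cepstra[prev][i]) / 2.0;
        }
        return delta;
    }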
Distance Metric Requirements
Definition: Measure the similarity of two frames of speech
The vectors xt and yt contain the features from frames of the two signals
A distance measure should have the following properties:
    d(xt,yt) ≥ 0, and d(xt,yt) = 0 if and only if xt = yt   (positive definiteness)
    d(xt,yt) = d(yt,xt)   (symmetry)
    d(xt,yt) ≤ d(xt,zt) + d(zt,yt)   (triangle inequality)
A speech distance metric should correlate with perceived distance. Perceptually-warped spectral features work well in practice
Which Distance Metric?
General Formula: array[i,j] = distance(i,j) + min{array[i-1,j], array[i-1,j-1], array[i,j-1]}
Assumption: There is no cost assessed for duplicate or eliminated frames
Distance Formula:
    Euclidean: sum the squares of the differences between corresponding features
    Linear: sum the absolute values of the differences between corresponding features
Example of a distance metric using linear distance (a code sketch follows):
    ∑ wi |fa[i] – fb[i]|, where f[i] is a particular audio feature for signals a and b, and w[i] is that feature's weight
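A small sketch of these two local distance measures over a pair of frame feature vectors (fa and fb hold one frame's features from each signal; the per-feature weights w are a hypothetical input):

    // Weighted linear (city-block) distance between two feature vectors
    static double linearDistance(double[] fa, double[] fb, double[] w) {
        double sum = 0.0;
        for (int i = 0; i < fa.length; i++)
            sum += w[i] * Math.abs(fa[i] - fb[i]);
        return sum;
    }

    // Euclidean-style distance: sum of squared feature differences
    static double euclideanDistance(double[] fa, double[] fb) {
        double sum = 0.0;
        for (int i = 0; i < fa.length; i++)
            sum += (fa[i] - fb[i]) * (fa[i] - fb[i]);
        return sum;
    }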
Which Local Distance Measure?
Euclidean distance: d(xt,yt) = ∑f (xt(f) – yt(f))²
Mahalanobis distance: d(xt,yt) = (xt – yt)ᵀ Σ⁻¹ (xt – yt), where Σ is the covariance matrix of the features
where xt(f) and yt(f) are frame frequency values at time t, and f is a feature index
Other distance measures: Itakura-Saito distortion, COSH, likelihood ratio, etc.
Note: we can weight the features by multiplying differences by weighting factors to emphasize or deemphasize certain features
Dynamic Time Warping Termination Step
Divide the total computed cost by a normalizing factor
The normalizing factor is necessary to compare results between the input speech and the various templates to which it is compared
One quick and effective normalizing method divides by the number of frames in the template. Another method divides the result by the length of the path taken, where we adjust the length by the slope weights at each transition. This requires backtracking to sum the slope values, but can sometimes be more accurate.
Frame transition cost heuristics
Paths P and slope weights m are determined heuristically
Paths are considered backward from the target frame
Larger weight values are assigned to less preferable paths
Optimal paths always go up and to the right (monotonically forward in time)
Only evaluate a path P if all of its frames have meaningful values (e.g. do not evaluate a path that requires a frame before time 1, because there is no data for it)
Heuristic 1: predecessor steps P1=(1,0), P2=(1,1), P3=(1,2)
Heuristic 2: predecessor steps P1=(1,1)(1,0), P2=(1,1), P3=(1,1)(0,1), with slope weights of 1 for P2 and ½ for each step of P1 and P3
Dynamic Time Warping (DTW) Algorithm
1. Initialization (time 1 is the first time frame): D(1,1) = d(1,1)
2. Recursion: D(tx,ty) = min over the allowed predecessor frames (tx',ty') of [ D(tx',ty') + ζ((tx',ty'),(tx,ty)) ], where ζ (zeta) is a function of the previous distances and slope weights
3. Termination: distortion = D(Tx,Ty) / M, where M is a normalizing factor; M is sometimes defined as Tx, or Tx+Ty, or (Tx² + Ty²)^½. A convenient value for M is the length of the template. (A code sketch of the basic recursion follows.)
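A minimal sketch of this recursion in Java, assuming the simple predecessor set {(1,0), (1,1), (0,1)} with no slope weights, a local distance dist() such as those shown earlier, and normalization by the template length; it illustrates the basic algorithm rather than any particular heuristic from these slides:

    // Basic DTW: global distortion between frame sequences a (input) and b (template)
    static double dtw(double[][] a, double[][] b) {
        int Ta = a.length, Tb = b.length;
        double[][] D = new double[Ta][Tb];
        D[0][0] = dist(a[0], b[0]);                                // initialization: D(1,1) = d(1,1)
        for (int i = 1; i < Ta; i++) D[i][0] = D[i - 1][0] + dist(a[i], b[0]);
        for (int j = 1; j < Tb; j++) D[0][j] = D[0][j - 1] + dist(a[0], b[j]);
        for (int i = 1; i < Ta; i++) {
            for (int j = 1; j < Tb; j++) {
                double best = Math.min(D[i - 1][j - 1],            // diagonal predecessor
                              Math.min(D[i - 1][j], D[i][j - 1])); // horizontal/vertical predecessors
                D[i][j] = best + dist(a[i], b[j]);                 // recursion step
            }
        }
        return D[Ta - 1][Tb - 1] / Tb;                             // termination: normalize by template length
    }

    // Local distance between two feature vectors (squared Euclidean)
    static double dist(double[] x, double[] y) {
        double sum = 0.0;
        for (int f = 0; f < x.length; f++) sum += (x[f] - y[f]) * (x[f] - y[f]);
        return sum;
    }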
Dynamic Time Warping (DTW) Example
Heuristic paths: P1=(1,0), P2=(1,1), P3=(1,2), each with slope weight 1; begin at (1,1), end at (7,6)
[Local distance grid and the step-by-step accumulated D(i,j) table from the slide are omitted here]
Normalized distortion = 8/6 = 1.333, or 8/7 = 1.14, depending on the normalizing factor used
Dynamic Time Warping (DTW) Example
Heuristic paths: P1=(1,0), P2=(1,1), P3=(0,1); begin at (1,1), end at (6,6)
D(1,1) = 1    D(2,1) = 3    D(3,1) = 6    D(4,1) = 9   …
D(1,2) = 3    D(2,2) = 2    D(3,2) = 10   D(4,2) = 7   …
D(1,3) = 5    D(2,3) = 10   D(3,3) = 11   D(4,3) = 9   …
D(1,4) = 7    D(2,4) = 7    D(3,4) = 9    D(4,4) = 10  …
D(1,5) = 10   D(2,5) = 9    D(3,5) = 10   D(4,5) = 10  …
D(1,6) = 13   D(2,6) = 11   D(3,6) = 12   D(4,6) = 12  …
[Local distance grid omitted]
Normalized distortion = 13/6 = 2.17
Dynamic Time Warping (DTW) Example
Heuristic paths: P1=(1,1)(1,0), P2=(1,1), P3=(1,1)(0,1), with slope weights of 1 for P2 and ½ for each step of P1 and P3; begin at (1,1), end at (6,6)
[Local distance grid omitted; the D(i,j) values on the slide are left blank to be filled in]
Singularities
Assumption
The minimum distance comparing two signals only depends on the previous adjacent entries
The cost function accounts for the varied length of a particular phoneme, which causes the cost at particular array indices to no longer be well-defined
Problem: The algorithm can compute incorrectly due to mismatched alignments
Possible solutions:
Compare based on the change of feature values between windows instead of the values themselves
Pre-process to eliminate the causes of the mismatches
Possible Preprocessing
Normalize the energy of voiced audio: Compute the energy of both signals and multiply the larger by the percentage difference (a sketch follows)
Brick-wall normalize the peaks and valleys: Find the average peak and valley values, and set values larger than the average equal to the average
Normalize the pitch: Use PSOLA to align the pitch of the two signals
Remove duplicate frames: Auto-correlate frames at pitch points
Implement a noise removal algorithm
Normalize the speaking rate
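A rough sketch of the first step (the helper names are hypothetical; it scales the higher-energy signal by the square root of the energy ratio so that both signals end up with roughly equal average energy):

    // Scale the higher-energy signal so both signals have approximately equal average energy
    static void normalizeEnergy(double[] a, double[] b) {
        double ea = averageEnergy(a), eb = averageEnergy(b);
        if (ea > eb)      scale(a, Math.sqrt(eb / ea));
        else if (eb > ea) scale(b, Math.sqrt(ea / eb));
    }

    static double averageEnergy(double[] s) {
        double sum = 0.0;
        for (double x : s) sum += x * x;        // energy = mean of squared samples
        return sum / s.length;
    }

    static void scale(double[] s, double factor) {
        for (int i = 0; i < s.length; i++) s[i] *= factor;
    }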
Dynamic Time Warping / Hidden Markov Models
Dynamic Time Warping (DTW)
Compares speech with a number of templates
The algorithm selects the template with the lowest normalized distortion to determine the recognized word
Hidden Markov Models (HMMs)
Refine the DTW technology
HMMs compare speech against "probabilistic templates"
HMMs compute the most likely paths using probabilities
Phoneme Marking
Goal: Mark the start and end of phoneme boundaries
Research
Unsupervised, text (language) independent algorithms have been proposed
Accuracy: 75% to 80%, which is 5-10% lower than supervised algorithms that make assumptions about the language
If successful, a database of phonemes can be used in conjunction with dynamic time warping to simplify the speech recognition problem