Recognizer Issues
lindy-dunigan, uploaded 2016-10-25

Presentation Transcript

Slide 1: Recognizer Issues

Problem: Our recognizer translates the audio to a possible string of text. How do we know the translation is correct?

Problem: How do we handle a string of text containing words that are not in the dictionary?

Problem: How do we handle strings of valid words that do not form sentences whose semantics make sense?

Slide 2: Correcting Recognizer Ambiguities

Problem: Resolving words not in the dictionary.

Question: How different is a recognized word from those that are in the dictionary?

Solution: Count the single-step transformations necessary to convert one word into another.

Example: caat -> cat with removal of one letter.

Example: flpc -> fireplace requires adding the letters "ire" after the f, an "a" before the c, and an "e" at the end.

Slide 3: Spelling Error Types

Levenshtein distance: the smallest number of insertion, deletion, or substitution operations that transforms one string into another.

Examples of differences from the word "cat":
- Insertion: catt
- Deletion: ct
- Substitution: car
- Transposition: cta

Note: There are similar well-defined rules for pronunciation variations.

Slide 4: Spell Check Algorithm

FOR unrecognized words or those out of context:
  Generate a list of candidates:
    - Those that differ by a one-step transformation
    - Those that exist in the lexicon
  Order the possibilities using language-based statistics
RETURN the most likely word

Note: We could extend the algorithm to consider multiple transformations when we cannot find a single-step solution.
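The candidate-generation step above can be sketched as follows. This is a minimal illustration, not the slide's implementation: the lexicon and frequency counts are made up, and word frequency stands in for the "language-based statistics".

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one insertion, deletion, or substitution away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    substitutions = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + substitutions + inserts)

def candidates(word, lexicon, frequency):
    """Lexicon words one transformation away, ordered by corpus frequency."""
    in_lexicon = edits1(word) & lexicon
    return sorted(in_lexicon, key=lambda w: -frequency.get(w, 0))

# Illustrative lexicon and counts (hypothetical numbers)
lexicon = {"cat", "coat", "cart", "cast"}
frequency = {"cat": 900, "coat": 40, "cart": 30, "cast": 20}
print(candidates("caat", lexicon, frequency))  # -> ['cat', 'coat', 'cart', 'cast']
```

The RETURN step would simply take the first element of the ordered list.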

Slide 5: Example

Misspelled word: acress

Candidates are shown with their probabilities of use and of use within context (columns: candidate, P(c), context, context * P(c)); the candidate table itself is not recoverable from this transcript.

Slide 6: Which correction is most likely?

Word frequency percentage is not enough; we need p(typo|candidate) * p(candidate).

How likely is the particular error?
- Deletion of a t after a c and before an r
- Insertion of an a at the beginning
- Transposing a c and an a
- Substituting a c for an r
- Substituting an o for an e
- Inserting an s before or after the last s
- Context of the word within a sentence or paragraph

Misspelled word: acress
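The noisy-channel scoring p(typo|candidate) * p(candidate) can be sketched as below. The probabilities are hypothetical numbers for illustration only; in practice P(c) comes from a frequency corpus and P(typo|c) from an error model such as confusion matrices.

```python
# Hypothetical probabilities (not from the slide's table)
p_candidate = {"across": 0.00041, "actress": 0.000021, "acres": 0.000081}  # P(c)
p_typo_given = {"across": 0.000093, "actress": 0.000117, "acres": 0.0000342}  # P(typo|c)

def best_correction(cands):
    """Pick the candidate maximizing P(typo|c) * P(c)."""
    return max(cands, key=lambda c: p_typo_given[c] * p_candidate[c])

print(best_correction(p_candidate))  # -> across
```

With these numbers, the common word "across" wins even though "actress" has the higher per-error probability, which is exactly why frequency alone is not enough.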

Slide 7: Additional Issues to Consider

What if there is more than one error per word?
Possible solution: Compute minimum edit distance and allow more than one transformation.

Some words in the lexicon may not appear in the frequency corpus that we use.
Possible solution: Smoothing algorithms.

Is P(typo|candidate) accurate?
Possible solution: Train the algorithm.

The choice often depends on context.
Possible solution: Context-sensitive algorithms.

Names and places likely do not exist in the lexicon. This is a difficult issue.
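One common smoothing approach for lexicon words missing from the frequency corpus is add-k (Laplace) smoothing. A minimal sketch, with illustrative counts (the corpus and vocabulary size are made up):

```python
def smoothed_prob(word, counts, vocab_size, k=1):
    """Add-k (Laplace) smoothing: unseen words get a small nonzero probability."""
    total = sum(counts.values())
    return (counts.get(word, 0) + k) / (total + k * vocab_size)

counts = {"cat": 9, "cart": 1}   # illustrative frequency corpus
V = 4                            # lexicon also contains "coat" and "cast"
print(smoothed_prob("cat", counts, V))   # (9+1)/(10+4)
print(smoothed_prob("coat", counts, V))  # (0+1)/(10+4): nonzero despite no count
```

Without smoothing, "coat" would get probability zero and could never be chosen as a correction, no matter how well it fits.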

Slide 8: Dynamic Programming

Definition: Nesting small decision problems inside larger decisions (Richard Bellman).

Two approaches:

1. Top Down: Start out with a method call to compute a final solution. The method proceeds to solve sub-problems of the same kind. Eventually the algorithm reaches and solves the "base case." The algorithm can avoid repetitive calculations by storing the answers to the sub-problems for later access.

2. Bottom Up: Start with the base cases and work upwards, using previous results to reach the final answer.

The examples on the slides that follow use the bottom-up approach.

Slide 9: Dynamic Programming Algorithms

Recursion is popular for solving "divide and conquer" problems:
- Divide the problem into smaller problems of the same kind
- Define the base case
- Works from the original problem down to the base case

Dynamic programming is another approach, one that uses arrays:
- Initialize a variable or table
- Use loops to fill in the table's entries; the algorithm always fills in entries before they are needed
- Works top-down or bottom-up towards the final solution
- Avoids repetitive calculations because the technique stores results in a table for later use
- Eliminates recursion activation-record overhead

Implementing recursive algorithms using arrays in this way, dynamic programming is a widely used algorithmic technique.

Slide 10: Fibonacci Sequence

Bottom Up:

int fibo(int n) {
    if (n <= 1) return n;
    int fMinusTwo = 0;
    int fMinusOne = 1;
    int f = 0;
    for (int i = 2; i <= n; i++) {
        f = fMinusTwo + fMinusOne;
        fMinusTwo = fMinusOne;
        fMinusOne = f;
    }
    return f;
}

Top Down (memoized):

int[] fibs = new int[MAX];
fibs[0] = 0;
fibs[1] = 1;
int max = 1;

int fibo(int n) {
    if (n <= max) return fibs[n];
    fibs[n] = fibo(n - 1) + fibo(n - 2);
    max = n;
    return fibs[n];
}

{0 1 1 2 3 5 8 13 21 ...} where F(n) = F(n-1) + F(n-2)

Slide 11: Example: Knapsack Problem

A thief enters a house with a knapsack of a particular size. Which items of different values does he choose to fill his knapsack?

values[v] contains the values of the N items
weights[v] contains the weights of the N items
K = knapsack capacity
dynTable[v][w] contains the dynamic algorithm table

Slide 12: KnapSack Algorithm

int knapSack(int[] weights, int[] values, int K) {
    int N = weights.length;
    int[][] dynTable = new int[N + 1][K + 1];
    for (int w = 0; w <= K; w++) dynTable[0][w] = 0;
    for (int v = 1; v <= N; v++) {
        for (int w = 0; w <= K; w++) {
            dynTable[v][w] = dynTable[v - 1][w];   // Copy up
            if (weights[v - 1] <= w)               // Try for a better choice
                dynTable[v][w] = Math.max(dynTable[v][w],
                    values[v - 1] + dynTable[v - 1][w - weights[v - 1]]);
        }
    }
    return dynTable[N][K];
}

Note: one row for each item to consider; one column for each integer weight.
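The same bottom-up table can be written as a short runnable sketch. The weights {2, 3, 4, 5, 8}, values {3, 2, 6, 10, 9}, and 15 kg capacity are taken, as best they can be recovered, from the worked example on the next slide.

```python
def knapsack(weights, values, capacity):
    """Bottom-up 0/1 knapsack: one row per item, one column per integer capacity."""
    n = len(weights)
    table = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for w in range(capacity + 1):
            table[i][w] = table[i - 1][w]          # skip item i ("copy up")
            if weights[i - 1] <= w:                # or take item i, if it fits
                table[i][w] = max(table[i][w],
                                  values[i - 1] + table[i - 1][w - weights[i - 1]])
    return table[n][capacity]

print(knapsack([2, 3, 4, 5, 8], [3, 2, 6, 10, 9], 15))  # -> 22
```

The best choice here is the items of weight 2, 5, and 8 (values 3 + 10 + 9 = 22), which exactly fills the 15 kg knapsack.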

Slide 13: Knapsack Example

Goal: Maximum profit with a 15 kg knapsack.
Weights: 2, 3, 4, 5, 8
Values: 3, 2, 6, 10, 9

The slide shows the filled-in dynamic programming table: one row per item (plus a zero row), one column per integer weight 0 through 15. Array cells represent the value stuffed into a knapsack of that weight; the table's entries are not cleanly recoverable from this transcript.

Slide 14: Example: Minimum Edit Distance

Problem: How can we measure how different one word is from another word (i.e., a spell checker)?

How many operations will transform one word into another?
Examples: caat --> cat, flpc --> fireplace

Definition: Levenshtein distance: the smallest number of insertion, deletion, or substitution operations to transform one string into another. Each insertion, deletion, or substitution is one operation, with a cost of 1.

The algorithm requires a two-dimensional array:
- Rows: source word positions; columns: spelled word positions
- Cells: distance[r][c] is the distance (cost) up to that point

A useful dynamic programming algorithm.

Slide 15: Pseudo Code (minDistance(target, source))

n = characters in source
m = characters in target

Create array, distance, with dimensions n+1, m+1
FOR r = 0 TO n: distance[r, 0] = r
FOR c = 0 TO m: distance[0, c] = c
FOR each row r
    FOR each column c
        IF source[r] = target[c] cost = 0 ELSE cost = 1
        distance[r, c] = minimum of
            distance[r-1, c] + 1,      // insertion
            distance[r, c-1] + 1,      // deletion
            distance[r-1, c-1] + cost  // substitution
Result is in distance[n, m]
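The pseudocode above translates directly into a runnable sketch:

```python
def min_edit_distance(source, target):
    """Levenshtein distance via the dynamic programming table in the pseudocode."""
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for r in range(n + 1):       # column 0: r deletions to reach empty target
        dist[r][0] = r
    for c in range(m + 1):       # row 0: c insertions from an empty source
        dist[0][c] = c
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            cost = 0 if source[r - 1] == target[c - 1] else 1
            dist[r][c] = min(dist[r - 1][c] + 1,         # insertion
                             dist[r][c - 1] + 1,         # deletion
                             dist[r - 1][c - 1] + cost)  # substitution
    return dist[n][m]

print(min_edit_distance("GAMBOL", "GUMBO"))       # -> 2
print(min_edit_distance("INTENTION", "EXECUTION"))  # -> 5
```

GAMBOL to GUMBO needs one substitution (A to U) and one deletion (L), matching the worked example on the slides that follow.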

Slide 16: Example

Source: GAMBOL, Target: GUMBO
Algorithm Step: Initialization

       G  U  M  B  O
    0  1  2  3  4  5
 G  1
 A  2
 M  3
 B  4
 O  5
 L  6

Slide 17: Example

Source: GAMBOL, Target: GUMBO
Algorithm Step: Column 1

       G  U  M  B  O
    0  1  2  3  4  5
 G  1  0
 A  2  1
 M  3  2
 B  4  3
 O  5  4
 L  6  5

Slide 18: Example

Source: GAMBOL, Target: GUMBO
Algorithm Step: Column 2

       G  U  M  B  O
    0  1  2  3  4  5
 G  1  0  1
 A  2  1  1
 M  3  2  2
 B  4  3  3
 O  5  4  4
 L  6  5  5

Slide 19: Example

Source: GAMBOL, Target: GUMBO
Algorithm Step: Column 3

       G  U  M  B  O
    0  1  2  3  4  5
 G  1  0  1  2
 A  2  1  1  2
 M  3  2  2  1
 B  4  3  3  2
 O  5  4  4  3
 L  6  5  5  4

Slide 20: Example

Source: GAMBOL, Target: GUMBO
Algorithm Step: Column 4

       G  U  M  B  O
    0  1  2  3  4  5
 G  1  0  1  2  3
 A  2  1  1  2  3
 M  3  2  2  1  2
 B  4  3  3  2  1
 O  5  4  4  3  2
 L  6  5  5  4  3

Slide 21: Example

Source: GAMBOL, Target: GUMBO
Algorithm Step: Column 5
Result: Distance equals 2

       G  U  M  B  O
    0  1  2  3  4  5
 G  1  0  1  2  3  4
 A  2  1  1  2  3  4
 M  3  2  2  1  2  3
 B  4  3  3  2  1  2
 O  5  4  4  3  2  1
 L  6  5  5  4  3  2

Slide 22: Another Example

Source: INTENTION, Target: EXECUTION

       E  X  E  C  U  T  I  O  N
    0  1  2  3  4  5  6  7  8  9
 I  1  1  2  3  4  5  6  6  7  8
 N  2  2  2  3  4  5  6  7  7  7
 T  3  3  3  3  4  5  5  6  7  8
 E  4  3  4  3  4  5  6  6  7  8
 N  5  4  4  4  4  5  6  7  7  7
 T  6  5  5  5  5  5  5  6  7  8
 I  7  6  6  6  6  6  6  5  6  7
 O  8  7  7  7  7  7  7  6  5  6
 N  9  8  8  8  8  8  8  7  6  5

Result: Distance equals 5.

Slide 23: Comparing Audio Frames

Patterns: A database of audio samples. Match the input to the ones that are closest.

Templates: A database of features extracted from audio samples.

Training: Use a training set to create a vector of patterns or templates.

Distortion Measure: An algorithm to measure how far apart a pair of templates or patterns are.

Slide 24: Will Minimum Edit Distance Work?

Maybe: Distances to higher array indices are functions of the costs at smaller indices, so a dynamic programming approach may work.

Issues:
- The algorithm may be too slow
- A binary equal-or-not-equal comparison does not work; a distance metric is needed
- Speaking rates change, even with a single speaker
- Do we compare the raw data or frame-based features?
- Do we assign cost to adjacent cells or to those further away?

Other issues: phase, energy, and pitch misalignment; presence of noise; length of vowels; phoneme pronunciation variances; etc.

Incorrect comparisons occur when the algorithm isn't carefully designed.

Slide 25: Dynamic Time Warping

Goal: Find the "best" alignment between pairs of audio frames.

The slide shows a matrix with the time (frame) of utterance A on one axis and the time (frame) of utterance B on the other; the optimal alignment path (warping) between frames of A and frames of B runs through the matrix.

Slide 26: Dynamic Time Warping (DTW) Overview

- Computes the "distance" between two frames of speech
- Measures frame-by-frame distances to compute dissimilarities between speech signals
- Allows the comparison to warp to account for differences in speaking rates
- Requires a cost function to account for different paths through frames of speech
- Uses a dynamic programming algorithm to find the best warping
- Computes a total "distortion score" for the best warped path

Assumptions:
- Constrain the begin and end times to be (1,1) and (TA, TB)
- Allow only monotonically increasing time
- Don't allow too many frames to be skipped
- Can express results in terms of "paths" with "slope weights"

Slide 27: Assumptions

- Does not require that both patterns have the same length
- One speech pattern is the "input" and the other speech pattern is the "template" to compare against
- Divide the speech signal into equally-spaced frames (10-30 ms) with approximately 50% overlap, and compute a frame-based feature set
- The local distance measure (d) is the distance between features at a pair of frames (one from A, one from B)
- The global distortion from the beginning of the utterance until the current pair of frames is called G

Slide 28: Algorithm Efficiency

The algorithm complexity is O(m*n), where m and n are the respective numbers of frames in the two utterances. If m = n, the algorithm is O(n^2). Why? Count the number of cells that need to be filled in.

O(n^2) may be too slow, so alternate solutions have been devised:
- Don't fill in all of the cells
- Use a multi-level approach

Slide 29: Don't Fill in All of the Cells

Disadvantage: The algorithm may miss the optimal path.

Slide 30: The Multilevel Approach

Concept:
1. Coarsen the array of features
2. Run the algorithm
3. Refine the array
4. Adjust the solution
5. Repeat steps 3-4 until the original array of features is restored

Notes: The multilevel approach is a common technique for reducing many algorithms' complexity from O(n^2) to O(n lg n). Example: partitioning a graph to balance workloads among threads or processors.

Slide 31: Which Audio Features?

- Cepstrals: They are statistically independent, and phase differences are removed
- ΔCepstrals, or ΔΔCepstrals: Reflect how the signal is changing from one frame to the next
- Energy: Distinguishes the frames that are voiced from those that are unvoiced
- Normalized LPC Coefficients: Represent the shape of the vocal tract, normalized by vocal tract length, for different speakers

These are some of the popular speech recognition features.

Slide 32: Distance Metric Requirements

Definition: Measure the similarity of two frames of speech. The vectors xt and yt contain the features from frames of two signals.

A distance measure should have the following properties:
- 0 <= d(xt, yt) < infinity
- d(xt, yt) = 0 iff xt = yt (positive definiteness)
- d(xt, yt) = d(yt, xt) (symmetry)
- d(xt, yt) <= d(xt, zt) + d(zt, yt) (triangle inequality)

A speech distance metric should correlate with perceived distance; perceptually-warped spectral features work well in practice.

Slide 33: Which Distance Metric?

General formula: array[i, j] = distance(i, j) + min{array[i-1, j], array[i-1, j-1], array[i, j-1]}

Assumption: There is no cost assessed for duplicate or eliminated frames.

Distance formulas:
- Euclidean: sum the squares of the differences between features
- Linear: sum the absolute values of the distances between features

Example of a distance metric using linear distance: sum over i of w[i] * |fa[i] - fb[i]|, where f[i] is a particular audio feature for signals a and b, and w[i] is that feature's weight.
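The weighted linear distance just described can be sketched in a few lines. The frames and weights below are made-up stand-ins for real feature vectors (e.g., cepstral coefficients):

```python
def linear_distance(frame_a, frame_b, weights):
    """Weighted city-block distance between two frames' feature vectors:
    sum over i of w[i] * |fa[i] - fb[i]|."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, frame_a, frame_b))

frame_a = [1.0, 0.5, -0.2]   # three illustrative feature values
frame_b = [0.8, 0.1, 0.0]
weights = [1.0, 2.0, 1.0]    # the middle feature counts double
print(linear_distance(frame_a, frame_b, weights))  # about 1.2
```

This plays the role of distance(i, j) in the general formula above, with frame_a taken from utterance A at time i and frame_b from utterance B at time j.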

Slide 34: Which Local Distance Measure?

Let xt(f) and yt(f) be frame frequency values at time t, where f is a feature index.

Euclidean distance: d(xt, yt) = sum over f of (xt(f) - yt(f))^2

Mahalanobis distance: d(xt, yt) = (xt - yt)' C^-1 (xt - yt), where C is the feature covariance matrix

Other distance measures: Itakura-Saito distortion, COSH, likelihood ratio, etc.

Note: we can weight the features by multiplying differences by weighting factors to emphasize or deemphasize certain features.

Slide 35: Dynamic Time Warping Termination Step

Divide the total computed cost by a normalizing factor. The normalizing factor is necessary to compare results between the input speech and the various templates to which it is compared.

One quick and effective normalizing method divides by the number of frames in the template. Another method divides the result by the length of the path taken, where we adjust the length by the slope weights at each transition. This requires backtracking to sum the slope values, but can sometimes be more accurate.

Slide 36: Frame Transition Cost Heuristics

Path P and slope weight m are determined heuristically. Paths are considered backward from the target frame, with larger weight values for less preferable paths. Optimal paths always go up or to the right (monotonically forward in time). Only evaluate a path P if all of its frames have meaningful values (e.g., don't evaluate a path if one frame falls before time 1, because there is no data there).

Heuristic 1: predecessor paths P1 = (1,0), P2 = (1,1), P3 = (1,2).
Heuristic 2: predecessor paths P1 = (1,1)(1,0), P2 = (1,1), P3 = (1,1)(0,1), with slope weights of 1/2 on the two-step transitions and 1 on the diagonal.

Slide 37: Dynamic Time Warping (DTW) Algorithm

1. Initialization (time 1 is the first time frame): D(1,1) = d(1,1)

2. Recursion: D(i,j) = d(i,j) + min over the allowed predecessor frames of their accumulated distances, each weighted by the slope weight (zeta) of the transition, where zeta is a function of previous distances and slopes.

3. Termination: the distortion score is D(Tx,Ty) / M, where M is a normalizing factor. M is sometimes defined as Tx, or Tx+Ty, or (Tx^2 + Ty^2)^(1/2). A convenient value for M is the length of the template.
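The three steps can be sketched as runnable code. This is a minimal version using the simplest predecessor set (1,0), (1,1), (0,1) with unit slope weights and normalization by the template length; the one-dimensional "features" and absolute-difference local distance are illustrative stand-ins for real frame features.

```python
def dtw(signal, template, d):
    """Dynamic time warping with predecessors (1,0), (1,1), (0,1):
    G[i][j] = d(i,j) + min(G[i-1][j], G[i-1][j-1], G[i][j-1]).
    Returns the total distortion normalized by the template length."""
    INF = float("inf")
    Ta, Tb = len(signal), len(template)
    G = [[INF] * Tb for _ in range(Ta)]
    G[0][0] = d(signal[0], template[0])      # initialization: D(1,1) = d(1,1)
    for i in range(Ta):
        for j in range(Tb):
            if i == j == 0:
                continue
            best = min(G[i - 1][j] if i > 0 else INF,
                       G[i - 1][j - 1] if i > 0 and j > 0 else INF,
                       G[i][j - 1] if j > 0 else INF)
            G[i][j] = d(signal[i], template[j]) + best
    return G[Ta - 1][Tb - 1] / Tb            # termination: normalize by length

input_frames = [1, 2, 3, 3, 2]               # input with one stretched frame
template = [1, 2, 3, 2]
print(dtw(input_frames, template, lambda x, y: abs(x - y)))  # -> 0.0
```

The repeated 3 in the input is absorbed by the warping, so the distortion is zero even though the sequences have different lengths, which is exactly what a frame-by-frame comparison could not do.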

Slide 38: Dynamic Time Warping (DTW) Example

Heuristic paths: P1 = (1,0), P2 = (1,1), P3 = (1,2); begin at (1,1), end at (7,6). The slide fills in a grid of local distances d and cumulative distortions D(i,j); the grids are not cleanly recoverable from this transcript.

Normalized distortion for one variant: 8/6 = 1.33; for the other: 8/7 = 1.14.

Slide 39: Dynamic Time Warping (DTW) Example

Heuristic paths: P1 = (1,0), P2 = (1,1), P3 = (0,1); begin at (1,1), end at (6,6). (The grid of local distances is not cleanly recoverable from this transcript.)

D(1,1) = 1   D(2,1) = 3   D(3,1) = 6   D(4,1) = 9   ...
D(1,2) = 3   D(2,2) = 2   D(3,2) = 10  D(4,2) = 7   ...
D(1,3) = 5   D(2,3) = 10  D(3,3) = 11  D(4,3) = 9   ...
D(1,4) = 7   D(2,4) = 7   D(3,4) = 9   D(4,4) = 10  ...
D(1,5) = 10  D(2,5) = 9   D(3,5) = 10  D(4,5) = 10  ...
D(1,6) = 13  D(2,6) = 11  D(3,6) = 12  D(4,6) = 12  ...

Normalized distortion = 13/6 = 2.17

Slide 40: Dynamic Time Warping (DTW) Example

Heuristic paths: P1 = (1,1)(1,0), P2 = (1,1), P3 = (1,1)(0,1), with slope weights of 1/2 on the two-step transitions and 1 on the diagonal; begin at (1,1), end at (6,6). The slide provides a grid of local distances (not cleanly recoverable from this transcript) and leaves the cumulative distortions D(1,1) through D(6,6) to be filled in.

Slide 41: Singularities

Assumption: The minimum distance comparing two signals depends only on the previous adjacent entries. However, the varied length of a particular phoneme can cause the cost at particular array indices to no longer be well-defined.

Problem: The algorithm can compute incorrect results due to mismatched alignments.

Possible solutions:
- Compare based on the change of feature values between windows instead of the values themselves
- Pre-process to eliminate the causes of the mismatches

Slide 42: Possible Preprocessing

- Normalize the energy of voiced audio: Compute the energy of both signals and multiply the larger by the percentage difference
- Brick-wall normalize the peaks and valleys: Find the average peak and valley values; set values larger than the average equal to the average
- Normalize the pitch: Use PSOLA to align the pitch of the two signals
- Remove duplicate frames: Autocorrelate frames at pitch points
- Implement a noise removal algorithm
- Normalize the speaking rate
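One way the first preprocessing step could look: compute the RMS energy of each signal and rescale one to match the other. This is a sketch of the general idea, not the slide's exact procedure; the frames and sample values are made up.

```python
def rms_energy(frames):
    """Average RMS energy over a list of frames (each a list of samples)."""
    total = sum(s * s for f in frames for s in f)
    count = sum(len(f) for f in frames)
    return (total / count) ** 0.5

def normalize_energy(frames, target_rms):
    """Scale every sample so the signal's RMS energy matches target_rms."""
    scale = target_rms / rms_energy(frames)
    return [[s * scale for s in f] for f in frames]

a = [[0.2, -0.2], [0.4, -0.4]]   # made-up frames of samples
b = [[0.1, -0.1], [0.2, -0.2]]   # same shape, half the amplitude
b_scaled = normalize_energy(b, rms_energy(a))
print(round(rms_energy(b_scaled), 6))  # matches rms_energy(a): 0.316228
```

After this step, a loudness difference between the two recordings can no longer masquerade as a feature difference in the DTW comparison.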

Slide 43: Dynamic Time Warping / Hidden Markov Models

Dynamic Time Warping (DTW): Compares speech with a number of templates; the algorithm selects the template with the lowest normalized distortion to determine the recognized word.

Hidden Markov Models (HMMs): Refine the DTW technology. HMMs compare speech against "probabilistic templates" and compute the most likely paths using probabilities.

Slide 44: Phoneme Marking

Goal: Mark the start and end of phoneme boundaries.

Research: Unsupervised, text (language) independent algorithms have been proposed. Accuracy: 75% to 80%, which is 5-10% lower than supervised algorithms that make assumptions about the language.

If successful, a database of phonemes could be used in conjunction with dynamic time warping to simplify the speech recognition problem.