
Slide1

Paul De Palma
Ph.D. Candidate
Department of Linguistics
University of New Mexico
Slides available at: www.cs.gonzaga.edu/depalma

Syllables and Concepts in Large Vocabulary Continuous Speech Recognition

Slide2

An Engineered Artifact

Syllables
Principled word segmentation scheme
No claim about human syllabification

Concepts

Words and phrases with similar meanings

No claim about cognition

Slide3

Reducing the Search Space

ASR answers the question:
What is the most likely sequence of words given an acoustic signal?
Considers many candidate word sequences

To Reduce the Search Space

Reduce number of candidates

Using Syllables in the Language Model

Using Concepts in a Concept Component

Slide4

Syllables in LM: Why?

[Figure: Cumulative Frequency as a Function of Frequency Rank, Switchboard corpus (Greenberg, 1999, p. 167)]

Slide5

Most Frequent Words are Monosyllabic

Number of Syllables per Word    % of Corpus by Token    % of Corpus by Type
1                               81.04                   22.39
2                               14.30                   39.76
3                                3.50                   24.26
4                                 .96                    9.91
5                                 .18                    3.21
6                                 .02                     .40

(Greenberg, 1999, p. 167)

Polysyllabic words are easier to recognize (Hamalainen et al., 2007)
And (of course) there are fewer syllables than words

Slide6

Reduce the Search Space 2: Concept Component

Word Map: GO               Syllable Map: GO
A flight                   ax f_l_ay_td
A ticket                   ax t_ih k_ax_td
book airline travel        b_uh_kd eh_r l_ay_n t_r_ae v_ax_l
book reservations          b_uh_kd r_eh s_axr v_ey sh_ax_n_z
Create a reservation       k_r_iy ey_td ax r_eh z_axr v_ey sh_ax_n
Departing                  d_ax p_aa_r dx_ix_ng
Fly                        f_l_ay
Flying                     f_l_ay ix_ng
get                        g_eh_td
I am leaving               ay ae_m l_iy v_ix_ng

Slide7

The (Simplified) Architecture of an LVCSR System

Feature Extractor
Transforms an acoustic signal into a collection of 39-dimensional feature vectors (a front-end sketch appears at the end of this slide)
The province of digital signal processing

Acoustic Model
Collection of probabilities of acoustic observations given word sequences

Language Model
Collection of probabilities of word sequences

Decoder
Guesses a probable sequence of words given an acoustic signal by searching the product of the probabilities found in the acoustic and language models
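
For concreteness, a minimal sketch (not part of the slides) of one common front end: 13 MFCCs per frame plus their first and second derivatives, which yields the 39-dimensional vectors mentioned above. The use of librosa here is an assumption; the actual system may use a different signal-processing front end.

    import librosa
    import numpy as np

    def extract_features(wav_path):
        """Return a (frames x 39) matrix: 13 MFCCs plus deltas and delta-deltas."""
        signal, sr = librosa.load(wav_path, sr=None)              # keep native sample rate
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)   # shape (13, frames)
        delta = librosa.feature.delta(mfcc)                       # first derivative
        delta2 = librosa.feature.delta(mfcc, order=2)             # second derivative
        return np.vstack([mfcc, delta, delta2]).T                 # shape (frames, 39)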

Slide8

Simplified Schematic

[Schematic: signal → Feature Extractor → Decoder → Words, with the Decoder drawing on the Acoustic Model and the Language Model]

Slide9

Enhanced Recognizer

[Schematic: signal → Feature Extractor → Decoder → Syllables, Concepts]
Feature Extractor, Acoustic Model P(O|S), and Decoder: assumed
Syllable Language Model P(S) and Concept Component: my work

Slide10

How ASR Works

Input is a sequence of acoustic observations: O = o1, o2, …, ot
Output is a string of words: W = w1, w2, …, wn

Then:

(1) "The hypothesized word sequence is that string W in the target language with the greatest probability given a sequence of acoustic observations."

Slide11

Operationalizing Equation 1

(1)  The hypothesized word sequence is that string W in the target language with the greatest probability given the acoustic observations; formally:

(2)  W* = argmax over W in the language of p(W | O)

Using Bayes' Rule:

(3)  W* = argmax over W of p(O | W) * p(W) / p(O)

Since the acoustic signal is the same for each candidate, (3) can be rewritten:

(4)  W* = argmax over W of p(O | W) * p(W)

Acoustic Model: p(O | W), the likelihood of the observations given the words
Language Model: p(W), the prior probability
Decoder: searches for the W that maximizes the product
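
A minimal sketch of equation (4) as a decoder would apply it, assuming the acoustic and language models are available as functions returning log probabilities (the names below are placeholders, not the recognizer's actual API):

    import math

    def decode(observations, candidates, acoustic_logprob, language_logprob):
        """Equation (4): return the candidate W maximizing p(O|W) * p(W), in log space."""
        best, best_score = None, -math.inf
        for words in candidates:
            score = acoustic_logprob(observations, words) + language_logprob(words)
            if score > best_score:
                best, best_score = words, score
        return best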

Slide12

LM: A Closer Look

A collection of probabilities of word sequences:

(5)  p(W) = p(w1 … wn)

which can be decomposed by the probability chain rule:

(6)  p(w1 … wn) = p(w1) * p(w2 | w1) * p(w3 | w1 w2) * … * p(wn | w1 … wn-1)

Slide13

Markov Assumption

Approximate the full decomposition of (6) by looking only a specified number of words into the past:

Bigram: 1 word into the past
Trigram: 2 words into the past
N-gram: n-1 words into the past

Slide14

Bigram Language Model

Def. Bigram Probability:

(7)  p(wn | wn-1) = count(wn-1 wn) / count(wn-1)

Mini-corpus:

<s> paul wrote his thesis </s>
<s> james wrote a different thesis </s>
<s> paul wrote a thesis suggested by george </s>
<s> the thesis </s>
<s> jane wrote the poem </s>

e.g., p(paul | <s>) = count(<s> paul) / count(<s>) = 2/5

P(paul wrote a thesis) = p(paul|<s>) * p(wrote|paul) * p(a|wrote) * p(thesis|a) * p(</s>|thesis) = .075
P(paul wrote the thesis) = p(paul|<s>) * p(wrote|paul) * p(the|wrote) * p(thesis|the) * p(</s>|thesis) = .0375
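
The mini-corpus numbers can be reproduced with a few lines of Python; this is an illustrative sketch, not the thesis code:

    from collections import Counter

    corpus = [
        "<s> paul wrote his thesis </s>",
        "<s> james wrote a different thesis </s>",
        "<s> paul wrote a thesis suggested by george </s>",
        "<s> the thesis </s>",
        "<s> jane wrote the poem </s>",
    ]

    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        tokens = line.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    def bigram_prob(prev, word):
        # Equation (7): p(w_n | w_n-1) = count(w_n-1 w_n) / count(w_n-1)
        return bigrams[(prev, word)] / unigrams[prev]

    def sentence_prob(sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        p = 1.0
        for prev, word in zip(tokens, tokens[1:]):
            p *= bigram_prob(prev, word)
        return p

    print(sentence_prob("paul wrote a thesis"))    # ≈ .075
    print(sentence_prob("paul wrote the thesis"))  # ≈ .0375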

Slide15

Experiment 1: Perplexity

Perplexity: PP(X)
Functionally related to entropy: H(X)
Entropy is a measure of information

Hypothesis: PP(X) of the syllable LM < PP(X) of the word LM
The syllable LM contains more information

Slide16

Definitions

Let X be a random variable and p(x) be its probability function.

Defs:

(1)  H(X) = -∑ over x ∈ X of p(x) * lg(p(x))
(2)  PP(X) = 2^H(X)

Given certain assumptions¹ and the def. of H(X), PP(X) can be transformed to:

PP(X) = p(w1 … wn)^(-1/n)

Perplexity is the inverse nth root of the probability of a word sequence.

1. X is an ergodic and stationary process; n is arbitrarily large.
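
A short sketch of the equivalence between the two forms, using the bigram factors of "paul wrote a thesis" from the earlier example (illustrative only):

    import math

    def perplexity(word_probs):
        """word_probs: the per-word (here, bigram) probabilities of a sequence."""
        n = len(word_probs)
        h = -sum(math.log2(p) for p in word_probs) / n    # per-word entropy estimate
        pp_from_entropy = 2 ** h                          # PP(X) = 2^H(X)
        pp_from_root = math.prod(word_probs) ** (-1 / n)  # inverse nth root of p(w1 ... wn)
        assert math.isclose(pp_from_entropy, pp_from_root)
        return pp_from_root

    print(perplexity([0.4, 1.0, 0.5, 0.5, 0.75]))   # ≈ 1.68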

Slide17

Entropy As Information

Suppose the letters of a Polynesian alphabet are distributed as follows¹:

p    t    k    a    i    u
1/8  1/4  1/8  1/4  1/8  1/8

Calculate the per-letter entropy:

H(P) = -∑ over i ∈ {p,t,k,a,i,u} of p(i) * lg(p(i)) = 2 1/2 bits

2.5 bits on average are required to encode a letter (p: 100, t: 00, etc.)

1. Manning, C., Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.
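
The 2.5-bit figure is easy to check directly (an illustrative sketch):

    import math

    letter_probs = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}

    H = -sum(p * math.log2(p) for p in letter_probs.values())
    print(H)       # 2.5 bits per letter on average
    print(2 ** H)  # perplexity ≈ 5.66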

Slide18

Reducing the Entropy

Suppose:
This language consists entirely of CV syllables
We know their distribution

We can compute the conditional entropy of syllables in the language:
H(V|C), where V ∈ {a, i, u} and C ∈ {p, t, k}
H(V|C) = 2.44 bits

Entropy for two letters, letter model: 5 bits

Conclusion: The syllable model contains more information than the letter model

Slide19

Perplexity As Weighted Average Branching Factor

Suppose: letters in the alphabet occur with equal frequency
At every fork we have 26 choices

Slide20

Reducing the Branching Factor

Suppose 'E' occurs 75 times more frequently than any other letter
Let p(any other letter) = x
Then 75x + 25x = 1, since there are 25 such letters, so x = .01
Since any letter wi is either E or one of the other 25 letters, p(wi) is either .75 (for E) or .01

Still 26 choices at each fork
'E' is 75 times more likely than any other choice
Perplexity is reduced
The model contains more information
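
An illustrative check, not taken from the slides: computing the perplexity of the uniform 26-letter alphabet and of the 'E'-heavy distribution just described shows the drop in the effective branching factor.

    import math

    def perplexity(dist):
        # PP(X) = 2^H(X), with H(X) = -sum of p(x) * lg(p(x))
        return 2 ** -sum(p * math.log2(p) for p in dist)

    uniform = [1 / 26] * 26            # every letter equally likely
    weighted = [0.75] + [0.01] * 25    # p(E) = 75x, p(other) = x, 100x = 1

    print(perplexity(uniform))   # ≈ 26  -- 26 equally weighted choices at each fork
    print(perplexity(weighted))  # ≈ 3.9 -- far fewer effective choices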

Slide21

Perplexity Experiment

Reduced perplexity in a language model is used as an indicator that an experiment with real data might be fruitful

Technique (for both the syllable and the word corpora; a sketch of the loop follows this list):
1. Randomly choose 10% of the utterances from a corpus as a test set
2. Generate a language model from the remaining 90%
3. Compute the perplexity of the test set given the language model
4. Compute the mean over twenty runs of step 3
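
A minimal sketch of that loop, with train_lm and perplexity_of as stand-ins for whatever LM toolkit is used (this is not the code actually used in the experiments):

    import random
    import statistics

    def perplexity_experiment(utterances, train_lm, perplexity_of, runs=20, held_out=0.10):
        """Mean test-set perplexity over repeated 90/10 splits of a corpus."""
        results = []
        for _ in range(runs):
            shuffled = random.sample(utterances, len(utterances))
            cut = int(len(shuffled) * held_out)
            test, train = shuffled[:cut], shuffled[cut:]   # step 1: 10% test, 90% training
            lm = train_lm(train)                           # step 2: build the language model
            results.append(perplexity_of(lm, test))        # step 3: perplexity of the test set
        return statistics.mean(results)                    # step 4: mean over the runs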

Slide22

The Corpora

Air Travel Information System (Hemphill, et al., 2009)
Word types: 1604    Word tokens: 219,009
Syllable types: 1314    Syllable tokens: 317,578

Transcript of simulated human-computer speech (NextIt, 2008)
Word types: 482    Word tokens: 5,782
Syllable types: 537 (this will have repercussions in Exp. 2)
Syllable tokens: 8,587

Slide23

Results

                         Bigrams    Trigrams
Mean Words, NextIt       39.44      31.44
Mean Syllables, NextIt   35.96      22.26
Mean Words, ATIS         41.35      31.43
Mean Syllables, ATIS     21.15      14.74

Notice the drop in perplexity from words to syllables.
A perplexity of 14.74 for the trigram syllable ATIS model means that at every turn there are less than half as many choices as for the trigram word ATIS model.

Slide24

Experiment 2: Syllables in the Language Model

Hypothesis: A syllable language model will perform better than a word-based language model
By what measure?

Slide25

Symbol Error Rate

SER = (100 * (I + S + D)) / T

where:
I is the number of insertions
S is the number of substitutions
D is the number of deletions
T is the total number of symbols

e.g., SER = 100 * (2 + 1 + 1) / 5 = 80¹

1. Alignment performed by a dynamic programming algorithm: Minimum Edit Distance
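
A sketch of the underlying calculation: the standard minimum-edit-distance dynamic program counts the fewest insertions, substitutions, and deletions needed to turn the reference into the hypothesis (illustrative, not the scoring tool actually used):

    def symbol_error_rate(reference, hypothesis):
        """SER = 100 * (I + S + D) / T, with T the number of reference symbols."""
        n, m = len(reference), len(hypothesis)
        # d[i][j] = edit distance between reference[:i] and hypothesis[:j]
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = i
        for j in range(1, m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + sub)   # substitution or match
        return 100.0 * d[n][m] / n

    print(symbol_error_rate("paul wrote a thesis".split(),
                            "paul wrote the thesis".split()))   # 25.0 (one substitution)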

Slide26

Technique

1. Phonetically transcribe corpus and reference files
2. Syllabify corpus and reference files
3. Build language models
4. Run a recognizer on 18 short human-computer telephone monologues
5. Compute mean, median, and std of SER for 1-gram, 2-gram, 3-gram, and 4-gram models over all monologues (aggregation sketched below)
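
Step 5 is then just descriptive statistics over the per-monologue error rates; continuing the sketch above (symbol_error_rate is the illustrative function from the previous slide):

    import statistics

    def summarize_ser(monologues):
        """monologues: list of (reference_symbols, hypothesis_symbols) pairs."""
        sers = [symbol_error_rate(ref, hyp) for ref, hyp in monologues]
        return {"mean": statistics.mean(sers),
                "median": statistics.median(sers),
                "std": statistics.stdev(sers)}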

Slide27

Results

Syllables Normed by Words (means, N = 2, 3, 4)

Substitution    Insertion    Deletion    SER
73.14%          103.00%      81.74%      85.3%

Syllables Compared to Words

         Mean SER, Words    Mean SER, Syllables
N = 2    46.4               41.0
N = 3    46.8               39.4
N = 4    46.7               39.0

Slide28

Experiment 3: A Concept Component

Hypothesis: A recognizer equipped with a post-processor that transforms syllable output to syllable/concept output will perform better than one not equipped with such a processor

Slide29

Technique

1. Develop equivalence classes from the training transcript: BE, WANT, GO, RETURN
2. Map the equivalence classes onto the reference files used to score the output of the recognizer
3. Map the equivalence classes onto the output of the recognizer (a sketch of this mapping follows this list)
4. Determine the SER of the modified output in step 3 with respect to the reference files in step 2
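
A minimal sketch of the mapping in steps 2 and 3. The class members shown are syllable strings taken from slides 6 and 31; they are abbreviated examples, since the real classes were larger and built from the training transcript.

    # Abbreviated, illustrative equivalence classes (syllable string -> concept label).
    CONCEPTS = {
        "WANT": ["ay w_uh_dd l_ay_kd"],                 # "I would like"
        "GO":   ["ax f_l_ay_td", "ax t_ih k_ax_td"],    # "a flight", "a ticket"
    }

    def map_concepts(syllable_string, concepts=CONCEPTS):
        """Replace any member syllable string with its concept label."""
        for label, members in concepts.items():
            for member in members:
                syllable_string = syllable_string.replace(member, label)
        return syllable_string

    print(map_concepts("ay w_uh_dd l_ay_kd ax t_ih k_ax_td"))   # WANT GO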

Slide30

Results

Concepts Compared to Syllables

         Mean SER, Syllables    Mean SER, Concepts
N = 2    41.0                   41.3
N = 3    39.4                   40.6
N = 4    39.0                   40.3

Concepts Normed by Syllables (means, N = 2, 3, 4)

Substitution    Insertion    Deletion    SER
1.06%           0.95%        1.09%       1.02%

2% decline overall. Why?

Slide31

Mapping Was Intended to Produce an Upper Bound on SER

For each distinct syllable string that appears in the hypothesis or reference files, search each of the concepts for a match
If there is a match, substitute the concept for the syllable string: ay w_uh_dd l_ay_kd → WANT
Misrecognition of a single syllable means no concept is inserted

Slide32

Misalignment Between Training and Reference Files

Equivalence classes were constructed using only the LM training transcript

More frequent in the reference files:
1st person singular (I want)
Imperatives (List all flights)

Less frequent in the reference files:
1st person plural (My husband and me want)
Polite forms (I would like)
BE does not appear (There should be, There's going to be, etc.)

Slide33

Summary

1. Perplexity: a syllable language model contains more information than a word language model (and probably will perform better)
2. A syllable language model results in a 14.7% mean improvement in SER
3. The very slight increase in mean SER for a concept language model justifies further research

Slide34

Further Research

Test the given system over a large production corpus
Develop a probabilistic concept language model
Develop the necessary software to pass the output of the concept language model on to an expert system

Slide35

The (Almost, Almost) Last Word

“But it must be recognized that the notion ‘probability of a sentence’ is an entirely useless one under any known interpretation of the term.”

Cited in Jurafsky and Martin (2009) from a 1969 essay on Quine.

Slide36

The (Almost) Last Word

He just never thought to count.

Slide37

The Last Word

Thanks to my generous committee:
Bill Croft, Department of Linguistics
George Luger, Department of Computer Science
Caroline Smith, Department of Linguistics
Chuck Wooters, U.S. Department of Defense

Slide38

References

Cover, T., Thomas, J. (1991). Elements of Information Theory. Hoboken, NJ: John Wiley & Sons.

Greenberg, S. (1999). Speaking in shorthand—A syllable-centric perspective for understanding pronunciation variation. Speech Communication, 29, 159-176.

Hamalainen, A., Boves, L., de Veth, J., ten Bosch, L. (2007). On the utility of syllable-based acoustic models for pronunciation variation modeling. EURASIP Journal on Audio, Speech, and Music Processing, 46460, 1-11.

Hemphill, C., Godfrey, J., Doddington, G. (2009). The ATIS Spoken Language Systems Pilot Corpus. Retrieved 6/17/09 from: http://www.ldc.upenn.edu/Catalog/readme_files/atis/sspcrd/corpus.html

Jurafsky, D., Martin, J. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall.

Jurafsky, D., Martin, J. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall.

Manning, C., Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.

NextIt. (2008). Retrieved 4/5/08 from: http://www.nextit.com

NIST. (2007). Syllabification software. National Institute of Standards: NIST Spoken Language Technology Evaluation and Utility. Retrieved 11/30/07 from: http://www.nist.gov/speech/tools/

Slide39

Additional Slides

Slide40

Transcription of a Recording

REF: (3.203,5.553) GIVE ME A FLIGHT BETWEEN SPOKANE AND SEATTLE
REF: (15.633,18.307) UM OCTOBER SEVENTEENTH
REF: (26.827,29.606) OH I NEED A PLANE FROM SPOKANE TO SEATTLE
REF: (43.337,46.682) I WANT A ROUNDTRIP FROM MINNEAPOLIS TO
REF: (58.050,61.762) I WANT TO BOOK A TRIP FROM MISSOULA TO PORTLAND
REF: (73.397,77.215) I NEED A TICKET FROM ALBUQUERQUE TO NEW YORK
REF: (87.370,94.098) YEAH RIGHT UM I NEED A TICKET FROM SPOKANE SEPTEMBER THIRTIETH TO SEATTLE RETURNING OCTOBER THIRD
REF: (107.381,113.593) I WANT TO GET FROM ALBUQUERQUE TO NEW ORLEANS ON OCTOBER THIRD TWO THOUSAND SEVEN

Slide41

Transcribed and Segmented¹

REF: (3.203,5.553) GIHV MIY AX FLAYTD BAX TWIYN SPOW KAEN AENDD SIY AE DXAXL
REF: (15.633,18.307) AHM AAKD TOW BAXR SEH VAXN TIYNTH
REF: (26.827,29.606) OW AY NIYDD AX PLEYN FRAHM SPOW KAEN TUW SIY AE DXAXL
REF: (43.337,46.682) AY WAANTD AX RAWNDD TRIHPD FRAHM MIH NIY AE PAX LAXS TUW
REF: (58.050,61.762) AY WAANTD TUW BUHKD AX TRIHPD FRAHM MIH ZUW LAX TUW PAORTD LAXNDD
REF: (73.397,77.215) AY NIYDD AX TIH KAXTD FRAHM AEL BAX KAXR KIY TUW NUW YAORKD
REF: (87.370,94.098) YAE RAYTD AHM AY NIYDD AX TIH KAXTD FRAHM SPOW KAEN SEHPD TEHM BAXR THER DXIY AXTH TUW SIY AE DXAXL RAX TER NIXNG AAKD TOW BAXR THERDD
REF: (107.381,113.593) AY WAANTD TUW GEHTD FRAHM AEL BAX KAXR KIY TUW NUW AOR LIY AXNZ AAN AAKD TOW BAXR THERDD TUW THAW ZAXNDD SEH VAXN

1. Produced by University of Colorado transcription software (to a version of ARPAbet), the National Institute of Standards (NIST) syllabifier, and my own Python classes that coordinate the two.

Slide42

With Inserted Equivalence Classes¹

REF: (3.203,5.553) GIHV MIY GO BAX TWIYN SPOW KAEN AENDD SIY AE DXAXL
REF: (15.633,18.307) AHM AAKD TOW BAXR SEH VAXN TIYNTH
REF: (26.827,29.606) OW WANT AX PLEYN FRAHM SPOW KAEN TUW SIY AE DXAXL
REF: (43.337,46.682) WANT AX RAWNDD TRIHPD FRAHM MIH NIY AE PAX LAXS TUW
REF: (58.050,61.762) WANT TUW BUHKD AX TRIHPD FRAHM MIH ZUW LAX TUW PAORTD LAXNDD
REF: (73.397,77.215) WANT GO FRAHM AEL BAX KAXR KIY TUW NUW YAORKD
REF: (87.370,94.098) YAE RAYTD AHM AY WANT GO FRAHM SPOW KAEN SEHPD TEHM BAXR THER DXIY AXTH TUW SIY AE DXAXL RETURN AAKD TOW BAXR THERDD
REF: (107.381,113.593) WANT GO FRAHM AEL BAX KAXR KIY TUW NUW AOR LIY AXNZ AAN AAKD TOW BAXR THERDD TUW THAW ZAXNDD SEH VAXN

1. A subset of a set in which all members share an equivalence relation. WANT is an equivalence class with members I need, I would like, and so on.

Slide43

Including Flat Language Model: Word Perplexity

                        Flat LM    N = 1     N = 2    N = 3    N = 4
Mean, Appendix C        482        112.56    39.44    31.44    30.42
Median, Appendix C      482        110.82    38.74    28.96    29.31
Std Dev, Appendix C     0.0        13.61     7.21     5.82     4.05
Mean, ATIS              1604       191.84    41.35    31.51    31.43
Median, ATIS            1604       191.74    38.93    28.2     30.19
Std Dev, ATIS           0.0        .57       8.64     7.6      4.32

Slide44

Including Flat LM: Syllable Perplexity

                        Flat LM    N = 1     N = 2    N = 3    N = 4
Mean, Appendix C        537        177.99    35.96    22.26    21.04
Median, Appendix C      537        177.44    35.81    22.71    20.75
Std Dev, Appendix C     0.0        1.77      8.49     3.73     3.53
Mean, ATIS              1314       231.13    22.42    14.91    14.11
Median, ATIS            1314       231.26    21.15    14.74    13.2
Std Dev, ATIS           0.0        .097      4.36     1.83     1.79

Slide45

Words and Syllables Normed by Flat LM

Syllable Data Normed by Flat LM

                   N = 1     N = 2     N = 3    N = 4
Mean, Appendix C   33.14%    6.69%     4.15%    3.92%
Mean, ATIS         17.58%    1.71%     1.13%    1.07%

Words Data Normed by Flat LM

                   N = 1     N = 2     N = 3    N = 4
Mean, Appendix C   23.35%    8.18%     6.52%    6.31%
Mean, ATIS         11.95%    11.96%    1.96%    1.95%

Slide46

Syllabifiers

Syllabifier from the National Institute of Standards and Technology (NIST, 2007)
Based on Daniel Kahn's 1976 dissertation from MIT (Kahn, 1976)
Generative in nature and English-biased

Slide47

Syllables

Estimates of the number of English syllables range from 1000 to 30,000
This suggests that there is some difficulty in pinning down what a syllable is
The usual hierarchical approach:

[Diagram: syllable → onset (C) + rhyme; rhyme → nucleus (V) + coda (C)]

Slide48

Sonority

Sonority rises to the nucleus and falls to the coda
Speech sounds appear to form a sonority hierarchy (from highest to lowest): vowels, glides, liquids, nasals, obstruents
Useful but not absolute: e.g., both depth and spit seem to violate the sonority hierarchy