Uploaded On 2016-12-01

Presentation Transcript

Slide1

A new Golden Age of phonetics?

Mark Liberman

University of Pennsylvania

myl@cis.upenn.edu

Slide2

A bit of technical history

Recording devices

“visible speech”

wax cylinders

wire recorders

tape recorders

digital audio

Analysis devices

mechanical (oscillograph, …)

electro-mechanical (spectrograph, …)

digital (…)

Slide3

A quick course in acoustic phonetics

Letters, phonemes, segments

Duration

Pitch

Vowel quality

Source-filter model

Formant model

Slide4

The promise

We see that the computer has opened up to linguists a host of challenges, partial insights, and potentialities. We believe these can be aptly compared with the challenges, problems, and insights of particle physics. Certainly, language is second to no phenomenon in importance. And the tools of computational linguistics are considerably less costly than the multibillion-volt accelerators of particle physics. The new linguistics presents an attractive as well as an extremely important challenge.

There is every reason to believe that facing up to this challenge will ultimately lead to important contributions in many fields.

Language and Machines: Computers in Translation and Linguistics

Report by the Automatic Language Processing Advisory Committee (ALPAC), National Academy of Sciences, 1966

Slide5

The paradox

ALPAC (and Pierce 1969):

computers → new language science

language science → language engineering

What actually happened:

computers → new language engineering

engineering → new language science ???

Slide6

Focusing on speech science…

Plenty of computer use

minicomputers in the 1960s

micro- and super-computers in the 1980s

ubiquitous laptops today

Applications:

replaced tape splicing

replaced sound spectrograph

easier pitch tracking, formant tracking

more convenient statistical analysis

and so on

BUT…

Slide7

No phonetic quantum mechanics

Great speech science by smart people

But surprisingly little change

in style and scale of research 1966-2009

in scientific questions about speech

in the rate of progress compared to 1946-1966 (the first golden age of phonetics)

…at least on the acoustic analysis side.

Peterson & Barney 1951 data is still relevant

many contemporary publications are similar in style and scale

Slide8

“Challenges, partial insights, and potentialities”

In speech science

there are many unmet challenges,

not enough new insights,

and the potentialities are still mostly potential

Something was missing in 1966

adequate accessible digital speech data

tools for large-scale automated analysis

applicable research paradigms

We now have two out of three…

Slide9

“Challenges…”

Variation and invariants

Individual

Contextual

Social

Communicative

Problem of correlated variables

Factorial design vs. MLM

Laboratory vs. natural data

Descriptive dimensions?

e.g. F0/amplitude/time/etc. for prosody

Slide10

“…partial insights…”

[…in fact we know a lot about speech…]

Slide11

“…potentialities”

4-6 orders of magnitude more speech

“Found data” as well as gifts from DARPA

Purely bottom-up analysis

F0, Voice/noise/silence, speaking rate

New descriptive dimensions

Analysis based on forced alignment

Good results from OK transcripts

Qualitative: “pronunciation modeling”

Quantitative: old and new dimensions

Better statistical methods

Slide12

“…important contributions in many fields…”

Social sciences

Sociolinguistics and dialect geography

New approaches to survey data

Phonetics of rhetoric

Speech pathology/therapy

Language teaching/learning

Slide13

Two kinds of science

1. Explore, observe, explain

Yogi Berra: “Sometimes you can observe a lot just by watching”

“Botanizing” / exploratory data analysis

2. Hypothesize and test

Slide14

Data from CallHome M/F conversations; about 1M F0 values per category.

Slide15

(11,700 conversational sides; mean=173, sd=27)

(Male mean 174.3, female 172.6: difference 1.7, effect size d=0.06)

Slide16
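The effect size quoted on slide 15 can be checked with a few lines of arithmetic. The sketch below computes a standardized mean difference (Cohen's d) from the reported means and standard deviation. Using the overall sd of 27 as the standardizer is an assumption; the slide does not say which standard deviation was used. The function name `cohens_d` is illustrative.

```python
def cohens_d(mean_a: float, mean_b: float, sd: float) -> float:
    """Standardized mean difference: (mean_a - mean_b) / sd."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (mean_a - mean_b) / sd

# Values quoted on slide 15. Treating the overall sd (27) as the
# standardizer is an assumption, not something the slide states.
d = cohens_d(174.3, 172.6, 27.0)
print(round(d, 2))  # 0.06
```

A difference of 1.7 against a spread of 27 is a very small effect, which matches the d=0.06 on the slide.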
Slide17
Slide18

Data from Switchboard; phrases defined by silent pauses

(Yuan, Liberman & Cieri, ICSLP 2006)

Slide19
Slide20
Slide21
Slide22
Slide23
Slide24
Slide25

Evanini, Isard & Liberman, “Automatic formant extraction for sociolinguistic analysis of large corpora”, Interspeech 2009

Slide26

Yuan & Liberman, Interspeech 2009

Orthographically-transcribed natural speech is available in very large quantities.

With pronunciation modeling and forced alignment, we can use this data for phonetics research.

Automatic acoustic measures based on simple statistical models can sometimes be helpful.

Here we examine the distribution of /l/-darkness in ~26 (out of ~9000) hours of U.S. Supreme Court oral arguments: ~22,000 tokens of /l/.

Slide27

Yuan & Liberman: Interspeech 2009

Introduction

English /l/ is traditionally classified into at least two allophones: “dark /l/”, which appears in syllable rimes, and “clear /l/”, which appears in syllable onsets.

Sproat and Fujimura (1993):

clear and dark allophones are not categorically distinct;

a single phonological entity /l/ involves two gestures: a vocalic dorsal gesture and a consonantal apical gesture.

The two gestures are inherently asynchronous: the vocalic gesture is attracted to the nucleus of the syllable, and the consonantal gesture is attracted to the margin (“gestural affinity”).

In a syllable-final /l/, the tongue dorsum gesture shifts left to the syllable nucleus, making the vocalic gesture precede the consonantal, tongue apex gesture.

In a syllable-initial /l/, the apical gesture precedes the dorsal gesture.

Slide28

Introduction

Clear /l/ has a relatively high F2 and a low F1;

Dark /l/ has a lower F2 and a higher F1;

Intervocalic /l/s are intermediate between the clear and dark variants (Lehiste 1964).

An important piece of evidence for the “gestural affinity” proposal:

Sproat and Fujimura (1993) found that the backness of pre-boundary intervocalic /l/ (in /i - ɪ/) is correlated with the duration of the pre-boundary rime. The /l/ in longer rimes is darker.

S&F (1993) devised a set of boundaries with a variety of strengths, to ‘elicit’ different rime durations in laboratory speech:

Major intonation boundary (“|”): Beel, equate the actors.

VP phrase boundary (“V”): Beel equates the actors.

Compound-internal boundary (“C”): The beel-equator’s amazing.

‘#’ boundary (“#”): The beel-ing men are actors.

No boundary (“%”): Mr Beelik wants actors.

Slide29

Introduction

Figure 1 in Sproat and Fujimura (1993): Relation between F2-F1 (in Hz) and pre-boundary rime duration (in s) for (a) speaker CS and (b) speaker RS.

Slide30

Introduction

Figure 4 in Sproat and Fujimura (1993): A schematic illustration of the effects of rime duration on pre-boundary post-nuclear /l/.

Slide31

Introduction

Huffman (1997) showed that onset [l]s also vary in backness: the dorsum gesture for the intervocalic onset [l]s (e.g., in “below”) may be shifted leftward in time relative to the apical gesture, which makes a dark(er) /l/.

The data utilized in these studies (both S&F and Huffman) comprised only a few hundred tokens of /l/ in laboratory speech.

“The relation of duration and backness can be complicated by differences in coarticulatory effects of neighboring vowels, or by speaker-specific constraints on absolute degree of backness.” (Huffman 1997).

=> Our study uses a very large speech corpus where these complications average out.

Automatic formant tracking is error-prone, and it is time-consuming to measure formants by hand.

=> We develop a new method to quantify /l/ backness without formant tracking.

Slide32

Our Data

The SCOTUS corpus includes more than 50 years of oral arguments from the Supreme Court of the United States – nearly 9,000 hours in total. For this study, we used only the Justices’ speech (25.5 hours) from the 2001-term arguments, along with the orthographic transcripts.

The phone boundaries were automatically aligned using the PPL forced aligner trained on the same data, with the HTK toolkit and the CMU pronouncing dictionary.

The dataset contains 21,706 tokens of /l/, including 3,410 word-initial [l]s, 7,565 word-final [l]s, and 10,731 word-medial [l]s.

Slide33

The Penn Phonetics Lab Forced Aligner

The aligner’s acoustic models are GMM-based monophone HMMs on 39 PLP coefficients. The monophones include:

speech segments: /t/, /l/, /aa1/, /ih0/, … (ARPAbet)

non-speech segments: {sil} silence; {LG} laugh; {NS} noise; {BR} breath; {CG} cough; {LS} lip smack

{sp} is a “tee” model with a direct transition from the entry to the exit node in the HMM (so “sp” can have 0 length), used for handling possible inter-word silence.

The mean absolute difference between manual and automatically-aligned phone boundaries in TIMIT is about 12 milliseconds.
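As a rough illustration of how such an agreement figure is computed, the sketch below takes two paired lists of boundary times and returns their mean absolute difference. The function name and the boundary values are made up for illustration; they are not TIMIT data, and the slide does not describe how boundaries were paired.

```python
def mean_abs_boundary_diff(manual, automatic):
    """Mean absolute difference between paired boundary times (same units)."""
    if len(manual) != len(automatic) or not manual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(m - a) for m, a in zip(manual, automatic)) / len(manual)

# Made-up boundary times in milliseconds, for illustration only.
manual_ms = [100, 250, 420]
automatic_ms = [110, 245, 440]
print(mean_abs_boundary_diff(manual_ms, automatic_ms))
```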

http://www.ling.upenn.edu/phonetics/p2fa/

Slide34

Forced Alignment Architecture

Word and phone boundaries located

Slide35

Method

To measure the “darkness” of /l/ through forced alignment, we first split /l/ into two phones, L1 for the clear /l/ and L2 for the dark /l/, and retrained the acoustic models for the new phone set.

In training, word-initial [l]s (e.g., “like”, “please”) were categorized as L1 (clear); word-final [l]s (e.g., “full”, “felt”) were L2 (dark).

All other [l]s were ambiguous, and could be either L1 or L2.

During each iteration of training, the ‘real’ pronunciations of the ambiguous [l]s were automatically determined, and then the acoustic models of L1 and L2 were updated.

The new acoustic models were tested both on the training data and on a data subset that had been set aside for testing.

During the tests, all [l]s were treated as ambiguous: the aligner determined whether a given [l] was L1 or L2.
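The alternation described above (decide the ambiguous tokens, then update the two models) can be illustrated with a toy stand-in. The sketch below is not the HTK-based procedure used in the study: it replaces the HMMs over PLP coefficients with two 1-D Gaussians over a single made-up scalar feature, and the names `log_gauss` and `train` are hypothetical. It only shows the shape of the loop, in which fixed labels stay fixed and ambiguous tokens are reassigned each iteration.

```python
import math
from statistics import mean, pstdev

def log_gauss(x, mu, sigma):
    """Log density of a 1-D Gaussian with mean mu and std dev sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def train(tokens, iterations=5):
    """tokens: list of (feature, label) with label "L1", "L2", or None.

    None marks an ambiguous token. Each class needs at least one
    token with a fixed label so its model can be initialised.
    """
    labels = [lab for _, lab in tokens]   # None = not yet decided
    models = {}
    for _ in range(iterations):
        # Update each model from the tokens currently assigned to it.
        for name in ("L1", "L2"):
            xs = [x for (x, _), lab in zip(tokens, labels) if lab == name]
            models[name] = (mean(xs), max(pstdev(xs), 1e-3))
        # Re-decide only the ambiguous tokens; fixed labels never change.
        labels = [
            fixed if fixed is not None
            else max(("L1", "L2"), key=lambda n: log_gauss(x, *models[n]))
            for x, fixed in tokens
        ]
    return models, labels

# Made-up scalar features, purely illustrative (not data from the study).
tokens = [(1200, "L1"), (1100, "L1"), (1300, "L1"),
          (600, "L2"), (500, "L2"), (700, "L2"),
          (1150, None), (650, None)]
models, labels = train(tokens)
print(labels[-2:])  # ['L1', 'L2']
```

In the first pass the models are estimated from the unambiguous tokens only, which mirrors the role of word-initial and word-final [l]s as anchors.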

Slide36

Method

An example of L1/L2 classification through forced alignment:

Slide37

Method

If we use word-initial vs. word-final as the gold standard, the accuracy of /l/ classification by forced alignment is 93.8% on the training data and 92.8% on the test data.

Training data (rows: gold standard by word position; columns: classified by the aligner)

        L1      L2
L1      2987    235
L2      414     6757

Test data (same layout)

        L1      L2
L1      169     19
L2      23      371
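The two accuracy figures follow directly from the confusion matrices: correct classifications sit on the diagonal. A minimal check, with the helper name `accuracy` chosen for illustration:

```python
def accuracy(confusion):
    """confusion[i][j]: tokens with gold class i classified as class j."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Order of classes in both rows and columns: L1, L2.
training = [[2987, 235], [414, 6757]]
test = [[169, 19], [23, 371]]
print(f"{accuracy(training):.1%}")  # 93.8%
print(f"{accuracy(test):.1%}")      # 92.8%
```

The row totals across both sets (3,410 for L1 and 7,565 for L2) match the word-initial and word-final token counts reported for the dataset.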

These results suggest that acoustic fit to clear/dark allophones in forced alignment is a plausible way to estimate the darkness of /l/.

Slide38

Method

To compute a metric for the degree of /l/-darkness, we therefore ran forced alignment twice.

All [l]s were first aligned with the L1 model, and then with the L2 model.

The difference in log likelihood scores between the L2 and L1 alignments, the D score, measures the darkness of [l].

The larger the D score, the darker the [l].
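Written out, the D score for a token is its log likelihood under the L2 alignment minus its log likelihood under the L1 alignment. The sketch below only shows the sign convention. The log likelihood values are hypothetical, the function name `d_score` is illustrative, and any normalization applied in the study is outside this sketch.

```python
def d_score(loglik_l2: float, loglik_l1: float) -> float:
    """Darkness score: log likelihood under L2 minus log likelihood under L1."""
    return loglik_l2 - loglik_l1

# Hypothetical log likelihoods for two tokens (not values from the study).
print(d_score(-410.0, -425.5))  # 15.5  -> fits the dark (L2) model better
print(d_score(-390.0, -380.0))  # -10.0 -> fits the clear (L1) model better
```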

The histograms of the D scores:

Slide39

Results

To study the relation between rime duration and /l/-darkness, we use the [l]s that follow a primary-stress vowel (denoted as ‘1’).

Such [l]s can precede a word boundary (‘#’), or a consonant (‘C’) or a non-stress vowel (‘0’) within the word.

Slide40

Results

From the figure we can see that:

The [l]s in longer rimes have larger D scores, and hence are darker. This result is consistent with Sproat and Fujimura (1993).

The [l]s preceding a non-stress vowel (1_L_0) are less dark than the [l]s preceding a word boundary (1_L_#) or a consonant (1_L_C).

The relation between duration and darkness for 1_L_C is non-linear. The /l/ reaches maximum darkness when the stressed vowel plus /l/ is about 150-200 ms.

The syllable-final [l]s are always dark, even in very short rimes. This contradicts the finding of Sproat and Fujimura (1993) that the syllable-final /l/ in very short rimes is as clear as the canonical clear /l/.

Slide41

Results

To further examine the difference between clear and dark /l/, we compare the intervocalic (1_L_0), syllable-final or "ambisyllabic", with the intervocalic (0_L_1), syllable-initial.

The “rime” duration here means the duration of the previous vowel plus the duration of [l], regardless of putative syllabic affinity….

Slide42

Results

From the figures we can see that:

The intervocalic syllable-final [l]s have positive D scores, whereas the intervocalic syllable-initial [l]s have negative D scores.

There is a positive correlation between darkness and rime duration (i.e., the duration of /l/ and its preceding vowel) for the intervocalic syllable-final [l]s, but no correlation for the intervocalic syllable-initial [l]s.

For the intervocalic syllable-final /l/, there is a positive correlation between /l/ duration and darkness. No correlation between /l/ duration and darkness was found, however, for the intervocalic syllable-initial /l/.

These results suggest that there is a clear difference between the intervocalic syllable-final and syllable-initial /l/s.

Slide43

Conclusions

We found a strong correlation between the rime duration and /l/-darkness for syllable-final /l/. This result is consistent with Sproat and Fujimura (1993).

We found no correlation between /l/ duration and darkness for syllable-initial /l/. This result is different from Huffman (1997).

We found a clear difference in /l/ darkness between the 0_1 and 1_0 stress contexts, across all values of V+/l/ duration and of /l/ duration.

We found that the syllable-final /l/ preceding a non-stress vowel was less dark than preceding a consonant or a word boundary.

Also, there was a non-linear relationship between timing and quality for the /l/ preceding a consonant and following a primary-stress vowel. These segments reach a peak of darkness when the duration of the stressed vowel plus /l/ is about 150-200 ms.

Further research is needed to confirm and explain these results.

Slide44

Meta-Conclusion

Large “found” collections of speech can be used effectively in phonetics research.

Better pronunciation modeling and better forced alignment will be helpful.

But the existing technology is good enough to start with.
