Slide1
A new Golden Age of phonetics?
Mark Liberman
University of Pennsylvania
myl@cis.upenn.edu
Slide2
The promise
We see that the computer has opened up to linguists a host of challenges, partial insights, and potentialities. We believe these can be aptly compared with the challenges, problems, and insights of particle physics. Certainly, language is second to no phenomenon in importance. And the tools of computational linguistics are considerably less costly than the multibillion-volt accelerators of particle physics. The new linguistics presents an attractive as well as an extremely important challenge.
There is every reason to believe that facing up to this challenge will ultimately lead to important contributions in many fields.
Language and Machines: Computers in Translation and Linguistics
Report by the Automatic Language Processing Advisory Committee (ALPAC), National Academy of Sciences, 1966
Slide3
The paradox
ALPAC (and Pierce 1969):
computers → new language science
language science → language engineering
What actually happened:
computers → new language engineering
engineering → new language science ???
Slide4
Focusing on speech science…
Plenty of computer use
minicomputers in the 1960s
micro- and super-computers in the 1980s
ubiquitous laptops today
Applications:
replaced tape splicing
replaced sound spectrograph
easier pitch tracking, formant tracking
more convenient statistical analysis
and so on
BUT…
Slide5
No phonetic quantum mechanics
Great speech science by smart people
But surprisingly little change
in style and scale of research 1966-2009
in scientific questions about speech
in the rate of progress compared to 1946-1966
(the first golden age of phonetics)
…at least on the acoustic analysis side.
Peterson & Barney 1951 data is still relevant
many contemporary publications are similar in style and scale
Slide6
“Challenges, partial insights, and potentialities”
In speech science
there are many unmet challenges,
not enough new insights,
and the potentialities are still mostly potential
Something was missing in 1966
adequate accessible digital speech data
tools for large-scale automated analysis
applicable research paradigms
We now have two out of three…
Slide7
“Challenges…”
Variation and invariants
Individual
Contextual
Social
Communicative
Problem of correlated variables
Factorial design vs. MLM
Laboratory vs. natural data
Descriptive dimensions?
e.g. F0/amplitude/time/etc. for prosody
Slide8
“…partial insights…”
[…in fact we know a lot about speech…]
Slide9
“…potentialities”
4-6 orders of magnitude more speech
“Found data” as well as gifts from DARPA
Purely bottom-up analysis
F0, Voice/noise/silence, speaking rate
New descriptive dimensions
Analysis based on forced alignment
Good results from OK transcripts
Qualitative: “pronunciation modeling”
Quantitative: old and new dimensions
Better statistical methods
Slide10
“…important contributions in many fields…”
Social sciences
Sociolinguistics and dialect geography
New approaches to survey data
Phonetics of rhetoric
Speech pathology/therapy
Language teaching/learning
Slide11
Two kinds of science
1. Explore, observe, explain
Yogi Berra: “Sometimes you can observe a lot just by watching”
“Botanizing” / exploratory data analysis
2. Hypothesize and test
Slide12
(11,700 conversational sides; mean=173, sd=27)
(Male mean 174.3, female 172.6: difference 1.7, effect size d=0.06)
Slide13
Slide14
Slide15
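The effect size quoted for the male/female difference above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming that d is Cohen's d and that the overall standard deviation (27) is an acceptable stand-in for the pooled standard deviation; the slides do not say how d was computed, and the function name is illustrative.

```python
def cohens_d(mean_a, mean_b, pooled_sd):
    """Standardized mean difference between two groups."""
    return (mean_a - mean_b) / pooled_sd

# Means and sd are taken from the slide.
# Using the overall sd as the pooled sd is an assumption.
d = cohens_d(174.3, 172.6, 27.0)
print(round(d, 2))  # 0.06
```

A difference of 1.7 against a spread of 27 is small in standardized terms, which is what the reported d=0.06 expresses.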
Data from Switchboard; phrases defined by silent pauses
(Yuan, Liberman & Cieri, ICSLP 2006)
Slide16
Slide17
Data from CallHome M/F conversations; about 1M F0 values per category.
Slide18
Slide19
Slide20
Slide21
Slide22
Slide23
Evanini, Isard & Liberman, “Automatic formant extraction for sociolinguistic analysis of large corpora”, Interspeech 2009
Slide24
Yuan & Liberman: Interspeech 2009
Orthographically-transcribed natural speech
is available in very large quantities
With pronunciation modeling and forced alignment
we can use this data for phonetics research
Automatic acoustic measures
based on simple statistical models
can sometimes be helpful
Here we examine
the distribution of /l/-darkness
in ~26 (out of ~9000) hours
of U.S. Supreme Court oral arguments
… ~22,000 tokens of /l/
Slide25
Introduction
English /l/ is traditionally classified into at least two allophones:
“dark /l/”, which appears in syllable rimes
“clear /l/”, which appears in syllable onsets.
Sproat and Fujimura (1993):
clear and dark allophones are not categorically distinct;
single phonological entity /l/ involves two gestures:
a vocalic dorsal gesture and a consonantal apical gesture.
The two gestures are inherently asynchronous:
the vocalic gesture is attracted to the nucleus of the syllable
the consonantal gesture is attracted to the margin (“gestural affinity”).
In a syllable-final /l/, the tongue dorsum gesture
shifts left to the syllable nucleus,
making the vocalic gesture
precede the consonantal, tongue apex gesture.
In a syllable-initial /l/, the apical gesture precedes the dorsal gesture.
Slide26
Introduction
Clear /l/ has a relatively high F2 and a low F1;
Dark /l/ has a lower F2 and a higher F1;
Intervocalic /l/s are intermediate between the clear and dark variants (Lehiste 1964).
An important piece of evidence for the “gestural affinity” proposal:
Sproat and Fujimura (1993) found that the backness of pre-boundary intervocalic /l/ (in /i - ɪ/) is correlated with the duration of the pre-boundary rime. The /l/ in longer rimes is darker.
S&F (1993) devised a set of boundaries with a variety of strengths,
to ‘elicit’ different rime durations in laboratory speech:
Major intonation boundary: Beel, equate the actors. “|”
VP phrase boundary: Beel equates the actors. “V”
Compound-internal boundary: The beel-equator’s amazing. “C”
‘#’ boundary: The beel-ing men are actors. “#”
No boundary: Mr Beelik wants actors. “%”
Slide27
Introduction
Figure 1 in Sproat and Fujimura (1993): Relation between F2-F1 (in Hz) and pre-boundary rime duration (in s) for (a) speaker CS and (b) speaker RS.
Slide28
Introduction
Figure 4 in Sproat and Fujimura (1993): A schematic illustration of the effects of rime duration on pre-boundary post-nuclear /l/.
Slide29
Introduction
Huffman (1997) showed that onset [l]s also vary in backness:
the dorsum gesture for the intervocalic onset [l]s (e.g., in “below”)
may be shifted leftward in time relative to the apical gesture,
which makes a dark(er) /l/.
The data utilized in these studies (both S&F and Huffman)
comprised only a few hundred tokens of /l/ in laboratory speech.
“The relation of duration and backness can be complicated
by differences in coarticulatory effects of neighboring vowels,
or by speaker-specific constraints on absolute degree of backness.”
(Huffman 1997).
=> Our study uses a very large speech corpus
where these complications average out.
Automatic formant tracking is error-prone, and it is time-consuming to measure formants by hand.
=> We develop a new method to quantify /l/ backness
without formant tracking.
Slide30
Our Data
The SCOTUS corpus includes more than 50 years of oral arguments from the Supreme Court of the United States – nearly 9,000 hours in total. For this study, we used only the Justices’ speech (25.5 hours) from the 2001-term arguments, along with the orthographic transcripts.
The phone boundaries were automatically aligned using the PPL forced aligner trained on the same data, with the HTK toolkit and the CMU pronouncing dictionary.
The dataset contains 21,706 tokens of /l/, including
3,410 word-initial [l]s,
7,565 word-final [l]s, and
10,731 word-medial [l]s.
Slide31
The Penn Phonetics Lab Forced Aligner
The aligner’s acoustic models are GMM-based monophone HMMs
on 39 PLP coefficients. The monophones include:
speech segments: /t/, /l/, /aa1/, /ih0/, … (ARPAbet)
non-speech segments: {sil} silence; {LG} laugh; {NS} noise; {BR} breath; {CG} cough; {LS} lip smack
{sp} is a “tee” model with a direct transition
from the entry to the exit node in the HMM
(so “sp” can have 0 length), used for handling possible inter-word silence.
The mean absolute difference between manual and automatically-aligned phone boundaries in TIMIT is about 12 milliseconds.
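The boundary-agreement figure above is a mean absolute difference over paired boundaries. As a minimal sketch of that computation, assuming manual and automatic boundaries are already paired one-to-one and given in seconds (the function name and the sample times are made up for illustration and are not TIMIT data):

```python
def mean_abs_boundary_diff_ms(manual_s, auto_s):
    """Mean absolute difference, in milliseconds, between paired
    boundary times given in seconds."""
    if len(manual_s) != len(auto_s):
        raise ValueError("boundary lists must be paired one-to-one")
    if not manual_s:
        raise ValueError("need at least one boundary")
    total = sum(abs(m - a) for m, a in zip(manual_s, auto_s))
    return 1000.0 * total / len(manual_s)

# Hypothetical boundary times (seconds), for illustration only.
manual = [0.100, 0.250, 0.400]
auto = [0.110, 0.240, 0.420]
print(mean_abs_boundary_diff_ms(manual, auto))
```

Because the measure is absolute, early and late errors do not cancel; a systematic bias would need to be examined separately with the signed differences.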
http://www.ling.upenn.edu/phonetics/p2fa/
Slide32
Forced Alignment Architecture
Word and phone boundaries located
Slide33
Method
To measure the “darkness” of /l/ through forced alignment,
we first split /l/ into two phones,
L1 for the clear /l/ and L2 for the dark /l/,
and retrained the acoustic models for the new phone set.
In training, word-initial [l]s (e.g., like, please) were categorized as L1 (clear); the word-final [l]s (e.g., full, felt) were L2 (dark).
All other [l]s were ambiguous, which could be either L1 or L2.
During each iteration of training, the ‘real’ pronunciations of the ambiguous [l]s were automatically determined,
and then the acoustic models of L1 and L2 were updated.
The new acoustic models were tested on both the training data
and on a data subset that had been set aside for testing.
During the tests, all [l]s were treated as ambiguous:
the aligner determined whether a given [l] was L1 or L2.
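The training-time labelling described above can be sketched as a small lexicon-expansion step. This is only an illustration under stated assumptions: the examples "please" and "felt" suggest that "word-initial" means before the first vowel of the word and "word-final" means after its last vowel, so that reading is used here; vowels are recognized by the stress digit that CMU-dictionary style phones carry; the function names are hypothetical and this is not the authors' code.

```python
from itertools import product

def l_options(phones):
    """For each phone, list the labels the aligner may choose from."""
    vowel_idx = [i for i, p in enumerate(phones) if p[-1].isdigit()]
    options = []
    for i, p in enumerate(phones):
        if p != "L":
            options.append([p])
        elif vowel_idx and i < vowel_idx[0]:
            options.append(["L1"])          # before the first vowel: clear
        elif vowel_idx and i > vowel_idx[-1]:
            options.append(["L2"])          # after the last vowel: dark
        else:
            options.append(["L1", "L2"])    # ambiguous: aligner decides
    return options

def pronunciation_variants(phones):
    """Expand one dictionary pronunciation into its L1/L2 variants."""
    return [list(v) for v in product(*l_options(phones))]

# Pronunciations typed by hand in CMU-dictionary style.
print(pronunciation_variants(["P", "L", "IY1", "Z"]))    # please
print(pronunciation_variants(["F", "EH1", "L", "T"]))    # felt
print(pronunciation_variants(["R", "IY1", "L", "IY0"]))  # really
```

Giving the aligner both variants for a word-medial /l/ is what lets each training iteration pick the better-fitting label before the L1 and L2 models are re-estimated.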
Slide34
Method
An example of L1/L2 classification through forced alignment:
Slide35
Method
If we use word-initial vs. word-final as the gold standard,
the accuracy of /l/ classification by forced alignment is 93.8% on the training data and 92.8% on the test data.
Rows: gold standard by word position. Columns: classified by the aligner.
Training data:
        L1     L2
L1    2987    235
L2     414   6757
Test data:
        L1     L2
L1     169     19
L2      23    371
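The accuracy figures follow directly from the two confusion matrices. A minimal sketch of the arithmetic, with rows taken as the gold standard and columns as the aligner's choice (variable and function names are illustrative):

```python
def accuracy(confusion):
    """Fraction of tokens on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Counts from the slide: rows = gold standard (L1, L2), columns = aligner.
train_cm = [[2987, 235], [414, 6757]]
test_cm = [[169, 19], [23, 371]]

print(round(100 * accuracy(train_cm), 1))  # 93.8
print(round(100 * accuracy(test_cm), 1))   # 92.8
```

The row totals also add up across the two tables to the 3,410 word-initial and 7,565 word-final tokens reported for the dataset.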
These results suggest that acoustic fit
to clear/dark allophones in forced alignment
is a plausible way to estimate the darkness of /l/.
Slide36
Method
To compute a metric to measure the degree of /l/-darkness,
we therefore ran forced alignment twice.
All [l]s were first aligned with the L1 model, and then with the L2 model.
The difference in log likelihood scores between L2 and L1 alignments (the D score) measures the darkness of [l].
The larger the D score, the darker the [l].
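As a minimal sketch of the D score, assuming the two alignment passes each yield one log likelihood per /l/ token: the slides do not say whether any duration normalization is applied, so none is used here, and the function name and sample values are hypothetical.

```python
def d_scores(loglik_l1, loglik_l2):
    """D = log likelihood under L2 (dark) minus log likelihood under L1 (clear),
    computed token by token."""
    if len(loglik_l1) != len(loglik_l2):
        raise ValueError("need one score per token from each alignment pass")
    return [l2 - l1 for l1, l2 in zip(loglik_l1, loglik_l2)]

# Made-up log likelihoods for two tokens, for illustration only.
l1_pass = [-120.5, -98.0]
l2_pass = [-118.0, -101.5]
print(d_scores(l1_pass, l2_pass))  # [2.5, -3.5]
```

A positive D means the dark model fits the token better than the clear model; a negative D means the reverse.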
The histograms of the D scores:
Slide37
Results
To study the relation between rime duration and /l/-darkness, we use the [l]s that follow a primary-stress vowel (denoted as ‘1’).
Such [l]s can precede a word boundary (‘#’), or a consonant (‘C’) or a non-stress vowel (‘0’) within the word.
Slide38
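The context notation used in the results (1_L_#, 1_L_C, 1_L_0, and later 0_L_1) can be derived mechanically from a word's phone string. This is a sketch under assumptions: phones are in CMU-dictionary style with a stress digit on vowels, a missing neighbour counts as a word boundary, and secondary stress is passed through as ‘2’, a case the slides do not discuss. The function names are illustrative.

```python
def neighbour_class(phone):
    """Map a neighbouring phone to '#', a stress digit, or 'C'."""
    if phone is None:
        return "#"            # word boundary
    if phone[-1].isdigit():
        return phone[-1]      # vowel: '1' primary stress, '0' unstressed
    return "C"                # consonant

def l_context(phones, i):
    """Context label for the /l/ at index i within one word."""
    if phones[i] != "L":
        raise ValueError("index does not point at an /l/")
    prev = phones[i - 1] if i > 0 else None
    nxt = phones[i + 1] if i + 1 < len(phones) else None
    return f"{neighbour_class(prev)}_L_{neighbour_class(nxt)}"

print(l_context(["F", "UH1", "L"], 2))         # full   -> 1_L_#
print(l_context(["F", "EH1", "L", "T"], 2))    # felt   -> 1_L_C
print(l_context(["V", "AE1", "L", "IY0"], 2))  # valley -> 1_L_0
print(l_context(["B", "IH0", "L", "OW1"], 2))  # below  -> 0_L_1
```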
Results
From the figure we can see that:
The [l]s in longer rimes have larger D scores, and hence are darker. This result is consistent with Sproat and Fujimura (1993).
The [l]s preceding a non-stress vowel (1_L_0) are less dark
than the [l]s preceding a word boundary (1_L_#)
or a consonant (1_L_C).
The relation between duration and darkness for 1_L_C is non-linear.
The /l/ reaches maximum darkness
when the stressed vowel plus /l/ is about 150-200 ms.
The syllable-final [l]s are always dark, even in very short rimes.
This contradicts the finding of Sproat and Fujimura (1993)
that the syllable-final /l/ in very short rimes
is as clear as the canonical clear /l/.
Slide39
Results
To further examine the difference between clear and dark /l/,
we compare the intervocalic (1_L_0), syllable-final or "ambisyllabic",
with the intervocalic (0_L_1), syllable-initial.
The “rime” duration here means
the duration of the previous vowel plus the duration of [l]
regardless of putative syllabic affinity…
Slide40
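The “rime” duration defined above (previous vowel plus [l]) is straightforward to read off forced-alignment output. A minimal sketch, assuming the alignment for one word is available as (phone, start, end) tuples in seconds with /l/ labelled plainly as "L"; the data layout, the function name, and the sample times are assumptions for illustration.

```python
def rime_durations(intervals):
    """Return the vowel+[l] duration (seconds) for every /l/ that is
    directly preceded by a vowel. Vowels carry a stress digit."""
    durations = []
    for i in range(1, len(intervals)):
        phone, _, end = intervals[i]
        prev_phone, prev_start, _ = intervals[i - 1]
        if phone == "L" and prev_phone[-1].isdigit():
            durations.append(end - prev_start)
    return durations

# Hypothetical alignment of "valley" (times in seconds).
valley = [("V", 0.0, 0.125), ("AE1", 0.125, 0.25),
          ("L", 0.25, 0.3125), ("IY0", 0.3125, 0.5)]
print(rime_durations(valley))  # [0.1875]
```

The same measure is applied to both 1_L_0 and 0_L_1 tokens, which is why the slide notes that it ignores putative syllabic affinity.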
Results
From the figures we can see that:
The intervocalic syllable-final [l]s have positive D scores
whereas the intervocalic syllable-initial [l]s have negative D scores.
There is a positive correlation between darkness and rime duration
(i.e., the duration of /l/ and its preceding vowel)
for the intervocalic syllable-final [l]s,
but no correlation for the intervocalic syllable-initial [l]s.
For the intervocalic syllable-final /l/, there is a positive correlation between /l/ duration and darkness. No correlation between /l/ duration and darkness was found, however, for the intervocalic syllable-initial /l/.
These results suggest that there is a clear difference
between the intervocalic syllable-final and syllable-initial /l/s.
Slide41
Conclusions
We found a strong correlation between the rime duration
and /l/-darkness for syllable-final /l/.
This result is consistent with Sproat and Fujimura (1993).
We found no correlation between /l/ duration
and darkness for syllable-initial /l/.
This result is different from Huffman (1997).
We found a clear difference in /l/ darkness
between the 0_1 and 1_0 stress contexts,
across all values of V+/l/ duration and of /l/ duration.
We found that the syllable-final /l/ preceding a non-stress vowel
was less dark than preceding a consonant or a word boundary.
Also, there was a non-linear relationship between timing and quality
for the /l/ preceding a consonant
and following a primary-stress vowel.
These segments reach a peak of darkness
when the duration of the stressed vowel plus /l/ is about 150-200 ms.
Further research is needed to confirm and explain these results.
Slide42
Meta-Conclusion
Large “found” collections of speech
can be used effectively in phonetics research.
Better pronunciation modeling
and better forced alignment
will be helpful.
But the existing technology
is good enough to start with.