Slide1
Modeling infant word segmentation: Another example of discovery fueled by CHILDES
Alejandrina Cristia, Laboratoire de Sciences Cognitives et Psycholinguistique
@ Language Emergence: Competition, Usage, and Analyses, 2019-06-06
Slide2
No overt & unambiguous word/morpheme boundaries in the input…
“no silences” (Kuhl 2004)
Slide3
“no silences” (Kuhl 2004)
… yet by the end of the first year, infants know some words/morphemes
‘Feet’ ‘mommy’ ‘baby’ ‘alldone’ ‘tobed’
Tincoff & Jusczyk 2012; Bergelson & Swingley 2012; Ngon et al. 2014
Slide4
How to study segmentability?
mommy talking
… cute … something shiny go by?
Let’s just get to the facts.
Slide5
Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences for child-directed versus adult-directed register (in French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies
Slide6
Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences for child-directed versus adult-directed register (in French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies
Slide7
Input representation
Acoustic
+ realistic…
… provided representations match babies’
- few appropriate corpora (natural discourse & good quality audio)
- only one (reproducible) algorithm
Symbolic (‘Phonological text’)
+ lots of corpora can be used
+ lots of algorithms proposed
+ algorithms represent a wide range of strategies
- assumes babies represent input abstractly, with zero errors
Slide8
Example
*MOT: look at the doggie
Phonologize: lUk At D2 dOgi
Remove word boundaries & unitize: l U k A t D 2 d O g i
Segment with some algorithm: lU kAt D2 dO gi
Evaluate:
Precision = 1 of the 5 words found was a word in the input = .2
Recall = 1 of the 4 words in the input was recovered = .25
Token F-score = 2 * (Precision * Recall) / (Precision + Recall)
Note -- one can also unitize at the syllable level:
lUk At D2 dO gi (input)
lUk At D2 dO gi (output)
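The evaluation step of this example can be reproduced in a few lines. What follows is a minimal sketch, not the WordSeg implementation: the names `token_spans` and `token_scores` are hypothetical, and it assumes one character per phoneme so that character offsets identify each word token.

```python
def token_spans(words):
    """Return the set of (start, end) character offsets covered by each word."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def token_scores(gold, predicted):
    """Token precision, recall and F-score for one space-separated utterance."""
    gold_words, predicted_words = gold.split(), predicted.split()
    if "".join(gold_words) != "".join(predicted_words):
        raise ValueError("gold and predicted must contain the same phonemes")
    gold_spans = token_spans(gold_words)
    predicted_spans = token_spans(predicted_words)
    # A predicted token is correct only if both of its edges match the gold.
    hits = len(gold_spans & predicted_spans)
    precision = hits / len(predicted_spans)
    recall = hits / len(gold_spans)
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


precision, recall, f_score = token_scores("lUk At D2 dOgi", "lU kAt D2 dO gi")
print(precision, recall, round(f_score, 2))
```

On the utterance above, only `D2` has both edges right, which gives the precision of .2 and recall of .25 shown on the slide, and a token F-score of about .22.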
Slide9
Example algorithms
Package: wordseg.readthedocs.io
Preprint: https://osf.io/nx49h/
Bernard et al. 2019 Beh Res Meth
1. Baseline
Simplest strategies
Every sentence is a word (SentBase)
Every syllable is a word (SyllBase)
Lignos 2012
2. Sub-lexical
Goal is to “cut” using local cues
Transitional Probabilities (TP) x Absolute/Relative threshold (TP_abs, TP_rel)
Diphone-Based Segmentation (DiBS)
Daland + 2009; Saksida + 2016
3. Lexical
Goal is to learn a set of “minimal recombinable units”
Adaptor Grammar (AG)
Phonotactics from Utterances Determine Distributional Lexical Elements (Puddle)
Johnson + 2007; Monaghan + 2010
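To make the sub-lexical idea of "cutting" on local cues concrete, here is a minimal sketch of transitional-probability segmentation. It is an illustration under stated assumptions, not the WordSeg code: the function names are hypothetical, the toy syllables are invented, "relative" is taken to mean a cut at a local minimum of TP, and "absolute" is taken to mean a cut wherever TP falls below the mean TP over unit pairs.

```python
from collections import Counter


def forward_tps(utterances):
    """Estimate forward TPs, P(next unit | current unit), from a corpus."""
    firsts, pairs = Counter(), Counter()
    for utterance in utterances:
        units = utterance.split()
        firsts.update(units[:-1])            # units that have a successor
        pairs.update(zip(units, units[1:]))  # adjacent unit pairs
    return {pair: count / firsts[pair[0]] for pair, count in pairs.items()}


def segment_tp(utterance, tps, threshold="relative"):
    """Segment one utterance that was part of the corpus used for `tps`."""
    units = utterance.split()
    if len(units) < 2:
        return "".join(units)
    values = [tps[pair] for pair in zip(units, units[1:])]
    if threshold == "absolute":
        # Assumption: the cutoff is the mean TP over unit-pair types.
        cutoff = sum(tps.values()) / len(tps)
        cuts = [value < cutoff for value in values]
    else:
        # Assumption: cut where TP is lower than both neighbouring TPs.
        cuts = [
            0 < i < len(values) - 1 and values[i - 1] > value < values[i + 1]
            for i, value in enumerate(values)
        ]
    words = [units[0]]
    for unit, cut in zip(units[1:], cuts):
        if cut:
            words.append(unit)   # start a new word at the boundary
        else:
            words[-1] += unit    # glue the unit onto the current word
    return " ".join(words)


corpus = ["pre ty ba by", "pre ty dog gy", "ba by dog gy"]  # invented toy input
tps = forward_tps(corpus)
print(segment_tp("pre ty ba by", tps))
```

In this sketch the relative threshold can never cut next to an utterance edge, because an edge position has only one neighbouring TP to compare against.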
Slide10
The process in WordSeg
Package: wordseg.readthedocs.io
Preprint: https://osf.io/nx49h/
Bernard et al. 2019 Beh Res Meth
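The overall process (phonologized text with boundaries, then preparation, segmentation, and comparison against a gold version) can be sketched without the package itself. The code below is a self-contained illustration and does not call the WordSeg API: the function names are hypothetical, and the `;esyll` and `;eword` boundary tags are modeled on the package's documented input format, so treat the exact tag strings as an assumption.

```python
SYLL, WORD = ";esyll", ";eword"  # assumed syllable and word boundary tags


def prepare(tagged, unit="phone"):
    """Drop word boundaries and return the utterance as space-separated units."""
    if unit == "phone":
        return " ".join(t for t in tagged.split() if t not in (SYLL, WORD))
    units, current = [], []
    for token in tagged.split():
        if token == SYLL:
            units.append("".join(current))  # close the current syllable
            current = []
        elif token != WORD:
            current.append(token)
    if current:
        units.append("".join(current))
    return " ".join(units)


def gold(tagged):
    """Return the correctly segmented utterance, one space between words."""
    words, current = [], []
    for token in tagged.split():
        if token == WORD:
            words.append("".join(current))
            current = []
        elif token != SYLL:
            current.append(token)
    return " ".join(words)


def sent_base(prepared):
    """SentBase: the whole utterance is treated as a single word."""
    return prepared.replace(" ", "")


def syll_base(prepared_syllables):
    """SyllBase: every syllable is a word, so syllable-unitized input is kept."""
    return prepared_syllables


tagged = ("l U k ;esyll ;eword A t ;esyll ;eword "
          "D 2 ;esyll ;eword d O ;esyll g i ;esyll ;eword")
print(prepare(tagged))
print(gold(tagged))
print(sent_base(prepare(tagged)))
print(syll_base(prepare(tagged, unit="syllable")))
```

The two baselines bracket the other algorithms: SentBase never cuts inside an utterance, while SyllBase cuts at every syllable edge.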
Slide11
Sample results: precision, recall, & F-score are correlated
Providence corpus (Demuth, Culbertson, & Alter, 2006) on CHILDES
Slide12
Sample results: Effects of algorithm and input representation
Naima, in Providence corpus (Demuth, Culbertson, & Alter, 2006) on CHILDES
Slide13
Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences for child-directed versus adult-directed register (in French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies
Slide14
Why look at register?
In child-directed speech, probably…
More utterances consist of a single word (+ all models)
Utterances are overall shorter in length (+ all models)
*MOT: Attends!
*MOT: Ouaistuvastemettreausoleilpourtesecherlescheveux!
Slide15
Why look at register?
In child-directed speech, probably…
More utterances consist of a single word (+ all models)
Utterances are overall shorter in length (+ all models)
Utterances are more repetitious (+? lexical models)
*MOT: coucoucoucousitufaisaisdespetitssourirestoi.
*MOT: tumefaisdespetitssouriresXXXcoucoumongrand.
*MOT: coucoutumefaisdessouriresoupas.
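The three register properties listed above can each be measured on a segmented corpus. The sketch below is one hedged way to do so: `register_stats` is a hypothetical helper, type/token ratio is used as a rough stand-in for repetitiousness (an assumption, since the slides do not name a measure), and the word boundaries in the sample utterances were re-inserted by hand for illustration.

```python
def register_stats(utterances):
    """Summarize a list of utterances whose words are separated by spaces."""
    tokenized = [utterance.split() for utterance in utterances if utterance.split()]
    if not tokenized:
        raise ValueError("need at least one non-empty utterance")
    tokens = [word for words in tokenized for word in words]
    return {
        # Share of utterances made up of exactly one word.
        "single_word_proportion": sum(len(w) == 1 for w in tokenized) / len(tokenized),
        # Mean utterance length, in words.
        "mean_length": len(tokens) / len(tokenized),
        # Lower type/token ratio means more repetition of the same words.
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }


sample = [
    "attends",
    "ouais tu vas te mettre au soleil pour te secher les cheveux",
]
print(register_stats(sample))
```

Running the same function over a child-directed and an adult-directed sample of equal size gives directly comparable numbers for the first two properties; type/token ratio is sensitive to corpus size, so matched sizes matter there.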
Slide16
(Ask me about crosslinguistic extensions if curious!)
Japanese
RIKEN corpus
Collected in the lab
adult-directed speech is with experimenter
English
Winnipeg corpus
Collected with child-worn device worn whole day
adult-directed speech is among caregivers
French
LENA-Lyon corpus (Le Normand et al. HomeBank)
Collected with child-worn device worn whole day
adult-directed speech is among caregivers
Bogdan Ludusan
Georgia Loukatou
Slide17
French “wild” ADS
on Le Normand, Canault, & Van Thai’s LENA-Lyon corpus
Loukatou + 2019 Proc Cog Sci
Slide18
CDS-ADS: Conclusions
Overall trend for better performance for child- than adult-directed speech
But:
reversed for some algorithms
effect of register < 15% (in the best controlled cases, 2%)
Slide19
Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences for child-directed versus adult-directed register (in French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies
Slide20
Why study word segmentation in a bilingual setting?
Bilinguals need to:
Learn words, like monolinguals do, but in two languages
‘Feet’ ‘mommy’ ‘baby’ ‘alldone’ ‘tobed’
‘pié’ ‘mamá’ ‘bebé’ …
Overall less input in each language (Hoff + 2012)
Fibla & Cristia (submitted very soon, I hope)
Slide21
Questions & predictions
Are segmentation strategies equally successful when applied to bilingual and monolingual corpora?
→ Measure the performance of previously studied segmentation algorithms in a controlled monolingual versus bilingual corpus.
Possible outcomes:
The confusion hypothesis: variable and inconsistent input
→ Poorer performance for the bilingual than for the monolingual
The resistant hypothesis: (if switching only at utterance edges) local statistical and lexical cues are still reliable
→ Similar performance for the bilingual and the monolingual
Fibla & Cristia (submitted very soon, I hope)
Slide22
Creating bilingual corpora
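The resistant hypothesis assumes that language switches happen only at utterance edges. One simple way to build a corpus with that property is to interleave utterances from two monolingual corpora. This is a sketch for illustration only: the procedure actually used by Fibla & Cristia is not described in these slides, and `interleave` and its `block_size` parameter are hypothetical.

```python
def interleave(corpus_a, corpus_b, block_size=1):
    """Alternate blocks of utterances from two corpora, switching at utterance edges."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    # Use the same number of utterances from each language.
    n = min(len(corpus_a), len(corpus_b))
    mixed = []
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        mixed.extend(corpus_a[start:stop])
        mixed.extend(corpus_b[start:stop])
    return mixed


english = ["e1", "e2", "e3"]   # placeholders standing in for utterances
spanish = ["s1", "s2", "s3"]
print(interleave(english, spanish, block_size=2))
```

Comparing a mixed corpus against monolingual corpora of the same total size keeps the amount of input constant, so any performance difference can be attributed to the mixing itself.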
Slide23
Slide24
Slide25
Three cases of bilingual < monolingual
Slide26
Three cases of bilingual < monolingual
11 cases of bilingual ‘in between’ monolingual
Slide27
Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences for child-directed versus adult-directed register (in French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies
Slide28
Effects of algorithm and input representation
Size of algorithm x level effect = 40-60%?
Cristia + 2019 Open Mind
Slide29
Effect of register
on LENA-Lyon corpus
Size of register effect < 10%?
Loukatou + 2019 Proc Cog Sci
Slide30
Effect of bilingualism
Size of bilingualism effect ~ 0%?
Fibla & Cristia (submitted very soon, I hope)
Slide31
Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences as a function of language properties
… child-directed versus adult-directed register (in Japanese, English, & French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies
Slide32
What may babies be doing? Using CDI results & frequency effects
Larsen + 2017 Interspeech & in prep
Slide33
What may babies be doing? Using CDI results & frequency effects
Larsen + 2017 Interspeech & in prep
Coefficient of determination R² = .1
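For readers who want the definition behind the reported value, the coefficient of determination of a simple linear regression is 1 minus the ratio of residual to total sum of squares. The sketch below computes it in plain Python. It assumes a single predictor (for instance, a word's frequency in some algorithm's output) and a single outcome (for instance, a CDI-based score); the exact variables and model used by Larsen et al. are not given in these slides, and `r_squared` is a hypothetical name.

```python
def r_squared(x, y):
    """R^2 of the least-squares line predicting y from x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or ss_total == 0:
        raise ValueError("x and y must both vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - ss_residual / ss_total


# Invented numbers, only to show the call; they are not data from the study.
print(round(r_squared([1, 2, 3, 4], [1, 3, 2, 4]), 2))
```

An R² of .1 means the predictor accounts for about 10% of the variance in the outcome under this linear model.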
Slide34
Slide35
Slide36
Slide37
phoneme-based models
Slide38
syllable-based models
phoneme-based models
Slide39
Cut only at utterance edges
frequency of words in isolation
Slide40
To be continued…
Slide41
Thanks to...
Families who agree to be recorded & for their data to be shared
Researchers who record them and share on TalkBank
TalkBank ~ Brian MacWhinney
& you!
Slide42
Japanese “lab” ADS
on Reiko Mazuka’s RIKEN corpus
much of this is in Ludusan et al. 2017 ACL
(now working on journal paper with more material)
Slide43
English “wild” ADS
on Melanie Soderstrom’s Winnipeg corpus
Cristia + 2019 Open Mind