Slide1
Thai Broadcast News Corpus Construction and Evaluation
Markpong Jongtaveesataporn†, Chai Wutiwiwatchai‡, Koji Iwano†, Sadaoki Furui†
† Tokyo Institute of Technology, Japan
‡ NECTEC, Thailand
Slide2
Background on Thai speech recognition research
[Figure: timeline of Thai speech recognition research, ordered by increasing difficulty. Year axis: 1987, 1995, 1999, 2003, 2005, 2007. Milestones in order: isolated syllable recognition; isolated word recognition; connected sub-word recognition; small task continuous speech recognition; LVCSR; broadcast news transcription system. Annotation: newspaper read-speech recognition (Thienlikit et al., 2004).]
Slide3
Development of Thai Broadcast News Transcription System
Research on broadcast news transcription systems for Thai lags behind other languages
  English: 1995 (Stern, 1997)
  Japanese: 1997 (Matsuoka et al., 1997)
  Mandarin: 1998 (Guo et al., 1998)
  Italian: 2000 (Federico et al., 2000)
We need to speed up our research activities to catch up with others
Targets
  Development of a Thai broadcast news corpus
    Speech corpus: training and testing data
    Text corpus: language modeling
  Development of a prototype system
Slide4
Speech corpus
Structure information of broadcast news was annotated
  Section, speaker’s turn, segments
Property tags were annotated to each speaker’s turn
  Speaker’s name, if known
  Speaker’s gender: male / female
  Speaking mode: planned / spontaneous
  Background noise: clean / music / noise
Only speech from announcers speaking in the studio was transcribed
Transcription and annotation were created by one transcriber and checked by another transcriber
Slide5
Structure of broadcast news
Episode: one broadcast news session
  Section 1: one news topic
  Section 2
  Section 3
Slide6
Structure of broadcast news
Episode: one broadcast news session
  Section 1: one news topic
    Speaker’s turn: speaker A
    Speaker’s turn: speaker A
    Speaker’s turn: speaker B
    Speaker’s turn: speaker A
Slide7
Structure of broadcast news
Episode: one broadcast news session
  Section 1: one news topic
    Speaker’s turn: speaker A
      Segment: one sentence or clause
      Segment: one sentence or clause
      Segment: one sentence or clause
Slide8
Slide9
Example of structure information
Episode: one broadcast news session
  Section 1: Sports
    Speaker’s turn: Mr. A, male, planned speech, clean speech
      Segment: sentence A
      Segment: sentence B
      Segment: sentence C
Slide10
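The annotation hierarchy (episode, section, speaker’s turn, segment) and the property tags on each speaker’s turn could be represented roughly as follows. This is only a minimal sketch: the class and field names are hypothetical and are not taken from the corpus format or from the Transcriber tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """One sentence or clause inside a speaker's turn."""
    text: str


@dataclass
class SpeakerTurn:
    """Speech by one speaker, carrying the property tags (hypothetical field names)."""
    gender: str                          # "male" or "female"
    speaking_mode: str                   # "planned" or "spontaneous"
    background: str                      # "clean", "music", or "noise"
    speaker_name: Optional[str] = None   # None when the speaker is unknown
    segments: List[Segment] = field(default_factory=list)


@dataclass
class Section:
    """One news topic."""
    topic: str
    turns: List[SpeakerTurn] = field(default_factory=list)


@dataclass
class Episode:
    """One broadcast news session."""
    sections: List[Section] = field(default_factory=list)


def count_segments(episode: Episode) -> int:
    """Count all segments in an episode by walking the hierarchy."""
    return sum(len(turn.segments)
               for section in episode.sections
               for turn in section.turns)


# The example from the slide: a sports section with one turn and three segments.
example = Episode(sections=[
    Section(topic="Sports", turns=[
        SpeakerTurn(gender="male", speaking_mode="planned", background="clean",
                    speaker_name="Mr. A",
                    segments=[Segment("sentence A"),
                              Segment("sentence B"),
                              Segment("sentence C")]),
    ]),
])
```

Keeping the property tags on the turn rather than on each segment mirrors the annotation scheme, where gender, speaking mode, and background noise are assigned per speaker’s turn.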
Slide11
Text corpus
No structure information was annotated
Additional information
  Speaking mode: planned / spontaneous
Slide12
Problems of Thai transcription text
No space between words
  The definition of a word is very ambiguous
  No good morphological analyzer
  Difficulties in the transcription and checking process
Manually word-segmented transcription was made
  Instructions were created for transcribers
Automatically segmented transcription (future target)
Slide13
Broadcast news collection
News programs from one public TV station in Thailand were recorded
Total of 105 news episodes
  Speech corpus: 35 news episodes, 17 hours
  Text corpus: 70 news episodes
Slide14
Analysis of speech corpus
Slide15
Information of speech & text corpora
Attribute            Speech corpus     Text corpus
No. of sentences     13k               32k
No. of words         224k              573k
No. of unique words  10k               14k
No. of phonemes      899k              -
No. of speakers      8 female, 4 male  -
Slide16
Data used in experiments
Test set data
  Randomly selected from the speech corpus
  3,000 utterances
Acoustic model training data for the baseline system
  Phonetically balanced sentence speech corpora
  LOTUS (Kasuriya et al., 2003) and the corpus developed internally
  Read speech corpora
  40.3 hours (68 male and 68 female)
Acoustic model adaptation data
  Selected from the speech corpus
  No overlap between adaptation data and test set data
Language model training data
  Text corpus + transcripts from the speech corpus, excluding the test set
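The partitioning described above (a random test set, adaptation data that does not overlap with it, and language model transcripts that exclude the test set) could be sketched as below. The function name, the utterance IDs, and the fixed seed are assumptions for illustration; the slides do not say how the selection was implemented.

```python
import random
from typing import List, Sequence, Tuple


def split_speech_corpus(utterance_ids: Sequence[str],
                        n_test: int,
                        n_adapt: int,
                        seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Randomly draw a test set and a non-overlapping adaptation set.

    Returns (test, adaptation, lm_transcripts), where lm_transcripts holds
    every utterance outside the test set.
    """
    ids = list(utterance_ids)
    if n_test + n_adapt > len(ids):
        raise ValueError("not enough utterances for the requested split")
    random.Random(seed).shuffle(ids)
    test = ids[:n_test]
    adaptation = ids[n_test:n_test + n_adapt]   # drawn only from non-test data
    lm_transcripts = ids[n_test:]               # everything except the test set
    return test, adaptation, lm_transcripts


# Hypothetical IDs; only the 3,000-utterance test set size comes from the slide.
ids = [f"utt{i:05d}" for i in range(13000)]
test, adaptation, lm_transcripts = split_speech_corpus(ids, n_test=3000, n_adapt=200)
```

Shuffling once and slicing guarantees the three sets are consistent with each other, which is simpler to verify than drawing independent samples and removing collisions afterwards.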
Slide17
Experimental condition
Acoustic model
  Gender-dependent acoustic model
  12 MFCCs, delta, and delta energy
  Triphones, 1000 tied-states, 8 Gaussian mixtures
Language model
  Tri-grams
  Dictionary size: about 18k words
TITech WFST speech recognition system (Dixon et al., 2007) was used as a speech decoder
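As an illustration of the feature set, the sketch below appends delta coefficients to MFCC frames using the usual regression formula. It assumes that "delta" means the 12 delta MFCCs, giving 25 dimensions per frame; the regression window of 2 frames and the edge padding are also assumptions, since the slide does not state them. MFCC extraction itself is not shown.

```python
from typing import List

Frame = List[float]


def delta(frames: List[Frame], window: int = 2) -> List[Frame]:
    """Regression-based delta coefficients; edge frames are repeated as padding."""
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, window + 1):
                nxt = frames[min(t + k, n - 1)][d]
                prv = frames[max(t - k, 0)][d]
                num += k * (nxt - prv)
            row.append(num / denom)
        out.append(row)
    return out


def build_features(mfcc: List[Frame], log_energy: List[float]) -> List[Frame]:
    """Concatenate 12 MFCCs, their 12 deltas, and 1 delta energy per frame."""
    d_mfcc = delta(mfcc)
    d_energy = delta([[e] for e in log_energy])
    return [m + dm + de for m, dm, de in zip(mfcc, d_mfcc, d_energy)]


# Toy input: 10 frames whose values rise by 1 per frame.
mfcc = [[float(t)] * 12 for t in range(10)]
log_energy = [float(t) for t in range(10)]
features = build_features(mfcc, log_energy)
```

Note that the energy term itself is not appended, only its delta, which matches the "delta energy" wording on the slide.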
Slide18
Acoustic model adaptation
Supervised adaptation using MLLR
F-condition adaptation
  F0: clean, planned
  F1: clean, spontaneous
  F3: music noise
  F4: other noise
  Adaptation data: 200 utterances, regardless of speaker, randomly selected from the speech corpus
Speaker adaptation
  Adaptation data: 200 utterances, regardless of F-condition, randomly selected from the speech corpus
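MLLR adapts the Gaussian means of the acoustic model with an affine transform estimated from the adaptation data. The sketch below shows only how such a transform would be applied to a set of means; estimating the matrix and bias from the 200 adaptation utterances is not shown, and the function and variable names are hypothetical rather than taken from the system used here.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def apply_mllr_mean_transform(means: List[Vector], A: Matrix, b: Vector) -> List[Vector]:
    """Return the adapted mean A * mu + b for every Gaussian mean mu."""
    adapted = []
    for mu in means:
        adapted.append([sum(a_ij * m_j for a_ij, m_j in zip(row, mu)) + b_i
                        for row, b_i in zip(A, b)])
    return adapted


# Toy 2-dimensional example: identity rotation plus a constant shift.
means = [[1.0, 2.0], [0.0, -1.0]]
A = [[1.0, 0.0], [0.0, 1.0]]
b = [0.5, -0.5]
adapted_means = apply_mllr_mean_transform(means, A, b)
```

In the setup on this slide, one such transform would be estimated per F-condition or per speaker, depending on which adaptation mode is used.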
Slide19
WER results
Speaker adaptation yielded better WER
F-condition  Proportion (time)  #words
F0           35.3%              17160
F1           1.0%               629
F3           14.0%              7882
F4           49.7%              27542
Slide20
Discussion
High WER
  Mismatched recording conditions
    The speech corpus was used only as testing and adaptation data
  Small text corpus
    Inefficient language model
Slide21
Conclusion
Construction of the first Thai broadcast news corpus and an overview of the corpus analysis were presented
The speech corpus was annotated with structure information, which is useful for further research
An LVCSR system was set up and tested with the corpus
Slide22
Future work
Applying our Thai language modeling technique (Jongtaveesataporn et al., 2007)
  Compound pseudo-morpheme (CPM) unit
  Pseudo-morpheme error rate (F0 condition)
    Manually-segmented word unit system: 20.5%
    CPM unit system: 19.9%
Improving the language model by using newspaper text
Collaboration with NECTEC: additional 50 hours of speech corpus
Slide23
Thank you
Slide26
Background
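[Figure: timeline of Thai speech recognition research, ordered by increasing difficulty. Year axis: 1987, 1995, 1999, 2003, 2005, 2007. Milestones in order: isolated syllable recognition; isolated word recognition; connected sub-word recognition; small task continuous speech recognition; LVCSR; broadcast news LVCSR. Annotation: newspaper read-speech recognition (Thienlikit, 2004).]
Slide27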
Development of Thai Broadcast News LVCSR System
Development of an LVCSR system requires speech and text corpora
Existing speech corpora for Thai LVCSR research
  NECTEC-ATR
  LOTUS (NECTEC)
  GlobalPhone (CMU)
  Slide annotation: newspaper read-speech
Development of a Thai broadcast news corpus
  Speech corpus: training and testing data
  Text corpus: language modeling
Development of a prototype LVCSR system
Slide28
Experiments & Developed corpora
Speech corpus
  The size of the speech corpus is still rather small
  It was used in three ways
    Test data
    Adaptation data
    Part of the transcription text was used for training the LM
Text corpus
  It was used for training the LM
Slide29
Perplexity & OOV rates
F-condition  Perplexity (male)  Perplexity (female)  OOV rate (male)  OOV rate (female)
F0           107.5              106.9                0.9              0.8
F1           126.4              100.1                0.9              0.6
F3           145.2              100.0                0.7              0.9
F4           141.6              157.6                1.5              1.9
Overall      126.9              125.6                1.2              1.3
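For reference, the two measures in this table are conventionally computed as sketched below. The function names are hypothetical, the OOV rate is assumed to be a percentage of running words, and the per-word log probabilities are assumed to come from whatever language model toolkit is in use; none of these details are stated on the slide.

```python
import math
from typing import Iterable, List, Set


def oov_rate(tokens: Iterable[str], vocabulary: Set[str]) -> float:
    """Percentage of running words that are not in the vocabulary."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("empty token sequence")
    missing = sum(1 for tok in tokens if tok not in vocabulary)
    return 100.0 * missing / len(tokens)


def perplexity(log10_probs: List[float]) -> float:
    """Perplexity from per-word log10 probabilities assigned by a language model."""
    if not log10_probs:
        raise ValueError("no probabilities given")
    average = sum(log10_probs) / len(log10_probs)
    return 10.0 ** (-average)


# Toy data: one of four words is outside the vocabulary,
# and every word is assigned probability 0.01.
toy_oov = oov_rate(["a", "b", "c", "d"], {"a", "b", "c"})
toy_ppl = perplexity([math.log10(0.01)] * 5)
```

For Thai, both figures depend on the word segmentation used, since the token sequence and the vocabulary are defined only after segmentation.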
Slide30
Transcription process
[Figure: transcription workflow, with both corpora following a common guideline]
Speech corpus: transcribing (4 persons), checking (2 persons), lexical entries checking (1 person)
Text corpus: transcribing (7 persons), lexical entries checking (1 person)
Slide31
Speech corpus
Transcription and annotation of about 17 hours of TV broadcast news
Tool: “Transcriber” (Barras et al., 2001)
Additional information
  Speaker information: name, gender
  Speaking mode: planned/spontaneous speech
Speech from announcers speaking in the studio
Slide32
Transcription conventions
Guideline for the transcription process
  Segment segmentation
  Word segmentation
  Repeating word
  Thai/English abbreviation
  Number entity
  Special tags
Slide33
Introduction
Thai speech processing research at TokyoTech
  Dialogue system [Whittiwiwattchai, 2003]
  LVCSR system
    Dictation system [Tianlikid, 2005]
    Broadcast news recognition system
Slide34
Overview
Introduction
Corpus description
Recording and transcription processes
Corpus evaluation
Conclusion
Slide35
Thai language corpora
Large language corpora are crucial to a state-of-the-art natural language processing system
Thai speech resources for speech processing
  NECTEC-ATR
  LOTUS (NECTEC)
  GlobalPhone (CMU)
  TSynC-1 (NECTEC)
  Slide annotations: newspaper read-speech; unit-selection speech synthesis
Slide36
WER Result
F-condition  Time proportion  WER (%), male  WER (%), female
F0           28.1%            44.4           40.8
F1           1.5%             62.4           60.2
F3           11.5%            82.2           72.4
F4           58.9%            54.9           57.5
Overall      100%             56.8           45.5
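WER figures such as those in this table are conventionally obtained by aligning each recognized word sequence with its reference transcript and counting substitutions, deletions, and insertions. The sketch below is a generic illustration using edit distance, not the scoring tool used for these experiments; the function names are hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, deletions, and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Word error rate in percent, pooled over all utterances."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    if total == 0:
        raise ValueError("reference is empty")
    return 100.0 * errors / total


# Toy example: one substitution and one deletion against a 4-word reference.
toy_wer = wer([["a", "b", "c", "d"]], [["a", "x", "c"]])
```

Because Thai text has no word boundaries, the tokens compared here would be the word units of the manually segmented transcripts, so the score depends on the segmentation convention.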
Slide37
Text corpus
Text transcribed from 35 hours of TV broadcast news
Additional information
  Speaking mode: planned/spontaneous
Slide38
Transcription conventions (1)
Sentence segmentation
  No sentence marker in the Thai language
  Ambiguous
  Grammatically, there are 3 types of sentence
    Simple sentence
    Compound sentence
    Complex sentence
    Slide annotation: composed of several clauses or simple sentences
  A sentence was defined as a simple sentence or clause, delimited with the help of breaths
Slide39
Transcription conventions (2)
Word segmentation
  No word boundary marker in the Thai language
  Leads to difficulties in the transcription and data checking processes
  Too ambiguous to define all rules
  A few rules for simple segmentation patterns were defined
  Undefined patterns were left to the decision of transcribers
Slide40
Transcription conventions (3)
Repeating word
Thai/English abbreviation
Number entity
Special tags
  Disfluencies, filled pauses, exclamations
  Foreign words
  Some other events: uncertainly transcribed parts, etc.
Slide41
Recorded programs
News programs from one public TV station in Thailand were recorded
Total of 105 news episodes
  Speech corpus: 35 news episodes, about 17 hours of speech data
  Text corpus: 70 news episodes