Presentation Transcript

1. Back-End Synthesis
Julia Hirschberg
CS 4706
(Thanks to Dan and Jim)

2. Architectures of Modern Synthesis
Articulatory synthesis: model the movements of the articulators and the acoustics of the vocal tract.
Formant synthesis: start with acoustics; create rules/filters to produce each formant.
Concatenative synthesis: use databases of stored speech to assemble new utterances.
HMM synthesis.
(Text from Richard Sproat's slides)

3. Formant Synthesis
The most common commercial systems while computers were relatively underpowered:
1979: MIT's MITalk (Allen, Hunnicutt, Klatt)
1983: the DECtalk system, the voice of Stephen Hawking

4. Concatenative Synthesis
Used by all current commercial systems.
Diphone synthesis:
Units are diphones: from the middle of one phone to the middle of the next. Why? The middle of a phone is a steady state.
Record one speaker saying each diphone.
Unit selection synthesis:
Larger units.
Record 10 hours or more, so the database holds multiple copies of each unit.
Use search to find the best sequence of units.

5. TTS Demos (all unit selection)
Festival: http://www-2.cs.cmu.edu/~awb/festival_demos/index.html
Cepstral: http://www.cepstral.com/cgi-bin/demos/general
AT&T: http://www2.research.att.com/~ttsweb/tts/demo.php

6. How do we get from Text to Speech?
The TTS back end takes segments + F0 + durations and creates a waveform.
A full system needs to go all the way from arbitrary text to sound.

7. Front End and Back End
Example input: "PG&E will file schedules on April 20."
Text analysis: from text to an intermediate representation.
Waveform synthesis: from the intermediate representation to a waveform.

8. The Hourglass
[Figure: the "hourglass" view of TTS from slide 7: text analysis narrows the text down to an intermediate representation, and waveform synthesis expands it back out to a waveform.]

9. Waveform Synthesis
Given:
A string of phones
Prosody: the desired F0 for the entire utterance, a duration for each phone, and a stress value (possibly also an accent value) for each phone
Generate:
Waveforms
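Concretely, the back end's input can be pictured as a small data structure; this sketch is illustrative only (the field layout and values are invented, not from the slides):

```python
# Hypothetical intermediate representation handed to the back end:
# a phone string plus prosodic targets, as listed on this slide.
synthesis_spec = [
    # (phone, duration_ms, target_f0_hz, stressed)
    ("dh",  60, 110.0, False),
    ("ax",  50, 112.0, False),
    ("k",   80, 118.0, True),
    ("ae", 120, 125.0, True),
    ("t",   90, 115.0, True),   # "the cat"
]
```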

10. Diphone TTS Architecture
Training:
Choose the units (which kinds of diphones).
Record one speaker saying at least one example of each.
Mark boundaries and segment to create the diphone database.
Synthesizing from diphones (sketched below):
Select the relevant diphones from the database.
Concatenate them in order, doing minor signal processing at the boundaries.
Use signal processing techniques to change the prosody (F0, energy, duration) of the sequence.
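A minimal sketch of that synthesis loop, assuming a hypothetical `database` dict mapping diphone names to lists of waveform samples; `smooth_join` and `adjust_prosody` are stand-ins for the signal-processing steps named above:

```python
def smooth_join(left, right):
    # Stand-in for minor boundary smoothing (e.g., a short cross-fade).
    return right

def adjust_prosody(waveform, prosody):
    # Stand-in for imposing F0/energy/duration (e.g., via TD-PSOLA).
    return waveform

def synthesize(phones, database, prosody):
    """Concatenate diphone units for a phone string, then fix prosody."""
    # Phone string -> diphone names: middle of one phone to middle of next.
    names = [f"{a}-{b}" for a, b in zip(phones, phones[1:])]
    waveform = []
    for name in names:
        unit = database[name]                 # one recorded example of each
        if waveform:
            unit = smooth_join(waveform, unit)    # signal processing at join
        waveform.extend(unit)
    return adjust_prosody(waveform, prosody)
```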

11. Diphones
Where is the stable region? [Figure]

12. Diphone Database
The middle of a phone is more stable than its edges.
Need O(phones²) units, but some phone-phone sequences don't exist:
The AT&T system (Olive et al. '98) had 43 phones: 1,849 possible diphones, but only 1,172 actually occur.
Phonotactics: [h] only occurs before vowels.
Diphones across silence aren't needed.
But you may want to include stress or accent differences, consonant clusters, etc.
Requires a great deal of phonetic knowledge in the design.
The database is relatively small by today's standards: around 8 megabytes for English (16 kHz, 16-bit).
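A quick sanity check on that size (simple arithmetic, not from the slide):

```python
# 16 kHz, 16-bit mono audio: how much speech fits in ~8 MB?
bytes_per_second = 16_000 * 2                # 32,000 bytes per second
seconds = 8 * 1024 * 1024 / bytes_per_second
print(round(seconds / 60, 1))                # ~4.4 minutes of recorded speech
```

This squares with slide 20, which calls the full diphone inventory "a few minutes of speech."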

13. Voice
The speaker is called the voice talent. How do we choose one?
The diphone database is called a voice.
Modern TTS systems have multiple voices.

14. Prosodic Modification
Goal: modify pitch and duration independently.
Changing the sample rate modifies both at once: "chipmunk speech."
Duration: duplicate or remove parts of the signal.
Pitch: resample to change the pitch.
(Text from Alan Black)

15. Speech as a Sequence of Short-Term Signals
[Figure: a waveform divided into overlapping short-term windows. Slide from Alan Black.]

16. Duration Modification
Duplicate or remove short-term signals.
(Slide from Richard Sproat)

17. Duration Modification
Duplicate or remove short-term signals. [Figure]

18. Pitch Modification
Move the short-term signals closer together or further apart: more cycles per second means higher pitch, and vice versa.
Add frames as needed to maintain the desired duration.
(Slide from Richard Sproat)

19. TD-PSOLA™
Time-Domain Pitch-Synchronous Overlap-and-Add, patented by France Telecom (CNET).
Epoch detection and windowing; pitch-synchronous overlap-and-add.
Very efficient.
Can raise or lower F0 by up to a factor of two.
Smoother transitions.
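A rough sketch of the idea, not the patented algorithm itself: assuming NumPy, a known constant F0, and idealized epoch marks spaced one pitch period apart, re-placing the windowed cycles closer together raises the pitch while overlap-add preserves the duration:

```python
import numpy as np

def psola(signal, sample_rate, f0, factor):
    """Shift pitch by `factor` (about 0.5..2.0) at constant duration."""
    period = int(sample_rate / f0)              # samples per pitch cycle
    spacing = int(period / factor)              # spacing of synthesis marks
    # Idealized analysis epochs: one mark per pitch period.
    epochs = np.arange(period, len(signal) - period, period)
    window = np.hanning(2 * period)             # two-period Hann window
    out = np.zeros(len(signal))
    t = period                                  # synthesis time pointer
    while t < len(signal) - period:
        e = epochs[np.argmin(np.abs(epochs - t))]   # nearest analysis epoch
        frame = signal[e - period : e + period] * window
        out[t - period : t + period] += frame       # overlap-and-add
        t += spacing        # closer marks -> more cycles/sec -> higher pitch
    return out
```

For example, `psola(x, 16000, 120.0, 1.5)` would raise a 120 Hz voice by half, matching the slide's factor-of-two working range.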

20. Unit Selection Synthesis
A generalization of the diphone intuition:
Larger units: from diphones to phrases to ... sentences.
Record many copies of each unit: e.g., 10 hours of speech instead of 1,500 diphones (a few minutes of speech).
Label the diphones and their midpoints.

21. Unit Selection Intuition
Given a large labeled database, find the units that best match the desired synthesis specification.
What does "best" mean?
Target cost: find the closest match in terms of:
Phonetic context
F0, stress, phrase position
Join cost: find the best join with neighboring units:
Matching formants and other spectral characteristics
Matching energy
Matching F0

22. Targets and Target Costs
Target cost C(t, u): how well does the target specification t match database unit u?
Goal: find the unit least unlike the target.
Examples of labeled diphone midpoints:
/ih-t/: +stress, phrase-internal, high F0, content word
/n-t/: -stress, phrase-final, high F0, function word
/dh-ax/: -stress, phrase-initial, low F0, word = "the"
The costs of different features carry different weights.

23. Target Costs
Comprised of p weighted subcosts:
Stress
Phrase position
F0
Phone duration
Lexical identity
Target cost for a unit: $C^{\mathrm{target}}(t_i, u_i) = \sum_{j=1}^{p} w_j\, C^{\mathrm{target}}_j(t_i, u_i)$
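As a toy instance of that weighted sum (the weights and the 0/1 feature-mismatch subcosts are invented for illustration):

```python
# Hypothetical weighted target cost: each subcost is a 0/1 mismatch
# between the target spec and the database unit's label.
WEIGHTS = {"stress": 1.0, "phrase_pos": 0.5, "f0": 0.7,
           "duration": 0.3, "word": 0.2}

def target_cost(target, unit):
    """Sum of weighted feature mismatches between spec and unit."""
    return sum(w * (target[f] != unit[f]) for f, w in WEIGHTS.items())

spec = {"stress": "+", "phrase_pos": "internal", "f0": "high",
        "duration": "long", "word": "content"}
unit = {"stress": "+", "phrase_pos": "final", "f0": "high",
        "duration": "long", "word": "content"}
print(target_cost(spec, unit))   # 0.5: only phrase position mismatches
```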

24. Join (Concatenation) Cost
A measure of the smoothness of the join between two database units u_i and u_j (the target is irrelevant here).
Defined by features, costs, and weights; comprised of p weighted subcosts:
Spectral features
F0
Energy
Join cost: $C^{\mathrm{join}}(u_i, u_j) = \sum_{k=1}^{p} w_k\, C^{\mathrm{join}}_k(u_i, u_j)$
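In the same spirit, a toy join cost; the feature names and weights are invented (real systems compare spectral, F0, and energy values at the boundary frames):

```python
import math

def join_cost(left, right, w_spec=1.0, w_f0=0.5, w_energy=0.3):
    """Toy smoothness measure across the boundary between two units."""
    spec   = math.dist(left["mfcc_end"], right["mfcc_start"])  # spectral gap
    f0     = abs(left["f0_end"] - right["f0_start"])
    energy = abs(left["energy_end"] - right["energy_start"])
    return w_spec * spec + w_f0 * f0 + w_energy * energy
```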

25. Total Costs
Hunt and Black 1996: we now have weights (per phone type) for the features relating targets and database units.
Find the path of units through the database that minimizes the total cost:
$\hat{u}_1^n = \operatorname*{arg\,min}_{u_1^n} \Big[ \sum_{i=1}^{n} C^{\mathrm{target}}(t_i, u_i) + \sum_{i=2}^{n} C^{\mathrm{join}}(u_{i-1}, u_i) \Big]$
A standard problem, solvable with Viterbi search using a beam-width constraint for pruning.
(Slide from Paul Taylor)
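A compact sketch of that search, reusing the hypothetical `target_cost` and `join_cost` functions from the previous two slides; `candidates[i]` lists the database units considered for target i:

```python
def viterbi_select(targets, candidates, target_cost, join_cost, beam=50):
    """Lowest-total-cost unit sequence via Viterbi with beam pruning."""
    # layers[i][u] = (best path cost ending in unit u, backpointer)
    layers = [{u: (target_cost(targets[0], u), None) for u in candidates[0]}]
    for i in range(1, len(targets)):
        prev = layers[-1]
        kept = sorted(prev, key=lambda u: prev[u][0])[:beam]  # beam pruning
        layer = {}
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            # Extend the cheapest surviving path with its join cost to u.
            layer[u] = min(((prev[v][0] + join_cost(v, u) + tc, v)
                            for v in kept), key=lambda x: x[0])
        layers.append(layer)
    # Trace back from the cheapest final unit.
    unit = min(layers[-1], key=lambda u: layers[-1][u][0])
    path = []
    for layer in reversed(layers):
        path.append(unit)
        unit = layer[unit][1]
    return path[::-1]
```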

26. Synthesizing...
[Figure]

27. Unit Selection Summary
Advantages:
Quality far superior to diphones: fewer joins and more choice of units.
Selected natural prosody sounds better.
Disadvantages:
Quality is very bad when there is no good match in the database.
HCI issue: a mix of very good and very bad output is quite annoying.
Synthesis is computationally expensive.
Prosody is hard to control: the diphone technique can vary emphasis, while unit selection can give a result that conveys the wrong meaning.

28. New Trend: HMM Synthesis
Hidden Markov Model synthesis:
Won a recent TTS bakeoff.
Sounds less natural to researchers, but naïve subjects preferred it.
Has the potential to improve over both diphone and unit selection synthesis.
Generates speech parameters from statistics trained on data.
Voice quality can easily be changed by transforming the HMM parameters.

29. From Concatenation to Generation

30. HMM Synthesis
A parametric model:
Can train on mixed data from many speakers.
The model takes up very little space.
Supports speaker adaptation.

31. HMMs
Some hidden process has generated some visible observation.

32. HMMs
Some hidden process has generated some visible observation.

33. HMMs
Hidden states have transition probabilities and emission probabilities.
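A toy generative HMM, with transition and emission probabilities invented for illustration: hidden states emit visible observations as they wander.

```python
import random

# Illustrative two-state HMM over coarse F0 observations.
transitions = {"s1": {"s1": 0.6, "s2": 0.4},
               "s2": {"s2": 0.7, "s1": 0.3}}
emissions   = {"s1": {"low_f0": 0.8, "high_f0": 0.2},
               "s2": {"low_f0": 0.1, "high_f0": 0.9}}

def sample(dist):
    return random.choices(list(dist), weights=dist.values())[0]

state, observations = "s1", []
for _ in range(5):
    observations.append(sample(emissions[state]))  # emit from current state
    state = sample(transitions[state])             # then transition
print(observations)   # e.g. ['low_f0', 'low_f0', 'high_f0', ...]
```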

34. HMM Synthesis
Every phoneme-plus-context is represented by an HMM.
"The cat is on the mat."
"The cat is near the door."
<phone=/th/, next_phone=/ax/, word='the', next_word='cat', num_syllables=6, ...>
Acoustic features are extracted (F0, spectrum, duration).
Train an HMM on these examples.

35. HMM Synthesis
Each state outputs acoustic features (a spectrum, an F0, and a duration).

36. HMM Synthesis
Each state outputs acoustic features (a spectrum, an F0, and a duration).

37. HMM Synthesis
Many contextual features mean data sparsity.
Solution: cluster similar-sounding phones.
E.g., 'bog' and 'dog': the /aa/ in each has similar acoustic features, even though their contexts differ a bit.
Make one HMM that produces both, trained on examples of both (sketched below).
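A toy version of that pooling step; the context questions and feature placeholders are invented (real systems grow a decision tree over many such yes/no context questions):

```python
from collections import defaultdict

# Examples whose contexts answer the same questions share one pool,
# so a single HMM is trained on all of them.
training_examples = [
    ({"phone": "aa", "prev_phone": "b", "stressed": True}, "bog_feats"),
    ({"phone": "aa", "prev_phone": "d", "stressed": True}, "dog_feats"),
]

def cluster_key(ctx):
    # Hypothetical questions: voiced-stop left context? stressed syllable?
    return (ctx["phone"],
            ctx["prev_phone"] in {"b", "d", "g"},
            ctx["stressed"])

pools = defaultdict(list)
for ctx, feats in training_examples:
    pools[cluster_key(ctx)].append(feats)
print(pools)   # the /aa/ of 'bog' and 'dog' share one pool -> one HMM
```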

38. Experiments: Google, Summer 2010
Can we train on lots of mixed data (~1 utterance per speaker)?
More data vs. better data.
Training data: 15k utterances from Google Voice Search, e.g., "ace hardware rural supply".

39. More Data vs. Better Data
Voice Search utterances filtered by speech recognition confidence scores:
50%: 6,849 utterances
75%: 4,887 utterances
90%: 3,100 utterances
95%: 2,010 utterances
99%: 200 utterances

40. Future Work
Speaker adaptation
Phonetically balanced training data
Listening experiments
Parallelization
Other sources of data
Voices for more languages

41. Reference
http://hts.sp.nitech.ac.jp

42. Tokuda et al. '02 [Figure]

43. HMM Synthesis
Audio comparisons:
Unit selection (Roger) vs. HMM (Roger)
Unit selection (Nina) vs. HMM (Nina)

44. TTS Evaluation
Intelligibility tests
Mean opinion scores
Preference tests

45. Intelligibility Tests
Diagnostic Rhyme Test (DRT): a listening test.
Listeners choose between two words differing in a single phonetic feature (voicing, nasality, sustention, sibilation).
The DRT uses 96 rhyming pairs: dense/tense, bond/pond, ...
The subject hears "dense" and chooses either "dense" or "tense".
The percentage of correct answers is the intelligibility score.
Problem: this only tests single-word synthesis.
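Scoring is just percent correct; a toy tally (the responses are invented):

```python
# Toy DRT scoring: percent correct on forced choices between
# rhyming pairs that differ in one phonetic feature.
responses = [("dense", "dense"), ("bond", "pond"), ("tense", "tense")]
correct = sum(heard == chosen for heard, chosen in responses)
print(f"{100 * correct / len(responses):.1f}% intelligibility")  # 66.7%
```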

46. Modified DRT
300 words in 50 sets of 6 (went, sent, bent, tent, dent, rent), embedded in carrier phrases: "Now we will say dense again."
Mean Opinion Score (MOS):
Have listeners rate the output on a scale from 1 (bad) to 5 (excellent).
Preference tests:
Listeners hear addresses or news text read aloud by two different systems, or by a system and a human voice, and state a preference (prefer A vs. prefer B).
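Both measures reduce to simple tallies; a toy computation (ratings and choices invented for illustration):

```python
# Toy MOS and preference-test tallies.
ratings = [4, 5, 3, 4, 4]                 # 1 (bad) .. 5 (excellent)
mos = sum(ratings) / len(ratings)         # mean opinion score = 4.0
choices = ["A", "B", "A", "A"]            # forced A/B preference
print(mos, choices.count("A") / len(choices))   # 4.0, 0.75 prefer A
```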

47. Next Class
Midterm on March 9.