Speech Recognition with CMU Sphinx

Uploaded On 2023-09-24

Presentation Transcript

1. Speech Recognition with CMU Sphinx
   Srikar Nadipally, Hareesh Lingareddy

2. What is Speech Recognition?
   A speech recognition system converts a speech signal into a textual representation.

3. Types of speech recognition
   - Isolated words
   - Connected words
   - Continuous speech
   - Spontaneous speech (automatic speech recognition)
   - Voice verification and identification (used in biometrics)
   Source: "Fundamentals of Speech Recognition", L. Rabiner & B. Juang, 1993

4. Speech Production and Perception Process
   Source: "Fundamentals of Speech Recognition", L. Rabiner & B. Juang, 1993

5. Speech recognition: uses and applications
   - Dictation
   - Command and control
   - Telephony
   - Medical/disabilities
   Source: "Fundamentals of Speech Recognition", L. Rabiner & B. Juang, 1993

6. Project at a Glance - Sphinx 4
   Description
   - Speech recognizer written entirely in the Java™ programming language
   - Based upon Sphinx, developed at CMU: www.speech.cs.cmu.edu/sphinx/
   Goals
   - Highly flexible recognizer
   - Performance equal to or exceeding Sphinx 3
   - Collaborate with researchers at CMU and MERL, Sun Microsystems and others

7. How does a Recognizer Work?

8. Sphinx 4 Architecture

9. Front-End
   - Transforms the speech waveform into features used by recognition
   - Features are sets of mel-frequency cepstral coefficients (MFCCs)
   - MFCCs model the human auditory system
   - The Front-End is a set of signal processing filters
   - Pluggable architecture
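The "mel" in MFCC refers to a frequency scale that is roughly linear at low frequencies and logarithmic at high ones. The sketch below shows only that warping step, using the widely quoted 2595 * log10(1 + f/700) formula; it is not Sphinx 4's front end, and the function names and the 0 to 8000 Hz range are assumptions chosen for illustration.

```python
import math

def hz_to_mel(hz: float) -> float:
    """Convert a frequency in Hz to mels (2595 * log10(1 + f/700) variant)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_centers(low_hz: float, high_hz: float, num_filters: int) -> list:
    """Center frequencies (Hz) of filters spaced evenly on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]

# Hypothetical setup: 10 filters between 0 Hz and 8000 Hz.
centers = mel_filter_centers(0.0, 8000.0, 10)
for center in centers:
    print(f"{center:8.1f} Hz")
```

The printed centers are packed closely at low frequencies and spread out at high ones, which is the sense in which the features follow human hearing.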

10. Knowledge Base
   - The data that drives the decoder
   - Consists of three sets of data:
     - Dictionary
     - Acoustic Model
     - Language Model
   - Needs to scale between the three application types

11. Dictionary
   - Maps words to pronunciations
   - Provides word classification information (such as part of speech)
   - A single word may have multiple pronunciations
   - Pronunciations are represented as phones or other units
   - Can vary in size from a dozen words to >100,000 words
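A pronunciation dictionary of this kind can be pictured as a map from each word to a list of phone sequences. The sketch below is a minimal illustration: the entries and ARPAbet-style phone symbols are hypothetical and are not read from a real Sphinx dictionary file.

```python
from collections import defaultdict

# Hypothetical entries: (word, space-separated phones). Not from a real dictionary file.
RAW_ENTRIES = [
    ("the", "DH AH"),
    ("the", "DH IY"),       # second pronunciation of the same word
    ("dishes", "D IH SH IH Z"),
    ("dinner", "D IH N ER"),
]

def build_dictionary(entries):
    """Group pronunciations by word; each pronunciation is a tuple of phones."""
    lexicon = defaultdict(list)
    for word, phones in entries:
        lexicon[word.lower()].append(tuple(phones.split()))
    return dict(lexicon)

lexicon = build_dictionary(RAW_ENTRIES)
for word, pronunciations in lexicon.items():
    print(word, pronunciations)
```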

12. Language Model
   - Describes what is likely to be spoken in a particular context
   - Uses a stochastic approach: word transitions are defined in terms of transition probabilities
   - Helps to constrain the search space

13. Command Grammars
   - Probabilistic context-free grammar
   - Relatively small number of states
   - Appropriate for command and control applications
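As a rough illustration of a probabilistic command grammar, the sketch below enumerates every sentence of a tiny, non-recursive grammar together with its probability. The rules, words and probabilities are invented for the example, and the dictionary-of-rules layout is an assumption rather than any grammar format used by Sphinx.

```python
from itertools import product

# Hypothetical grammar: nonterminal -> list of (rule probability, right-hand side).
GRAMMAR = {
    "COMMAND": [(0.7, ["ACTION", "DEVICE"]), (0.3, ["stop"])],
    "ACTION": [(0.5, ["open"]), (0.5, ["close"])],
    "DEVICE": [(0.6, ["the", "door"]), (0.4, ["the", "window"])],
}

def expand(symbol):
    """Return {word sequence: probability}; only valid for non-recursive grammars."""
    if symbol not in GRAMMAR:
        return {(symbol,): 1.0}  # terminal word
    result = {}
    for rule_prob, rhs in GRAMMAR[symbol]:
        parts = [expand(s) for s in rhs]
        # Combine one expansion of each right-hand-side symbol.
        for combo in product(*(part.items() for part in parts)):
            words = tuple(w for seq, _ in combo for w in seq)
            prob = rule_prob
            for _, p in combo:
                prob *= p
            result[words] = result.get(words, 0.0) + prob
    return result

sentences = expand("COMMAND")
for words, prob in sorted(sentences.items(), key=lambda item: -item[1]):
    print(f"{prob:.2f}  {' '.join(words)}")
```

Because the whole language is only five sentences, the search space the decoder has to consider stays very small, which is why such grammars suit command and control.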

14. N-gram Language Models
   - Probability of word N depends on words N-1, N-2, ...
   - Bigrams and trigrams are most commonly used
   - Used for large-vocabulary applications such as dictation
   - Typically trained on a very large corpus (millions of words)
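The bigram case can be sketched with plain counting. The example below estimates P(word | previous word) by maximum likelihood from a three-sentence toy corpus; the corpus and function names are assumptions, and production language models add smoothing or backoff, which is omitted here.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams, with <s> and </s> as sentence boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])              # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen contexts."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

# Toy corpus, invented for illustration.
corpus = ["alice washed the dishes", "dan washed the car", "alice washed the car"]
unigrams, bigrams = train_bigram(corpus)
print(bigram_prob(unigrams, bigrams, "the", "car"))
print(bigram_prob(unigrams, bigrams, "the", "dishes"))
```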

15. Acoustic Models
   - Database of statistical models
   - Each statistical model represents a single unit of speech such as a word or phoneme
   - Acoustic Models are created/trained by analyzing large corpora of labeled speech
   - Acoustic Models can be speaker dependent or speaker independent

16. What is a Markov Model?
   A Markov Model (Markov Chain) is:
   - similar to a finite-state automaton, with probabilities of transitioning from one state to another
   - transitions from state to state at discrete time intervals
   - can only be in 1 state at any given time
   [Figure: Markov chain over states S1 to S5; arc labels 0.5, 0.5, 0.3, 0.7, 0.1, 0.9, 0.8, 0.2, 1.0]
   Source: http://www.cslu.ogi.edu/people/hosom/cs552/
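A Markov chain can be written down as a table of transition probabilities, one row per state. The sketch below uses a hypothetical three-state chain (it does not reproduce the topology of the slide's figure) and computes the probability of following a given state sequence.

```python
# Hypothetical chain: state -> {next state: transition probability}.
TRANSITIONS = {
    "S1": {"S1": 0.5, "S2": 0.5},
    "S2": {"S2": 0.3, "S3": 0.7},
    "S3": {"S3": 1.0},
}

def validate(transitions):
    """Each row must be a probability distribution over next states."""
    for state, row in transitions.items():
        if abs(sum(row.values()) - 1.0) > 1e-9:
            raise ValueError(f"outgoing probabilities of {state} do not sum to 1")

def sequence_probability(transitions, states):
    """Probability of the transitions along `states`, given the first state."""
    prob = 1.0
    for current, nxt in zip(states, states[1:]):
        prob *= transitions[current].get(nxt, 0.0)
    return prob

validate(TRANSITIONS)
print(sequence_probability(TRANSITIONS, ["S1", "S1", "S2", "S3"]))
```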

17. Hidden Markov Model
   - Hidden Markov Models (HMMs) represent each unit of speech in the Acoustic Model
   - HMMs are used by a scorer to calculate the acoustic probability for a particular unit of speech
   - A typical HMM uses 3 states to model a single context-dependent phoneme
   - Each state of an HMM is represented by a set of Gaussian mixture density functions
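One standard way to compute the probability an HMM assigns to an observation sequence is the forward algorithm. The sketch below applies it to a hypothetical 3-state left-to-right HMM. To stay short it swaps the Gaussian mixture emissions for a discrete two-symbol alphabet, and it does not require the sequence to end in the last state; all parameters are invented.

```python
STATES = [0, 1, 2]
INITIAL = [1.0, 0.0, 0.0]            # always start in the first state
TRANS = [
    [0.6, 0.4, 0.0],                 # left-to-right: stay or move one state forward
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
# Discrete emissions stand in for Gaussian mixtures (an assumption of this sketch).
EMIT = [
    {"a": 0.8, "b": 0.2},
    {"a": 0.3, "b": 0.7},
    {"a": 0.5, "b": 0.5},
]

def forward(observations):
    """Return P(observations | model), summed over all state paths."""
    alpha = [INITIAL[s] * EMIT[s][observations[0]] for s in STATES]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * TRANS[p][s] for p in STATES) * EMIT[s][obs]
            for s in STATES
        ]
    return sum(alpha)

print(forward(["a", "b"]))
```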

18. Gaussian Mixtures
   - Gaussian mixture density functions represent each state in an HMM
   - Each set of Gaussian mixtures is called a senone
   - HMMs can share senones
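Scoring a feature vector against one state amounts to evaluating a Gaussian mixture density. The sketch below assumes diagonal covariances and works in the log domain with the log-sum-exp trick; the two-component mixture and its parameters are hypothetical and far smaller than a trained senone.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, components):
    """components: list of (weight, mean vector, variance vector)."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components]
    peak = max(logs)
    # log-sum-exp avoids underflow when the individual densities are tiny
    return peak + math.log(sum(math.exp(value - peak) for value in logs))

# Hypothetical 2-component mixture over 2-dimensional features.
senone = [
    (0.6, [0.0, 0.0], [1.0, 1.0]),
    (0.4, [3.0, 3.0], [0.5, 0.5]),
]
print(gmm_log_likelihood([0.1, -0.2], senone))
```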

19. The Decoder

20. Decoder - heart of the recognizer
   - Selects next set of likely states
   - Scores incoming features against these states
   - Prunes low-scoring states
   - Generates results
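The four steps above resemble a Viterbi search with beam pruning. The sketch below is a generic illustration of that loop over a small hypothetical HMM with discrete observations; it is not Sphinx 4's decoder, and the `beam` parameter and model values are assumptions.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    return math.log(p) if p > 0 else NEG_INF

def decode(observations, initial, trans, emit, beam=5.0):
    """Viterbi search with beam pruning; returns (best log score, best state path)."""
    active = {
        s: (safe_log(p) + safe_log(emit[s][observations[0]]), [s])
        for s, p in initial.items() if p > 0
    }
    for obs in observations[1:]:
        candidates = {}
        for state, (score, path) in active.items():
            for nxt, p in trans[state].items():                   # select next states
                new_score = score + safe_log(p) + safe_log(emit[nxt][obs])  # score
                if nxt not in candidates or new_score > candidates[nxt][0]:
                    candidates[nxt] = (new_score, path + [nxt])
        best = max(score for score, _ in candidates.values())
        # prune: keep only states within `beam` (log units) of the best score
        active = {s: v for s, v in candidates.items() if v[0] >= best - beam}
    return max(active.values(), key=lambda v: v[0])               # generate result

# Hypothetical 3-state left-to-right model with a two-symbol alphabet.
INITIAL = {"s0": 1.0}
TRANS = {"s0": {"s0": 0.6, "s1": 0.4}, "s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s2": 1.0}}
EMIT = {"s0": {"a": 0.8, "b": 0.2}, "s1": {"a": 0.3, "b": 0.7}, "s2": {"a": 0.5, "b": 0.5}}

score, path = decode(["a", "a", "b", "b"], INITIAL, TRANS, EMIT)
print(path, math.exp(score))
```

A narrower beam keeps fewer states alive and runs faster, at the risk of pruning the path that would have turned out best.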

21. Decoder Overview

22. Selecting next set of states
   - Uses the Grammar to select the next set of possible words
   - Uses the dictionary to collect pronunciations for the words
   - Uses the Acoustic Model to collect HMMs for each pronunciation
   - Uses transition probabilities in the HMMs to select the next set of states
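The first three lookups can be sketched as a chain of table accesses: grammar to words, words to pronunciations, pronunciations to HMM states. Everything below is hypothetical (the tables, the state naming, and the use of context-independent phones with three states each); it only illustrates the order of the lookups.

```python
# Hypothetical knowledge base, invented for illustration.
GRAMMAR_SUCCESSORS = {"<s>": ["the"], "the": ["dishes", "car"]}
DICTIONARY = {
    "the": [["DH", "AH"], ["DH", "IY"]],
    "dishes": [["D", "IH", "SH", "IH", "Z"]],
    "car": [["K", "AA", "R"]],
}
STATES_PER_PHONE = 3  # assumption: one 3-state HMM per phone, ignoring context

def phone_hmm_states(phone):
    """Stand-in for an acoustic model lookup: names of the phone's HMM states."""
    return [f"{phone}_{i}" for i in range(STATES_PER_PHONE)]

def expand_next_words(previous_word):
    """Return (word, pronunciation, HMM state names) for every allowed next word."""
    expansions = []
    for word in GRAMMAR_SUCCESSORS.get(previous_word, []):   # grammar
        for pronunciation in DICTIONARY[word]:                # dictionary
            states = [s for phone in pronunciation for s in phone_hmm_states(phone)]
            expansions.append((word, tuple(pronunciation), states))
    return expansions

for word, pronunciation, states in expand_next_words("the"):
    print(word, pronunciation, len(states), "states")
```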

23. Sentence HMMs

24. Sentence HMMs

25. Search Strategies: Tree Search
   Can reduce the number of connections by taking advantage of redundancy in the beginnings of words (with a large enough vocabulary).
   [Figure: phone tree whose branches end in t (washed), n (dan), z (dishes), er (after), er (dinner), r (car), s (alice), ch (lunch), ax (the)]
   Source: http://www.cslu.ogi.edu/people/hosom/cs552/
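The saving from shared word beginnings can be seen by storing pronunciations in a prefix tree. The sketch below builds such a tree from four words and compares its node count with a flat layout; the pronunciations are illustrative guesses, not entries from a real dictionary.

```python
WORDS_KEY = "#words"  # marker key holding the words that end at a node

def build_lexical_tree(lexicon):
    """Build a prefix tree (nested dicts) keyed by phone."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.setdefault(phone, {})
        node.setdefault(WORDS_KEY, []).append(word)
    return root

def count_phone_nodes(node):
    """Number of phone nodes in the tree, excluding the root."""
    return sum(
        1 + count_phone_nodes(child)
        for key, child in node.items() if key != WORDS_KEY
    )

# Illustrative pronunciations (assumed, not from a real dictionary).
LEXICON = {
    "dan": ("D", "AE", "N"),
    "dinner": ("D", "IH", "N", "ER"),
    "dishes": ("D", "IH", "SH", "IH", "Z"),
    "the": ("DH", "AX"),
}

tree = build_lexical_tree(LEXICON)
flat_count = sum(len(phones) for phones in LEXICON.values())
tree_count = count_phone_nodes(tree)
print(f"flat: {flat_count} phone nodes, tree: {tree_count} phone nodes")
```

With only four words the reduction is small; the slide's point is that the shared prefixes grow as the vocabulary does.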

26. Questions??

27. References
   - CMU Sphinx Project, http://cmusphinx.sourceforge.net
   - VoxForge Project, http://www.voxforge.org