1. Speech Recognition with CMU Sphinx
Srikar Nadipally, Hareesh Lingareddy
2. What is Speech Recognition?
A speech recognition system converts a speech signal into a textual representation.
3. Types of speech recognition
- Isolated words
- Connected words
- Continuous speech
- Spontaneous speech (automatic speech recognition)
- Voice verification and identification (used in biometrics)
Source: "Fundamentals of Speech Recognition", L. Rabiner & B. Juang, 1993
4. Speech Production and Perception Process
Source: "Fundamentals of Speech Recognition", L. Rabiner & B. Juang, 1993
5. Speech recognition: uses and applications
- Dictation
- Command and control
- Telephony
- Medical/disabilities
Source: "Fundamentals of Speech Recognition", L. Rabiner & B. Juang, 1993
6. Project at a Glance - Sphinx 4
Description
- Speech recognition written entirely in the Java(TM) programming language
- Based upon Sphinx developed at CMU: www.speech.cs.cmu.edu/sphinx/
Goals
- Highly flexible recognizer
- Performance equal to or exceeding Sphinx 3
- Collaborate with researchers at CMU and MERL, Sun Microsystems and others
7. How does a Recognizer Work?
8. Sphinx 4 Architecture
9. Front-End
- Transforms the speech waveform into features used by recognition
- Features are sets of mel-frequency cepstrum coefficients (MFCCs)
- MFCCs model the human auditory system
- The Front-End is a set of signal processing filters
- Pluggable architecture
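The idea of a front end built from pluggable signal processing filters can be sketched as a chain of functions. This is a minimal illustration, not the Sphinx 4 API: the stage names (`pre_emphasis`, `frame`), the parameter values, and `run_front_end` are all hypothetical, and the mel-scale formula shown is one commonly used approximation rather than something stated on the slide.

```python
import math

def pre_emphasis(samples, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1] (coeff is an assumed value)."""
    if not samples:
        return []
    rest = [samples[i] - coeff * samples[i - 1] for i in range(1, len(samples))]
    return [samples[0]] + rest

def frame(samples, size=4, step=2):
    """Split the signal into overlapping fixed-size frames (sizes are toy values)."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, step)]

def hz_to_mel(hz):
    """A commonly used mel-scale approximation; MFCC filter banks are spaced on this scale."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def run_front_end(samples, stages):
    """Pass the data through each stage in order; stages can be swapped or reordered."""
    data = samples
    for stage in stages:
        data = stage(data)
    return data

# Hypothetical usage: a constant toy signal through a two-stage pipeline.
frames = run_front_end([1.0] * 6, [pre_emphasis, frame])
```

A real MFCC front end would add windowing, an FFT, a mel filter bank, a logarithm and a discrete cosine transform as further stages in the same chain.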
10. Knowledge Base
- The data that drives the decoder
- Consists of three sets of data:
  - Dictionary
  - Acoustic Model
  - Language Model
- Needs to scale between the three application types
11. Dictionary
- Maps words to pronunciations
- Provides word classification information (such as part of speech)
- A single word may have multiple pronunciations
- Pronunciations are represented as phones or other units
- Can vary in size from a dozen words to more than 100,000 words
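A dictionary of this kind can be pictured as a mapping from each word to a list of phone sequences. The sketch below is illustrative only: the class name `PronunciationDictionary` is hypothetical, and the phone symbols are assumed ARPAbet-style labels rather than entries from an actual Sphinx dictionary.

```python
from collections import defaultdict

class PronunciationDictionary:
    """Maps a word to one or more pronunciations, each a tuple of phones."""

    def __init__(self):
        self._entries = defaultdict(list)

    def add(self, word, phones):
        # Words are stored lower-cased so that lookups are case-insensitive.
        self._entries[word.lower()].append(tuple(phones))

    def pronunciations(self, word):
        # .get avoids creating an empty entry for unknown words.
        return list(self._entries.get(word.lower(), []))

# Hypothetical entries: "the" has two pronunciation variants.
lexicon = PronunciationDictionary()
lexicon.add("the", ["DH", "AH"])
lexicon.add("the", ["DH", "IY"])
lexicon.add("dinner", ["D", "IH", "N", "ER"])
```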
12. Language Model
- Describes what is likely to be spoken in a particular context
- Uses a stochastic approach: word transitions are defined in terms of transition probabilities
- Helps to constrain the search space
13. Command Grammars
- Probabilistic context-free grammar
- Relatively small number of states
- Appropriate for command and control applications
14. N-gram Language Models
- The probability of word N depends on words N-1, N-2, ...
- Bigrams and trigrams are most commonly used
- Used for large vocabulary applications such as dictation
- Typically trained on a very large corpus (millions of words)
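As a small illustration of the bigram case, the sketch below estimates P(word | previous word) by counting word pairs in a toy corpus. It assumes plain maximum-likelihood estimates with no smoothing, which real language models would not rely on; the function name `train_bigram`, the sentence markers and the three-sentence corpus are invented for the example.

```python
from collections import Counter

def train_bigram(sentences):
    """Return a function prob(prev, word) giving the unsmoothed bigram estimate."""
    history_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        # <s> and </s> mark the sentence start and end.
        words = ["<s>"] + sentence.split() + ["</s>"]
        history_counts.update(words[:-1])
        pair_counts.update(zip(words[:-1], words[1:]))

    def prob(prev, word):
        if history_counts[prev] == 0:
            return 0.0
        return pair_counts[(prev, word)] / history_counts[prev]

    return prob

# Toy corpus (assumed); real models are trained on millions of words.
corpus = [
    "alice washed the dishes",
    "dan washed the car",
    "alice washed the car",
]
p = train_bigram(corpus)
```

Here `p("the", "car")` is the count of "the car" divided by the count of "the" as a history word.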
15. Acoustic Models
- Database of statistical models
- Each statistical model represents a single unit of speech such as a word or phoneme
- Acoustic Models are created/trained by analyzing large corpora of labeled speech
- Acoustic Models can be speaker dependent or speaker independent
16. What is a Markov Model?
A Markov Model (Markov Chain) is similar to a finite-state automaton, with probabilities of transitioning from one state to another.
- Transitions from state to state occur at discrete time intervals
- The model can be in only one state at any given time
[Figure: state diagram with states S1 to S5 and transition probabilities 0.5, 0.5, 0.3, 0.7, 0.1, 0.9, 0.8, 0.2, 1.0]
Source: http://www.cslu.ogi.edu/people/hosom/cs552/
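A Markov chain can be written down as a table of transition probabilities, and the probability of a state sequence is the product of the transitions along it. The three-state topology below is a hypothetical example and does not reproduce the slide's five-state diagram.

```python
# Hypothetical chain: each row lists the outgoing transitions of one state
# and sums to 1.0.
TRANSITIONS = {
    "S1": {"S1": 0.5, "S2": 0.5},
    "S2": {"S2": 0.3, "S3": 0.7},
    "S3": {"S3": 1.0},
}

def sequence_probability(states, transitions):
    """Multiply the transition probabilities along a path of states."""
    prob = 1.0
    for current, nxt in zip(states, states[1:]):
        # A transition that is not listed has probability 0.
        prob *= transitions[current].get(nxt, 0.0)
    return prob

path_prob = sequence_probability(["S1", "S1", "S2", "S3"], TRANSITIONS)
```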
17. Hidden Markov Model
- Hidden Markov Models (HMMs) represent each unit of speech in the Acoustic Model
- HMMs are used by a scorer to calculate the acoustic probability for a particular unit of speech
- A typical HMM uses 3 states to model a single context-dependent phoneme
- Each state of an HMM is represented by a set of Gaussian mixture density functions
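One way to picture how a scorer obtains an acoustic probability from a 3-state HMM is the forward algorithm over a left-to-right topology. The sketch below is a simplified assumption, not Sphinx 4's scorer: per-frame state likelihoods are passed in directly (in practice they would come from Gaussian mixtures), every state shares one assumed self-loop probability, and the function name `forward_likelihood` is hypothetical.

```python
def forward_likelihood(emissions, self_loop=0.6):
    """Likelihood of the frames under a left-to-right HMM, ending in the last state.

    emissions[t][s] is the likelihood of frame t under state s.
    Each state either repeats (self_loop) or advances to the next state.
    """
    n_states = len(emissions[0])
    advance = 1.0 - self_loop
    # The model must start in the first state.
    alpha = [emissions[0][0]] + [0.0] * (n_states - 1)
    for frame in emissions[1:]:
        new_alpha = []
        for s in range(n_states):
            stay = alpha[s] * self_loop
            enter = alpha[s - 1] * advance if s > 0 else 0.0
            new_alpha.append((stay + enter) * frame[s])
        alpha = new_alpha
    return alpha[-1]

# Three frames with uninformative emissions: the only path that reaches the
# last state is 0 -> 1 -> 2.
likelihood = forward_likelihood([[1.0, 1.0, 1.0]] * 3, self_loop=0.5)
```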
18. Gaussian Mixtures
- Gaussian mixture density functions represent each state in an HMM
- Each set of Gaussian mixtures is called a senone
- HMMs can share senones
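A Gaussian mixture density is a weighted sum of Gaussian densities, and senone sharing means several HMM states point at the same mixture. The sketch below uses one-dimensional features for brevity, whereas real feature vectors are multi-dimensional; the senone names, HMM names and parameter values are all made up for illustration.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def mixture_density(x, components):
    """components is a list of (weight, mean, variance); weights sum to 1."""
    return sum(weight * gaussian_pdf(x, mean, variance)
               for weight, mean, variance in components)

# Hypothetical senone table: each senone is one set of mixture components.
SENONES = {
    "senone_a": [(0.6, 0.0, 1.0), (0.4, 3.0, 0.5)],
    "senone_b": [(1.0, -2.0, 2.0)],
}

# Two hypothetical 3-state HMMs that reuse the same senones.
HMM_STATES = {
    "hmm_1": ["senone_a", "senone_b", "senone_b"],
    "hmm_2": ["senone_a", "senone_a", "senone_b"],
}

def state_score(hmm, state_index, x):
    """Score a feature value against the senone tied to the given HMM state."""
    return mixture_density(x, SENONES[HMM_STATES[hmm][state_index]])
```

Because both HMMs refer to `senone_a` in their first state, the mixture parameters are stored and trained once.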
19. The Decoder
20. Decoder - heart of the recognizer
- Selects next set of likely states
- Scores incoming features against these states
- Prunes low scoring states
- Generates results
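The pruning step can be illustrated with beam pruning, one common way to discard low scoring states. The slide does not say which pruning rule is used, so the rule, the function name `prune` and its parameters are assumptions made for this sketch.

```python
def prune(hypotheses, beam_width, max_active):
    """Keep states whose log score is within beam_width of the best score,
    and at most max_active of them.

    hypotheses maps a state identifier to its log score (higher is better).
    """
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    survivors = {state: score for state, score in hypotheses.items()
                 if score >= best - beam_width}
    ranked = sorted(survivors.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:max_active])

# Toy scores: state "c" falls outside the beam and is dropped.
active = prune({"a": -1.0, "b": -2.0, "c": -10.0}, beam_width=5.0, max_active=10)
```

A wider beam keeps more hypotheses alive, which tends to cost more time and lose fewer correct paths.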
21. Decoder Overview
22. Selecting next set of states
- Uses Grammar to select next set of possible words
- Uses dictionary to collect pronunciations for words
- Uses Acoustic Model to collect HMMs for each pronunciation
- Uses transition probabilities in HMMs to select next set of states
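The first three steps of this expansion (grammar to words, words to pronunciations, pronunciations to HMM states) can be sketched with plain lookups. Everything named here is hypothetical: the toy grammar, the phone labels, and the assumption of three states per phone. The last step, following transition probabilities inside the HMMs, is not shown.

```python
# Toy command grammar: which words may follow which (assumed example).
GRAMMAR = {
    "<start>": ["open", "close"],
    "open": ["door"],
    "close": ["door"],
    "door": [],
}

# Toy dictionary: each word has a list of pronunciation variants.
DICTIONARY = {
    "open": [["OW", "P", "AH", "N"]],
    "close": [["K", "L", "OW", "Z"]],
    "door": [["D", "AO", "R"]],
}

STATES_PER_PHONE = 3  # assumed HMM size per phone

def next_word_states(previous_word):
    """List the HMM states reachable through the words the grammar allows next."""
    states = []
    for word in GRAMMAR.get(previous_word, []):
        for variant, phones in enumerate(DICTIONARY[word]):
            for position, phone in enumerate(phones):
                for state in range(STATES_PER_PHONE):
                    states.append((word, variant, position, phone, state))
    return states

candidates = next_word_states("<start>")
```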
23. Sentence HMMs
24. Sentence HMMs
25. Search Strategies: Tree Search
Can reduce the number of connections by taking advantage of redundancy in the beginnings of words (with a large enough vocabulary).
[Figure: phone tree in which words with shared initial phones share branches; leaf labels include washed, dan, dishes, after, dinner, car, alice, lunch, the]
Source: http://www.cslu.ogi.edu/people/hosom/cs552/
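The saving from shared word beginnings can be seen by building a prefix tree over pronunciations and counting its nodes. The three-word lexicon and its phone labels below are assumptions chosen for illustration and do not match the symbol set in the slide's figure.

```python
def build_prefix_tree(lexicon):
    """Build a nested-dict tree keyed by phone; '#word' marks the end of a word."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.setdefault(phone, {})
        node["#word"] = word
    return root

def count_phone_nodes(node):
    """Count phone nodes in the tree, skipping the '#word' markers."""
    return sum(1 + count_phone_nodes(child)
               for key, child in node.items() if key != "#word")

# Hypothetical lexicon: all three words begin with the phone D.
LEXICON = {
    "dan": ["D", "AE", "N"],
    "dinner": ["D", "IH", "N", "ER"],
    "dishes": ["D", "IH", "SH", "IH", "Z"],
}

tree = build_prefix_tree(LEXICON)
flat_count = sum(len(phones) for phones in LEXICON.values())
tree_count = count_phone_nodes(tree)
```

In this toy case the flat layout needs 12 phone nodes and the tree needs 9, because the initial D is shared by all three words and the IH after it by two of them.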
26. Questions?
27. References
- CMU Sphinx Project, http://cmusphinx.sourceforge.net
- VoxForge Project, http://www.voxforge.org