Unlocking Audio/Video Content with Speech Recognition

Uploaded On 2018-03-22


Presentation Transcript

Slide1

Unlocking Audio/Video Content with Speech Recognition

Behrooz Chitsaz, Director, IP Strategy, Microsoft Research, behroozc@microsoft.com

Frank Seide, Lead Researcher, Microsoft Research, fseide@microsoft.com

Kit Thambiratnam, Researcher, Microsoft Research, kit@microsoft.com

Slide2

Division established in 1991

900+ Researchers in 2010

50+ areas of computing

Open Research culture

Impact on most Microsoft products

Microsoft Research

Slide3

Multimedia Research

Speech Search

Video summarization

Semantic extraction

Face identification

Object recognition

Visual search

3D Modeling

Slide4

Speech Applications

Speech as interface

Speech as 1st-class content

Mobile access

Search

Automation

PC application

Web service

Text input

Dictation

Indexing

Search

Metadata extraction

Advertising

Transcription

Meeting notes

Closed caption

Voicemail

Translation

Translating phone

Slide5

meta-data

surrounding & anchor text, URL

top-N lists, collaborative filtering

editorial meta-data

file content itself

keyword search in audio track using speech recognition

Searching Media Today

Slide6

Demo

Slide7

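“Hello World” → Spectral Analysis → $o_1 \ldots o_T$ → Matching (Decoding) → $\widehat{(w_1 \ldots w_N)}$

Matching (Decoding): time alignment, most likely hypothesis

$$\hat{W} = \operatorname*{arg\,max}_{w_1 \ldots w_N} \; p(o_1 \ldots o_T \mid w_1 \ldots w_N)\, P(w_1 \ldots w_N)$$

Acoustic Models: $p(o_{t_1} \ldots o_{t_2} \mid \text{phoneme})$

Dictionary: $P(\text{phonemes} \mid w)$

Grammar (Language Model): $P(w_1 \ldots w_N)$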
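The decoding rule on this slide picks the word sequence whose acoustic likelihood times language-model probability is largest. A minimal sketch of that selection, working in the log domain so the product becomes a sum, might look like the following. The candidate list and all scores are invented for illustration; a real recognizer searches a far larger hypothesis space rather than a short fixed list.

```python
def decode(candidates):
    """Return the word sequence maximizing log p(O|W) + log P(W)."""
    best = max(candidates, key=lambda c: c["acoustic_logp"] + c["lm_logp"])
    return best["words"]


# Hypothetical hypotheses with made-up natural-log scores.
candidates = [
    {"words": ("hello", "world"), "acoustic_logp": -12.0, "lm_logp": -3.0},
    {"words": ("hollow", "word"), "acoustic_logp": -11.5, "lm_logp": -9.0},
    {"words": ("yellow", "world"), "acoustic_logp": -14.0, "lm_logp": -6.0},
]

best = decode(candidates)
print(best)
```

In this toy data the second hypothesis has the best acoustic score, but the language model makes "hello world" the overall winner, which is the point of combining the two terms.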

Speech recognition

Slide8

speech recognition in a nutshell

Acoustic Models: $p(o_{t_1} \ldots o_{t_2} \mid \text{phoneme})$

Dictionary: $P(\text{phonemes} \mid w)$

Grammar (Language Model): $P(w_1 \ldots w_N)$

Speech recordings + full manual transcripts

Speech recognition

Slide9

Acoustic Models: $p(o_{t_1} \ldots o_{t_2} \mid \text{phoneme})$

Dictionary: $P(\text{phonemes} \mid w)$

Grammar (Language Model): $P(w_1 \ldots w_N)$

...

microscope m:s ay:n k:n r:n ax:n s:n k:n ow:n p:e

microsecond m:s ay:n k:n r:n ax:n s:n eh:n k:n ax:n n:n d:e

microsecond m:s ay:n k:n r:n ow:n s:n eh:n k:n ax:n n:n d:e

microsoft m:s ay:n k:n r:n ax:n s:n ao:n f:n t:e

microsoft m:s ay:n k:n r:n ow:n s:n ao:n f:n t:e
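The dictionary entries above map each word to one or more phoneme sequences, so a word such as "microsoft" can carry several pronunciations. A small sketch of loading such entries is shown below. It assumes each entry is a word followed by `phoneme:tag` units; the meaning of the tag after the colon is not stated on the slide, so the parser simply drops it. The function name `parse_lexicon` is hypothetical.

```python
from collections import defaultdict

LEXICON_TEXT = """\
microscope m:s ay:n k:n r:n ax:n s:n k:n ow:n p:e
microsecond m:s ay:n k:n r:n ax:n s:n eh:n k:n ax:n n:n d:e
microsecond m:s ay:n k:n r:n ow:n s:n eh:n k:n ax:n n:n d:e
microsoft m:s ay:n k:n r:n ax:n s:n ao:n f:n t:e
microsoft m:s ay:n k:n r:n ow:n s:n ao:n f:n t:e
"""


def parse_lexicon(text):
    """Map each word to a list of pronunciations (lists of phoneme symbols)."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        word, *units = line.split()
        # Keep the phoneme symbol, drop the tag after the colon.
        phones = [unit.split(":")[0] for unit in units]
        lexicon[word].append(phones)
    return dict(lexicon)


lexicon = parse_lexicon(LEXICON_TEXT)
for pronunciation in lexicon["microsoft"]:
    print(" ".join(pronunciation))
```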

Speech recognition

Slide10

Acoustic Models: $p(o_{t_1} \ldots o_{t_2} \mid \text{phoneme})$

Dictionary: $P(\text{phonemes} \mid w)$

Grammar (Language Model): $P(w_1 \ldots w_N)$

...

-0.8790 this is a

-2.3045 this is about

-3.1858 this is absolutely

-5.2820 this is accomplished

-1.9542 this is actually

...

-5.8492 is a barnyard

-5.1004 is a barometer

-4.2270 is a baseball

-5.4292 is a baseless

-4.4304 is a baseline
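The entries above pair a score with a three-word sequence. Assuming the scores are base-10 log probabilities of the third word given the two before it (a common convention for n-gram model files, but not stated on the slide), a sentence can be scored by summing the entries for its trigrams. The sketch below covers only trigrams present in the table; a real language model would also score the first two words and back off to shorter n-grams when a trigram is missing.

```python
# Trigram scores copied from the slide; interpreting them as
# log10 P(w3 | w1, w2) is an assumption.
TRIGRAM_LOG10 = {
    ("this", "is", "a"): -0.8790,
    ("this", "is", "about"): -2.3045,
    ("this", "is", "absolutely"): -3.1858,
    ("this", "is", "accomplished"): -5.2820,
    ("this", "is", "actually"): -1.9542,
    ("is", "a", "barnyard"): -5.8492,
    ("is", "a", "barometer"): -5.1004,
    ("is", "a", "baseball"): -4.2270,
    ("is", "a", "baseless"): -5.4292,
    ("is", "a", "baseline"): -4.4304,
}


def trigram_score(words, table):
    """Sum the log10 scores of every trigram in the word list."""
    total = 0.0
    for i in range(2, len(words)):
        total += table[tuple(words[i - 2:i + 1])]  # KeyError if unseen
    return total


baseball = trigram_score("this is a baseball".split(), TRIGRAM_LOG10)
barnyard = trigram_score("this is a barnyard".split(), TRIGRAM_LOG10)
print(baseball, barnyard)
```

Under this reading, "this is a baseball" scores -0.8790 + -4.2270 = -5.1060, higher than "this is a barnyard", so the decoder would prefer it when the acoustic evidence is otherwise equal.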

Speech recognition

Slide11

Challenges

Speaker accent

Background noise

Reverberation

Vocabulary

Language

Slide12

lattice-based indexing

“into this bank account”

Slide13

lattice-based indexing

“into this bank account”
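Lattice-based indexing stores the recognizer's alternative word candidates, each with a time and a confidence, instead of only the single best transcript. The sketch below builds a simple inverted index over such candidates. The arcs, times, and posterior values are invented for illustration around the example phrase, and `build_index` and `search` are hypothetical names, not part of any Microsoft API.

```python
from collections import defaultdict

# Hypothetical lattice arcs: (word, start_sec, end_sec, posterior).
LATTICE = [
    ("into", 0.00, 0.30, 0.70),
    ("in", 0.00, 0.15, 0.30),
    ("to", 0.15, 0.30, 0.30),
    ("this", 0.30, 0.55, 0.90),
    ("the", 0.30, 0.55, 0.10),
    ("bank", 0.55, 0.90, 0.80),
    ("banked", 0.55, 0.90, 0.20),
    ("account", 0.90, 1.40, 0.95),
]


def build_index(doc_id, arcs, index=None):
    """Add every lattice arc to an inverted index keyed by word."""
    index = defaultdict(list) if index is None else index
    for word, start, _end, posterior in arcs:
        index[word].append((doc_id, start, posterior))
    return index


def search(index, word, min_posterior=0.0):
    """Return (doc_id, start_sec, posterior) hits, most confident first."""
    hits = [h for h in index.get(word, []) if h[2] >= min_posterior]
    return sorted(hits, key=lambda h: h[2], reverse=True)


index = build_index("clip-1", LATTICE)
print(search(index, "banked"))        # found only because alternatives are indexed
print(search(index, "banked", 0.5))   # filtered out by the confidence threshold
```

Keeping the low-posterior alternatives lets a query match words the best transcript missed, the threshold trades those extra matches against false hits, and the stored start time is what lets a player jump to the match.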

expected benefits from indexing lattices:

alternative recognition candidates → recall++

confidence scores → precision++

(time information → user experience)

Slide14

[Diagram labels: Speech, Recognizer, Word statistics, Metadata, NP extraction, Web query builder, Queries, Bing Search, Docs, Base Dict, Base LM, Adapt Dictionary, Adapt Language Model, Adapted Dict, Adapted LM]

Vocabulary Adaptation

from NLC group

Slide15

Architectural decisions

High quality Speech Recognition is compute intensive

Use Azure for indexing

Media content could be anywhere

PowerShell tools to upload content

Customer should be able to own search experience

Easy integration with text search infrastructure

Integrate with SQL Server/SharePoint/FAST

Must support click to play

Silverlight supports accurate seeking

Slide16

Microsoft Azure

SQL Server(s)

1. Submit audio/video to index

2. Get back AIB

3. Import AIB in SQL

Web server(s)

Media server(s)

4. Search/Retrieve results

video RSS feed
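Steps 3 and 4 above import the returned index into a SQL database and then query it. The slides do not describe what an AIB contains or how it is imported, so the sketch below is only a stand-in: it uses Python's built-in `sqlite3` in place of SQL Server and assumes the index reduces to rows of media URL, word, start time, and confidence. The table layout, the example URLs, and the function names are all hypothetical.

```python
import sqlite3

# Hypothetical rows standing in for the contents of an imported index.
ROWS = [
    ("http://example.com/talk.wmv", "speech", 12.4, 0.91),
    ("http://example.com/talk.wmv", "recognition", 12.9, 0.88),
    ("http://example.com/keynote.wmv", "speech", 301.0, 0.42),
]


def open_index(rows):
    """Create an in-memory table and load the index rows (step 3)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits "
        "(media_url TEXT, word TEXT, start_sec REAL, confidence REAL)"
    )
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def search(conn, word, min_confidence=0.5):
    """Return (media_url, start_sec) pairs for a keyword (step 4)."""
    cursor = conn.execute(
        "SELECT media_url, start_sec FROM hits "
        "WHERE word = ? AND confidence >= ? ORDER BY confidence DESC",
        (word, min_confidence),
    )
    return cursor.fetchall()


conn = open_index(ROWS)
results = search(conn, "speech")
print(results)
```

Each result carries a media URL and a start offset, which is the information a player needs to support click to play.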

Azure integration

Slide17

Cloud computing made simple

Windows Azure + PowerShell = Cloud computing at your fingertips

Demo media content submission

Slide18

Microsoft Research

Tell us if you are interested

mmms@microsoft.com

Visit us:

http://research.microsoft.com/mavis

http://research.microsoft.com

http://twitter.com/MSFTResearch

http://www.facebook.com/microsoftresearch#

http://www.flickr.com/photos/msr_redmond/

Slide19

Thank you!

Questions?

Slide20

Slide21

© 2010 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.

The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.

MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.