Interactively Skimming Recorded Speech

… in partial fulfillment of the requirements for the degree of … at the …

Certified by: …

© 1994 Massachusetts Institute of Technology.

Thesis Supervisor: Christopher M. Schmandt

Doctoral Dissertation Committee
Thesis Reader: Nathaniel I. Durlach

Acknowledgments

Chris Schmandt and I have been building interactive speech systems … me to think about how people operate and how we can get machines to … humans, machines, and interfaces should work in the future. Eric Hulteen, from Apple Computer, critiqued many of my early ideas and helped shape their final form. Lisa Stifelman provided valuable input in user interface design, assisted … Meg Withgott lent her help and expertise on pitch, emphasis, and … (when I) needed it most) beyond the call of duty. I hope to repay the favor … Steve Benton provided behind the scenes backing and a helping hand (or … Gayle Sherman provided advice, signatures, and assistance in managing … Doug Reynolds ran his speaker identification software on my recordings. Don Hejna provided the SOLAFS implementation. Jeff Herman … and William Stasior for their assistance, advice, and for letting me use … Jonathan Cohen and Gitta Salomon provided important help along the way. Walter Bender, Janet Cahn, George Furnas, Andrew Kass, Paul … When I completed my Master of Science degree in 1984 the idea of going on to get a doctorate was completely out of the question, so I …

Contents

Doctoral Dissertation Committee
1 Motivation and Related Work
1.1 Defining the Title
1.2 Introduction
1.2.1 The Problem: Current User Scenarios
1.2.1.1 Searching for Audio in Video
1.2.1.2 Lecture Retrieval
1.2.2 Speech Is Important
1.2.3 Speech Storage
1.2.4 A Non-Visual Interface
1.2.5 Dissertation Goal
1.3 Skimming this Document
1.4 Related Work
1.4.1 Segmentation
1.4.2 Speech Skimming and Gisting
1.4.3 Speech and Auditory Interfaces
1.5 A Taxonomy of Recorded Speech
1.6 Input (Information Gathering) Techniques
1.6.1 Explicit
1.6.2 Conversational
1.6.3 Implicit
1.6.3.1 Audio and Stroke Synchronization
1.7 Output (Presentation) Techniques
1.7.1 Interaction
1.7.2 Audio Presentation
1.8 Summary
2 Hyperspeech: An Experiment in Explicit Structure
2.1 Introduction
2.1.1 Application Areas
2.1.2 Related Work: Speech and Hypermedia Systems
2.2 System Description
2.2.1 The Database
2.2.2 The Links
2.2.3 Hardware Platform
2.2.4 Software Architecture
2.3 User Interface Design
2.3.1 Version 1
2.3.2 Version 2
2.4 Lessons Learned on Skimming and Navigating
2.4.1 Correlating Text with Recordings
2.4.2 Automated Approaches to Authoring
2.5 Thoughts on Future Enhancements
2.5.1 Command Extensions
2.5.2 Audio Effects
2.6 Summary
3 Time Compression of Speech
3.1 Introduction
3.1.1 Time Compression Considerations
3.1.2 A Note on Compression Figures
3.2 General Time Compression Techniques
3.2.1 Speaking Rapidly
3.2.2 Speed Changing
3.2.3 Speech Synthesis
3.2.4 Vocoding
3.2.5 Pause Removal
3.3 Time Domain Techniques
3.3.1 Sampling
3.3.2 Sampling with Dichotic Presentation
3.3.3 Selective Sampling
3.3.4 Synchronized Overlap Add Method
3.4 Frequency Domain Techniques
3.4.1 Harmonic Compression
3.4.2 Phase Vocoding
3.5 Tools for Exploring the Sampling Technique
3.6 Combined Time Compression Techniques
3.6.1 Pause Removal and Sampling
3.6.2 Silence Removal and SOLA
3.6.3 Dichotic SOLA Presentation
3.7 Perception of Time-Compressed Speech
3.7.1 Intelligibility versus Comprehension
3.7.2 Limits of Compression
3.7.3 Training Effects
3.7.4 The Importance of Pauses
3.8 Summary
4 Adaptive Speech Detection
4.1 Introduction
4.2 Basic Techniques
4.3 Pause Detection for Recording
4.3.1 Speech Group Empirical Approach: Schmandt
4.3.2 Improved Speech Group Algorithm: Arons
4.3.3 Fast Energy Calculations: Maxemchuk
4.3.4 Adding More Speech Metrics: Gan
4.4 End-point Detection
4.4.1 Early End-pointing: Rabiner
4.4.2 A Statistical Approach: de Souza
4.4.3 Smoothed Histograms: Lamel et al.
4.4.4 Signal Difference Histograms: Hess
4.4.5 Conversational Speech Production Rules: Lynch et al.
4.5 Speech Interpolation Systems
4.5.1 Short-term Energy Variations: Yatsuzuka
4.5.2 Use of Speech Envelope: Drago et al.
4.5.3 Fast Trigger and Gaussian Noise: Jankowski
4.6 Adapting to the User's Speaking Style
4.7 Summary
5 SpeechSkimmer
5.1 Introduction
5.2 Time Compression and Skimming
5.3 Skimming Levels
5.3.1 Skimming Backward
5.4 Jumping
5.5 Interaction Mappings
5.6 Interaction Devices
5.7 Touchpad Configuration
5.8 Non-Speech Audio Feedback
5.9 Acoustically Based Segmentation
5.9.1 Recording Issues
5.9.2 Processing Issues
5.9.3 Speech Detection for Segmentation
5.9.4 Pause-based Segmentation
5.9.5 Pitch-based Emphasis Detection for Segmentation
5.10 Usability Testing
5.10.1 Method
5.10.1.1 Subjects
5.10.1.2 Procedure
5.10.2 Results and Discussion
5.10.2.1 Background Interviews
5.10.2.2 First Intuitions
5.10.2.3 Warm-up Task
5.10.2.4 Skimming
5.10.2.5 No Pause
5.10.2.6 Jumping
5.10.2.7 Backward
5.10.2.8 Time Compression
5.10.2.9 Buttons
5.10.2.10 Non-Speech Feedback
5.10.2.11 Search Strategies
5.10.2.12 Follow-up Questions
5.10.2.13 Desired Functionality
5.10.3 Thoughts for Redesign
5.10.4 Comments on Usability Testing
5.11 Software Architecture
5.12 Use of SpeechSkimmer with BBC Radio Recordings
5.13 Summary
6 Conclusion
6.1 Evaluation of the Segmentation
6.2 Future Research
6.3 Evaluation of Segmentation Techniques
6.3.1 Combining SpeechSkimmer with a Graphical Interface
6.3.2 Segmentation by Speaker Identification
6.3.3 Segmentation by Word Spotting
6.4 Summary

List of Figures

Fig. 1-1. A consumer answering machine with time compression.
Fig. 1-2. A close-up view of the digital message shuttle.
Fig. 1-3. A view of the categories in the speech taxonomy.
Fig. 2-1. The "footmouse" built and used for workstation-based transcription.
Fig. 2-2. Side view of the "footmouse."
Fig. 2-3. A graphical representation of the nodes in the database.
Fig. 2-4. Graphical representation of all links in the database (version 2).
Fig. 2-5. Hyperspeech hardware configuration.
Fig. 2-6. Command vocabulary of the Hyperspeech system.
Fig. 2-7. A sample Hyperspeech dialog.
Fig. 2-8. An interactive repair.
Fig. 3-1. Sampling terminology.
Fig. 3-2. Sampling techniques.
Fig. 3-3. Synchronized overlap add (SOLA) method.
Fig. 3-4. Parameters used in the sampling tool.
Fig. 4-1. Time-domain speech metrics for frames N samples long.
Fig. 4-2. Threshold used in Schmandt algorithm.
Fig. 4-3. Threshold values for a typical recording.
Fig. 4-4. Recording with speech during initial frames.
Fig. 4-5. Energy histograms with 10 ms frames.
Fig. 4-6. Energy histograms with 100 ms frames.
Fig. 4-7. Energy histograms of speech from four different talkers.
Fig. 4-8. Signal and differenced signal magnitude histograms.
Fig. 4-9. Hangover and fill-in.
Fig. 5-1. Block diagram of the interaction cycle of the speech skimming system.
Fig. 5-2. Ranges and techniques of time compression and skimming.
Fig. 5-3. Schematic representation of time compression and skimming ranges.
Fig. 5-4. The hierarchical "fish ear" time-scale continuum.
Fig. 5-5. Speech and silence segments played at each skimming level.
Fig. 5-6. Schematic representation of two-dimensional control regions.
Fig. 5-7. Schematic representation of one-dimensional control regions.
Fig. 5-8. Photograph of the thumb-operated trackball tested with SpeechSkimmer.
Fig. 5-9. Photograph of the joystick tested with SpeechSkimmer.
Fig. 5-10. The touchpad with paper guides for tactile feedback.
Fig. 5-11. Template used in the touchpad.
Fig. 5-12. Mapping of the touchpad control to the time compression range.
Fig. 5-13. A 3-D plot of average magnitude and zero crossing rate histogram.
Fig. 5-14. Average magnitude histogram showing a bimodal distribution.
Fig. 5-15. Histogram of noisy recording.
Fig. 5-16. Sample speech detector output.
Fig. 5-17. Sample segmentation output.
Fig. 5-18. F0 plot of a monologue from a male talker.
Fig. 5-19. Close-up of F0 plot in figure 5-18.
Fig. 5-20. Pitch histogram for 40 seconds of a monologue from a male talker.
Fig. 5-21. Comparison of three F0 metrics.
Fig. 5-22. Counterbalancing of experimental conditions.
Fig. 5-23. Evolution of SpeechSkimmer templates.
Fig. 5-24. Sketch of a revised skimming template.
Fig. 5-25. A jog and shuttle input device.
Fig. 5-26. Software architecture of the skimming system.

1 Motivation and Related Work

• The importance of time when listening to and navigating in …
• Techniques to provide multi-level structural representations of the …
• User interfaces for accessing speech recordings.

1.1 Defining the Title

… on speech-message information retrieval is Interactively Skimming Recorded Speech, … format of the storage are not important, although a random access digital … Skimming means "to remove the best … contents from" and "to pass lightly or hastily: glide or skip along, above, or near a surface" (Webster …). Scanning is similar to skimming in many … of speech are selected and played through the mutual actions of the …

1.2 Introduction

"… our needs and rhythms. … Our ear, and with it our whole imaginative apparatus, marches in …" (Birkerts 1993, 111)

1.2.1 The Problem: Current User Scenarios

• lectures and interviews on microcassette
• voice mail
• tutorial and motivational material on audio cassette
• conference proceedings on tape
• books on tape
• time-shifted, or personalized, radio and television programs
• story segments for radio shows
• using the audio track for editing a video tape story line
• information gathering by law enforcement agencies

1.2.1.1 Searching for Audio in Video

… it is convenient to replay it in meaningful units. In a medium such as videotape, this can be …

1.2.1.2 Lecture Retrieval

1.2.2 Speech Is Important

1.2.3 Speech Storage

1.2.4 A Non-Visual Interface

Speech is fundamental for human communication, yet this medium is difficult to skim, browse, and navigate because of the transient nature of … fundamental problem. This research therefore concentrates on … that do not use a display or a keyboard, but take advantage of the audio channel. A graphical user interface may make some speech …

[Footnote: A highly trained specialist can slowly "read" spectrograms; however, this approach is …]

1.2.5 Dissertation Goal

1.3 Skimming this Document

… uttered, special steps have to be taken to refer to it again, unlike visually presented information …

• Casual readers will find chapter 1, particularly section 1.2, most … However, they were developed where needed.
• Chapter 2 describes Hyperspeech, a preliminary investigation of …
• Chapter 3 reviews methods to time compress speech, including …
• Chapter 4 reviews techniques of finding speech versus …
• Chapter 5 describes SpeechSkimmer, a user interface for …
• Chapter 6 discusses the contributions of the research, and areas …

1.4 Related Work

1.4.1 Segmentation

1.4.2 Speech Skimming and Gisting

… or "gist" of incoming messages.

• Text descriptors can be associated with points in a speech …
• While playing back a speech message it is possible to jump …
• Finally, the playback rate can be increased. When the highest …

… TeX or LaTeX formatting … sentence (assumed to be the "topic" sentence) would be synthesized.

Consumer products have begun to appear with rudimentary speech skimming features. Figures 1-1 and 1-2 show a telephone answering …

Fig. 1-1. A consumer answering machine with time compression.
Fig. 1-2. A close-up view of the digital message shuttle.

1.4.3 Speech and Auditory Interfaces

1.5 A Taxonomy of Recorded Speech

• least structured
• user present
• unknown number of talkers
• variable noise
• all of the remaining items in this list

• more interactive than a lecture
• more talkers than a lecture
• may be a written agenda
• user may have been present or participated
• user may have taken notes

• typically a monologue
• organization may be more structured than a meeting
• may be a written outline or lecture notes
• user may have been present
• user may have taken notes

Fig. 1-3. A view of the categories in the speech taxonomy. [Categories include phone call, voice message, and voice notes; self-authored, other person, other people.]

• single talker
• user authored
• can be explicitly categorized by user (considered as long voice …)
• different media than face-to-face communication

• well defined beginning and ending
• typically two talkers
• well documented speaking and hesitation characteristics (Brady …)
• user participation in conversation
• can differentiate caller from callee (Hindus 1993)
• consistent audio quality within calls

• single talker per message
• typically short
• typically contain similar types of information
• user not present
• consistent audio quality within messages

• single talker
• typically short notes
• authored by user
• consistent audio quality

1.6 Input (Information Gathering) Techniques

… sketching and writing, conventional keyword searches of the meeting are not possible. The …

1.6.1 Explicit

1.6.2 Conversational

1.6.3 Implicit

Direction sensing microphones. … These techniques appear quite powerful; however, there are many times where it is useful to retrieve information from a recording created in a …

1.6.3.1 Audio and Stroke Synchronization

1.7 Output (Presentation) Techniques

1.7.1 Interaction

… you can tell whether the sound is a news anchorperson, a talk show, or music.
What is really …

1.7.2 Audio Presentation

… navigational information in a speech-only interface: (1) the use of non- …

1.8 Summary

2 Hyperspeech: An Experiment in Explicit Structure

2.1 Introduction

2.1.1 Application Areas

2.1.2 Related Work: Speech and Hypermedia Systems

2.2 System Description

2.2.1 The Database

The interviewees and their affiliations at the time were: Cecil Bloch (Somosomo … The interviews were deliberately kept short; the total time for each automated interview was roughly five minutes. … using a conventional text editor while simultaneously controlling audio … major themes (summary nodes) with supporting comments (detail …

[Footnote: Note that videotaping similar interviews for a video hypermedia system …]
[Footnote: … intended only for the author of the database, can bias a user of the system, forcing a …]

Fig. 2-1. The "footmouse" built and used for workstation-based transcription.
Fig. 2-2. Side view of the "footmouse." Note the small screws used to depress …

[Footnote: … forward word command (Meta-F) moved forward a small amount (250 ms), and the …]

Fig. 2-3. A graphical representation of the nodes in the database. Detail nodes …

2.2.2 The Links

There are roughly equal numbers of summary nodes and detail nodes. … the brain and send information into the brain" are opposed to Bloch's related view that ends "and that, frankly, makes my blood run cold." … new link types were added (there are over 750 links in the current …

[Footnote: In the remainder of the chapter, references to nodes and links do not include responses …]

Fig. 2-4. Graphical representation of all links in the database (version 2). Note …

2.2.3 Hardware Platform

[Footnote: A more appropriate authoring tool would provide a better layout of the links and visual …]

Fig. 2-5. Hyperspeech hardware configuration. [Components include a text-to-speech synthesizer, a SparcStation, and computer control …]

2.2.4 Software Architecture

… without having to modify the underlying software system. This data- …

2.3 User Interface Design

[Footnote: Excluding extensive library routines and drivers that control the speech I/O devices.]

2.3.1 Version 1

2.3.2 Version 2

… automatically jumps between the dozen or so nodes that provide a high- …

Fig. 2-6. Command vocabulary of the Hyperspeech system. [Table columns: link type, command, description; utilities, command, description (e.g., help).]

Fig. 2-7. A sample Hyperspeech dialog. [Table columns: talker, utterance, comments.]
  Minsky: "What I think will happen over the next … direct coupling between the nervous …"
  Bloch: "In …"
  Weitzman: "I would hope that we never do achieve …"
  Weitzman: "We'll always be able to improve on it, …" (interrupt to get …)
  Synthesizer: "This is Louie Weitzman on the future of …"
  Weitzman: "… we are going to learn new things and …"
  Vertelney: "I think it's like back in the …" (not of interest)
  Weitzman: "We'll always be able to …" (Weitzman again; what's Minsky's …)
  Minsky: "And when it becomes smart enough we …"

Fig. 2-8. An interactive repair. [Table columns: talker, utterance, description of action.]
  Synthesizer: "supporting" … "The interfa…" (incorrect sound is started)
  "Weitzman" (echo of correctly recognized word)
  Weitzman: "I hope we never do …"
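The database just described is data-driven: nodes hold recorded excerpts and typed links connect them, so new behavior can be added without modifying the underlying software. The following is a minimal sketch of that structure only, not the thesis's actual data format; the node contents, link types, and sound file names are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One recorded interview excerpt in a Hyperspeech-style database."""
    talker: str
    sound: str                # path to the audio excerpt (invented names)
    kind: str                 # "summary" or "detail"
    links: dict = field(default_factory=dict)   # link type -> Node

def follow(node: Node, command: str) -> Node:
    """Map a recognized voice command to a typed link; stay on the
    current node if it has no outgoing link of that type."""
    return node.links.get(command, node)

# A two-node fragment: a summary node with one supporting detail node.
minsky = Node("Minsky", "minsky1.snd", "summary")
bloch = Node("Bloch", "bloch1.snd", "detail")
minsky.links["more"] = bloch      # summary -> supporting comment
bloch.links["browse"] = minsky    # back out to the summary level
```

Because navigation is just a lookup keyed by link type, adding a new command amounts to adding entries to the link tables rather than changing the traversal code.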
2.4 Lessons Learned on Skimming and Navigating

The design of this system is based on allowing the user to actively drive … another (e.g., recognition to playback) must be designed for low system … Hyperspeech-like systems have the added complication of the serial and … One solution to managing speech recordings is to use traditional text (or hypertext) tools to manipulate transcriptions. Unfortunately, the …

[Footnote: Substitution error: a word was spoken, but a different word was recognized.]
[Footnote: … ambiguity, and be brief.]

2.4.1 Correlating Text with Recordings

2.4.2 Automated Approaches to Authoring

2.5 Thoughts on Future Enhancements

2.5.1 Command Extensions

2.5.2 Audio Effects

[Footnote: … with links. This may not present a problem if such nodes are sparsely distributed … product idea …]

2.6 Summary

3 Time Compression of Speech

"… experienced speaker can talk." (Smith 1970, 219)

3.1 Introduction

3.1.1 Time Compression Considerations

• The type of speech material to be compressed: content, language, …
• The process of compression: algorithm, monophonic or …
• The listener: prior training, intelligence, listening task, etc.

• Is the material familiar or self-authored, or is it unfamiliar to the …
• Does the recorded material consist of many short items, or large …
• Is the user quickly browsing or listening for maximum …

3.1.2 A Note on Compression Figures

3.2 General Time Compression Techniques

3.2.1 Speaking Rapidly

… relative attributes of their speech such as pause durations, consonant- …

[Footnote: … clocked speaking at a rate of 586 wpm. Mr. Moschitta is best known for his roles as the …]

3.2.2 Speed Changing

3.2.3 Speech Synthesis

3.2.4 Vocoding

3.2.5 Pause Removal

3.3 Time Domain Techniques

3.3.1 Sampling

Fig. 3-1. Sampling terminology (after Fairbanks 1957). [Labels include the sampling interval Is and the discard interval Id.]

Fig. 3-2. (A) is the original signal; the numbered regions represent short (e.g., … [Panels: (A) original signal, (B) interrupted signal, (C) sampling method 1; left and right ear.]

3.3.2 Sampling with Dichotic Presentation

… acoustic world; only one voice per speaker. Sampling with dichotic … diotic …

3.3.3 Selective Sampling

… does not require pitch marking, it is more robust than pitch- …

3.3.4 Synchronized Overlap Add Method

… and Wilgus (Roucos 1985) has recently become popular in computer- … Combining the frames in this manner tends to preserve the time- … simple and effective as it does not require pitch extraction, frequency- …

Fig. 3-3. SOLA: shifting the two speech segments (as in figure 3-2) to find the …
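Because SOLA needs only waveform cross-correlation, no pitch extraction, and no frequency-domain processing, it is compact to implement. Below is a minimal sketch of the shift-and-overlap-add idea, not the SOLAFS code used in the thesis; the frame, overlap, and search sizes are illustrative values, not the parameters actually used.

```python
import numpy as np

def sola_compress(x, rate=2.0, frame=800, overlap=200, search=100):
    """Time-compress x by `rate`: copy successive input frames to the
    output at a smaller hop, sliding each frame by up to `search`
    samples to maximize cross-correlation with the existing output
    tail, then cross-fade the overlapping region."""
    hop_in = frame - overlap              # analysis hop in the input
    hop_out = int(hop_in / rate)          # smaller synthesis hop => compression
    out = list(x[:frame].astype(float))
    pos_out = hop_out
    for pos_in in range(hop_in, len(x) - frame, hop_in):
        seg = x[pos_in:pos_in + frame].astype(float)
        best_k, best_c = 0, -np.inf
        for k in range(-min(search, pos_out), search):
            tail = np.array(out[pos_out + k : pos_out + k + overlap])
            if len(tail) < overlap:       # ran off the end of the output
                break
            c = np.dot(tail, seg[:overlap])
            if c > best_c:                # best-aligned lag so far
                best_c, best_k = c, k
        p = pos_out + best_k
        fade = np.linspace(0.0, 1.0, overlap)   # linear cross-fade
        for i in range(overlap):
            out[p + i] = out[p + i] * (1 - fade[i]) + seg[i] * fade[i]
        out[p + overlap:] = []            # trim, then append the new frame tail
        out.extend(seg[overlap:])
        pos_out = p + hop_out
    return np.array(out)
```

With rate=2.0 the output is roughly half the input duration; the correlation search is what preserves pitch continuity that plain sampling would destroy.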
3.4 Frequency Domain Techniques

3.4.1 Harmonic Compression

3.4.2 Phase Vocoding

3.5 Tools for Exploring the Sampling Technique

Fig. 3-4. Parameters used in the sampling tool. In (A) the sampling interval Is … [Panels (A) and (B); parameters include the chunk size.]

3.6 Combined Time Compression Techniques

3.6.1 Pause Removal and Sampling

3.6.2 Silence Removal and SOLA

3.6.3 Dichotic SOLA Presentation

3.7 Perception of Time-Compressed Speech

3.7.1 Intelligibility versus Comprehension

3.7.2 Limits of Compression

… speech, see Beasley 1976). … significantly better than 25% compression presented dichotically, even … 275 wpm, but more rapidly beyond that point (Foulke 1969). The decline … Note that in much of the literature the limiting factor that is often cited is … Foulke and Sticht permitted sighted college students to select a preferred degree of time compression for speech spoken at an original rate of 175 … With greater than 50% compression, critical non-redundant information …

3.7.3 Training Effects

Even naive listeners can tolerate compressions of up to 50%, and with 8– …

3.7.4 The Importance of Pauses

… recordings at any point, and were instructed to repeat what they had … factors to a theoretical maximum segment size which could … a considerable effect on the speed and accuracy with which sentences … 330 ms pause was inserted ungrammatically, response time for a … Maxemchuk found that eliminating hesitation intervals decreased … or "speech unit." In one study of spontaneous speech, the mean speech … Theories about memory suggest that large-capacity rapid-decay sensory storage is followed by limited capacity perceptual memory. Studies have …

3.8 Summary

4 Adaptive Speech Detection

4.1 Introduction

4.2 Basic Techniques

[Footnote: … operated switch" and abbreviated VOX.]

• energy or magnitude
• zero crossing rate (ZCR)
• one sample delay autocorrelation coefficient
• the first LPC predictor coefficient
• LPC prediction error energy

energy = Σᵢ₌₁ᴺ (x[i])²

ZCR = ½ Σᵢ₌₁ᴺ | sgn(x[i]) − sgn(x[i−1]) |,  where sgn(x[i]) = 1 if x[i] ≥ 0, and −1 otherwise

Fig. 4-1. Time-domain speech metrics for frames N samples long. A high zero-crossing rate indicates low energy fricative sounds such as "s" and "f." For example, a ZCR greater than 2500 crossings/s indicates the presence of a fricative …
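For illustration, these per-frame metrics reduce to a few lines of code. The sketch below follows the definitions above; the function names and the scaling of ZCR to crossings per second are mine, added so the 2500 crossings/s rule of thumb can be applied directly.

```python
import numpy as np

def frame_metrics(frame, sample_rate):
    """Compute the time-domain metrics of figure 4-1 for one frame.
    Returns (energy, average magnitude, zero crossings per second)."""
    x = np.asarray(frame, dtype=float)
    energy = np.sum(x ** 2)
    avg_mag = np.mean(np.abs(x))
    sgn = np.where(x >= 0, 1, -1)            # sgn(x[i]) as defined above
    crossings = np.sum(np.abs(np.diff(sgn))) / 2
    zcr = crossings * sample_rate / len(x)   # crossings per second
    return energy, avg_mag, zcr

def frames(signal, n):
    """Split a signal into consecutive non-overlapping frames of N samples."""
    for i in range(0, len(signal) - n + 1, n):
        yield signal[i:i + n]
```

Energy and magnitude separate loud voiced speech from background noise, while the ZCR catches the low-energy fricatives that an energy threshold alone would miss.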
4.3 Pause Detection for Recording

4.3.1 Speech Group Empirical Approach: Schmandt

Fig. 4-2. Threshold used in Schmandt algorithm. [Plot of speech and silence thresholds; slope = 1.5 and slope = 2.]

• The mapping function between the noise threshold and speech …
• The algorithm assumes semi-stationary background noise, and …
• Since the noise threshold is determined on-the-fly, the algorithm …

4.3.2 Improved Speech Group Algorithm: Arons

• On a Sun SparcStation, RMS energy is calculated in real time …
• On Apple Macintoshes, an average magnitude measure is used.

Fig. 4-3. Threshold values for a typical recording. Note that the threshold falls by … [Plot: energy (dB) vs. frames (100 ms); noise threshold, speech data, speech threshold.]

Fig. 4-4. Recording with speech during initial frames. A "drop" in the noise … [Plot: energy (dB) vs. frames (100 ms).]

4.3.3 Fast Energy Calculations: Maxemchuk

4.3.4 Adding More Speech Metrics: Gan

• the zero crossing threshold between speech and silence
• the minimum continuous amount of time needed for a segment to …
• the amplitude threshold for determining a silence-to-speech …
• the amplitude threshold for determining a speech-to-silence …

4.4 End-point Detection

4.4.1 Early End-pointing: Rabiner

4.4.2 A Statistical Approach: de Souza

4.4.3 Smoothed Histograms: Lamel et al.

… measure to a normal distribution. …

Fig. 4-5. Energy histograms with 10 ms frames. Bottom graph has been … [Percent of frames vs. energy (dB); Schmandt and Lamel thresholds marked; bottom graph uses a 3-point average.]

[Footnote: Other algorithms described in this chapter are better at adapting to faster changes in …]

Fig. 4-6. Energy histograms with 100 ms frames. Bottom graph has been … [Bottom graph uses a 3-point average.]

Fig. 4-7. Energy histograms of speech from four different talkers recorded over … [Percent of frames vs. energy (dB), 10 ms frames.]

… detection, and develops an improved hybrid approach that combines end- … bottom-up algorithm works well in stationary noise with high signal-to- …

4.4.4 Signal Difference Histograms: Hess

[Footnote: This section is included in the end-pointing portion of this chapter because this earlier paper ties in closely with the histograms used by Lamel et al. (section 4.4.3).]

… histogram was also made of the magnitude of the differenced signal:

differenced magnitude = Σᵢ₌₁ᴺ | x[i] − x[i−1] |

Fig. 4-8. Signal and differenced signal magnitude histograms. Note that the … [Panels mark weak fricatives, nasals, and vowels, and the speech thresholds Ts and Td.]

4.4.5 Conversational Speech Production Rules: Lynch et al.

4.5 Speech Interpolation Systems

Fig. 4-9. Hangover and fill-in (after Gruber 1983). [Rows: talkspurts; talkspurts with fill-in; talkspurts delayed by fill-in; talkspurts with hangover.]

[Footnote: The freeze-out fraction is typically designed to be less than 0.5 percent.]
[Footnote: Speech activity is the ratio of talkspurt time to total time.]

4.5.1 Short-term Energy Variations: Yatsuzuka

4.5.2 Use of Speech Envelope: Drago et al.

4.5.3 Fast Trigger and Gaussian Noise: Jankowski

1. the noise threshold;
2. the speech threshold (7 quantizing steps above the noise level);
3. a threshold that disables noise adaptation during speech.

4.6 Adapting to the User's Speaking Style

4.7 Summary
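To pull the chapter's recurring elements together, the sketch below combines a histogram-derived adaptive threshold (in the spirit of the Lamel et al. and bimodal-histogram discussions above) with the hangover and fill-in smoothing of figure 4-9. It is a composite illustration, not any one published algorithm; the bin count, 6 dB offset, and frame counts are invented defaults.

```python
import numpy as np

def detect_speech(frames_db, offset_db=6.0, hangover=4, fill_in=2):
    """Classify frame energies (dB) as speech (True) or silence (False).
    Assumes the energy histogram is roughly bimodal and that its
    largest bin approximates the background noise level; speech is
    anything sufficiently above that. `fill_in` bridges short gaps
    inside talkspurts; `hangover` extends each talkspurt slightly."""
    frames_db = np.asarray(frames_db, dtype=float)
    hist, edges = np.histogram(frames_db, bins=40)
    noise_db = edges[np.argmax(hist)]          # dominant (noise) mode
    raw = frames_db > noise_db + offset_db     # initial classification
    out = raw.copy()
    gap = 0                                    # fill-in pass
    for i, is_speech in enumerate(raw):
        if not is_speech:
            gap += 1
        else:
            if 0 < gap <= fill_in and i - gap > 0:
                out[i - gap:i] = True          # bridge the short silence
            gap = 0
    smoothed = out.copy()                      # hangover pass
    last = -(hangover + 1)
    for i in range(len(out)):
        if out[i]:
            last = i
        elif i - last <= hangover:
            smoothed[i] = True                 # keep speech on briefly
    return smoothed
```

The histogram step is what makes the detector adaptive: the same code works across recordings with different noise floors because the threshold is relative to each recording's own dominant energy mode.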
5 SpeechSkimmer

"He would have no opportunity to re-listen, to add redundancy by repetition, as he can by re- …"

5.1 Introduction

… levels of detail. User interaction through a manual input device provides … Speech also contains redundant information; high-level syntactic and … speech into semantically meaningful chunks. The recordings are then … When searching for information visually, we tend to refine our search over time, looking successively at more detail. For example, we may … to control the auditory presentation through a simple interaction … information in speech interfaces by combining information from multiple … presenting short segments of speech under the user's control. Note that …

Fig. 5-1. Block diagram of the interaction cycle of the speech skimming system. [Stages: user input, input processing, time compression, presentation, interaction.]

5.2 Time Compression and Skimming

Fig. 5-2. Ranges and techniques of time compression and skimming. [Ranges: (1) normal, (2) time-compressed, (3) skimming.]

… is to present all the non-redundant speech information in the signal. The …

Fig. 5-3. Schematic representation of time compression and skimming ranges.

5.3 Skimming Levels

Fig. 5-4. The hierarchical "fish ear" time-scale continuum. Each level in the …

Fig. 5-5. Speech and silence segments played at each skimming level. The gray … [Levels: 1, unprocessed; 2, pause shortening; 3, pause-based skimming; 4, pitch-based skimming.]

5.3.1 Skimming Backward

… Beatles' Abbey Road album backward. …

[Footnote: Note that "pitch range" is often used to mean the range above the talker's baseline pitch (i.e., the talker's lowest F0 for all speech).]
[Footnote: This is analogous to taking "this is a test" and presenting it as "tset a is siht."]

5.4 Jumping

5.5 Interaction Mappings

Fig. 5-6. Schematic representation of two-dimensional control regions. Vertical … [Regions: level 1, level 2, level 3; time compression axis.]

Fig. 5-7. Schematic representation of one-dimensional control regions.

5.6 Interaction Devices

Fig. 5-8. Photograph of the thumb-operated trackball tested with SpeechSkimmer.

Fig. 5-9. Photograph of the joystick tested with SpeechSkimmer.

5.7 Touchpad Configuration

Fig. 5-10. The touchpad with paper guides for tactile feedback.

Fig. 5-11. Template used in the touchpad. The dashed lines indicate the location … [Regions: skim, no pause, normal, pause, jump; begin and end.]

Fig. 5-12. Mapping of the touchpad control to the time compression range. [Range roughly 0.6 (slow) to 2.4.]

5.8 Non-Speech Audio Feedback

… is active at a time. …

5.9 Acoustically Based Segmentation

[Footnote: The amount of feedback is user configurable.]

5.9.1 Recording Issues

5.9.2 Processing Issues

5.9.3 Speech Detection for Segmentation

Fig. 5-13. A 3-D plot of average magnitude and zero crossing rate histogram. [Axes: average magnitude (dB), ZCR × 100, percent of frames.]

Fig. 5-14. Average magnitude histogram showing a bimodal distribution. The …

Fig. 5-15. Histogram of noisy recording. Note that there are no zero or low- …

… been successfully segmented into speech and background noise. This pre- …

Fig. 5-16. Sample speech detector output. Columns are: start time, stop time, … [Numeric rows omitted.]

5.9.4 Pause-based Segmentation

[Footnote: The duration field is not necessary, but has been found useful for visual debugging.]

Fig. 5-17. Sample segmentation output. Columns are: start time, stop time, … [Numeric rows omitted.]
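A pause-based segmenter in the spirit of section 5.9.4 can be driven directly by the speech-detector output above: a new playable segment begins after every sufficiently long pause. The sketch below is a hedged illustration; the 500 ms pause threshold and 100 ms frame size are example values, and the thesis's segmentation files carry additional columns beyond the start and stop times emitted here.

```python
def pause_segments(speech, frame_ms=100, min_pause_ms=500):
    """Turn a per-frame speech/silence vector (e.g., the output of
    detect_speech()) into (start_ms, stop_ms) segments, closing a
    segment whenever a pause of at least `min_pause_ms` occurs."""
    min_gap = min_pause_ms // frame_ms
    segments, start, gap = [], None, 0
    for i, is_speech in enumerate(speech):
        if is_speech:
            if start is None:
                start = i            # first speech frame of a segment
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:       # long pause: close the segment
                segments.append((start * frame_ms, (i - gap + 1) * frame_ms))
                start, gap = None, 0
    if start is not None:            # flush a segment still open at the end
        segments.append((start * frame_ms, len(speech) * frame_ms))
    return segments
```

Skimming levels can then be built on top of this list, for example by playing only the first part of each segment, since long pauses tend to precede new topics.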
Figure 5- Pitch is used differently in other languages, particularly Òtone languagesÓ where pitch 116Chapter 5 track is not continuous; pitch can only be calculated for vowels andhand-marked log. The statistics were gathered over one second windowsthreshold were most highly correlated with the hand-marked data. The 010203040Fig. 5-18.F0 plot of a monologue from a male talker. Note that the area near 30 SpeechSkimmer117 particular talker. A histogram of the F0 data is used, and a threshold ischosen to select the top 1% of the pitch frames (figure 5-20). 303234Fig. 5-19.Close-up of F0 plot in figure 5-18. 50100150200250 50100150200250Fig. 5-20.Pitch histogram for 40 seconds of a monologue from a male talker. find a larger or smaller number of emphasized segments. 118Chapter 5 0.21Standard Deviation (normalized) 0.21 0.20.40.60.81 50100150Fig. 5-21.Comparison of three F0 metrics.5.10Usability Testing Committee on the Use of Humans as Experimental Subjects (application number 2132). SpeechSkimmer119 5.10.1Method5.10.1.1Subjects5.10.1.2Procedure However, if a subject said something like ÒI wish it did X,Ó and the system did performthat function, the feature was revealed to them by the interviewer through directed 120Chapter 5 # of subjectsfirst presentationsecond presentation 3pitch-basedpart 1isochronouspart 2 3isochronouspart 1pitch-basedpart 2 3isochronouspart 2pitch-basedpart 1 3pitch-basedpart 2isochronouspart 1 Fig. 5-22.Counterbalancing of experimental conditions.5.10.2Results and Discussion The recording is of Nicholas NegroponteÕs ÒPerspectives Speaker SeriesÓ talk titled SpeechSkimmer121 5.10.2.1Background Interviews5.10.2.2First Intuitions5.10.2.3Warm-up Task 122Chapter 5 ÓÒcontent or keyword searching going onÓ than in the isochronoussegmentation.A few of the subjects requested that longer segments be played (perhapsuntil the next pause), or that the length of the segments could becontrollable. One subject said ÒI felt like I was missing a lot of his mainideas, since it would start to say one, and then jump.ÓThe subjects were asked to rank the skimming performance under thedifferent segmentation conditions. A score of 7 indicates the bestpossible summary of high-level ideas, a score of 1 indicates very poorlyselected segments. The mean score for the pitch-based segmentation wasMÊ=Ê4.5 (SDÊ=Ê1.7, NÊ=Ê12); the mean score for the isochronoussegmentation was MÊ=Ê2.7 (SDÊ=Ê1.4, NÊ=Ê12). The pitch-based skimmingwas rated better than isochronous skimming with a statistical significanceof pÊÊ(t test for paired samples). No statistically significantdifference was found on how subjects rated the first versus the secondpart of the talk, or on how subjects rated the first versus second soundpresented.Most of the subjects, including the few that did not think the pitch-basedskimming gave a good summary, used the skimming level to navigatethrough the recording. When asked to find the answer to a specificquestion, most started off by saying something like ÒIÕll go the beginningand skim till I get to the right topic area in the recording,Ó or in somecases ÒI think its near the end, so IÕll jump to the end and skimbackward.Ó5.10.2.5No SpeechSkimmer123 [since you normally have to] remember when he said [the point ofobvious, but subjects who did not initially understand what the button didOne subject had a moment of inspiration while skimming along at a highspeed, and tried the button after passing the point of interest. 
5.10 Usability Testing

[Footnote: … Committee on the Use of Humans as Experimental Subjects (application number 2132).]

5.10.1 Method

5.10.1.1 Subjects

5.10.1.2 Procedure

[Footnote: However, if a subject said something like "I wish it did X," and the system did perform that function, the feature was revealed to them by the interviewer through directed …]

Fig. 5-22. Counterbalancing of experimental conditions. [Columns: number of subjects, first presentation, second presentation.]
  3 subjects: pitch-based, part 1; then isochronous, part 2
  3 subjects: isochronous, part 1; then pitch-based, part 2
  3 subjects: isochronous, part 2; then pitch-based, part 1
  3 subjects: pitch-based, part 2; then isochronous, part 1

5.10.2 Results and Discussion

[Footnote: The recording is of Nicholas Negroponte's "Perspectives Speaker Series" talk titled …]

5.10.2.1 Background Interviews

5.10.2.2 First Intuitions

5.10.2.3 Warm-up Task

5.10.2.4 Skimming

… "content or keyword searching going on" than in the isochronous segmentation.

A few of the subjects requested that longer segments be played (perhaps until the next pause), or that the length of the segments could be controllable. One subject said "I felt like I was missing a lot of his main ideas, since it would start to say one, and then jump."

The subjects were asked to rank the skimming performance under the different segmentation conditions. A score of 7 indicates the best possible summary of high-level ideas; a score of 1 indicates very poorly selected segments. The mean score for the pitch-based segmentation was M = 4.5 (SD = 1.7, N = 12); the mean score for the isochronous segmentation was M = 2.7 (SD = 1.4, N = 12). The pitch-based skimming was rated better than isochronous skimming with a statistical significance of p … (t test for paired samples). No statistically significant difference was found on how subjects rated the first versus the second part of the talk, or on how subjects rated the first versus second sound presented.

Most of the subjects, including the few that did not think the pitch-based skimming gave a good summary, used the skimming level to navigate through the recording. When asked to find the answer to a specific question, most started off by saying something like "I'll go to the beginning and skim till I get to the right topic area in the recording," or in some cases "I think it's near the end, so I'll jump to the end and skim backward."

5.10.2.5 No Pause

… [since you normally have to] remember when he said [the point of …

5.10.2.6 Jumping

… obvious, but subjects who did not initially understand what the button did … One subject had a moment of inspiration while skimming along at a high speed, and tried the button after passing the point of interest. After using the jump button and "backward no pause," one subject noted "oh, I see the …"

5.10.2.7 Backward

"… taking units of conversation … and goes backwards." Another subject said … (normal, no pause, or skimming). … subject suggested providing feedback to indicate when sounds were being played backward, to make it easily distinguishable from forwards.

5.10.2.8 Time Compression

5.10.2.9 Buttons

5.10.2.10 Non-Speech Feedback

5.10.2.11 Search Strategies

[Footnote: … several temporary contact closures before settling to a quiescent state. The difficulties …]
[Footnote: … during the test. There was no table on which to place the touchpad, and subjects had to …]

5.10.2.12 Follow-up Questions

5.10.2.13 Desired Functionality

[Footnote: Two of the subjects did not answer the question.]

5.10.3 Thoughts for Redesign

… normal-to-fastest range could go from white to red, suggesting a cool-to- …

Fig. 5-23. Evolution of SpeechSkimmer templates. [Panels: first prototype, second prototype, third prototype, current design; regions: skim, no pause, normal, pause, jump; begin and end.]

Fig. 5-24. Sketch of a revised skimming template. [Controls include play, marks, jump, faster, and pauses removed; beginning, middle, end.]

… situations where the hands are busy, such as when transcribing or taking …

Fig. 5-25. A jog and shuttle input device.

5.10.4 Comments on Usability Testing

5.11 Software Architecture

[Footnote: Think C 5.0 provides the object-oriented features of C++, but does not include other …]

Fig. 5-26. Software architecture of the skimming system. [Modules: user input (e.g., touchpad, joystick), input mapping, main event loop, segment player, time compression.]

5.12 Use of SpeechSkimmer with BBC Radio Recordings

5.13 Summary

6 Conclusion

6.1 Evaluation of the Segmentation

… were then segmented using the technique described in section 5.9.5. The … The four highest scoring segments (i.e., the most pitch activity above the …:

  "OK, the second thing I wanted to mention was, the …"
  "… multimedia toolkit. And currently this pretty much …"
  "… very interested in and we haven't worked on a lot, is the idea …"

[Footnote: Note that many of the selected segments of this recording also contain linguistic cue …]

6.2 Future Research

6.3 Evaluation of Segmentation Techniques

[Footnote: Subjects were not required to perform this task in real time.]

6.3.1 Combining SpeechSkimmer with a Graphical Interface

6.3.2 Segmentation by Speaker Identification

6.3.3 Segmentation by Word Spotting

6.4 Summary

This dissertation presents a framework for thinking about and designing … This research provides insight into making one's ears an alternative to …

Glossary

cm: centimeter
CSCW: computer supported cooperative work
dB: decibel
dichotic: a different signal is presented to each ear
diotic: the same signal is presented to both ears
DSI: Digital Speech Interpolation
F0: fundamental frequency of voicing
HMM: Hidden Markov Model
Hz: frequency in Hertz (cycles per second)
isochronous: recurring at regular intervals
I/O: input/output
kg: kilogram
kHz: kilohertz
LPC: linear predictive coding
monotic: a signal is presented to only one ear
ms: milliseconds, 1/1000 of a second (e.g., 250 ms = 1/4 s)
RMS: root mean square
s: second
SNR: signal to noise ratio
TASI: Time Assigned Speech Interpolation
wpm: words per minute
ZCR: zero crossing rate (crossings per second)
µs: microseconds, 1/1000000 of a second

References

Aaronson 1971: D. Aaronson, N. Markowitz, and H. Shapiro. Perception and Immediate …
Adaptive 1991: Adaptive Digital Systems Inc. …
Agnello 1974: J. G. Agnello. Review of the Literature on the Studies of Pauses. In …
Arons 1989: B. Arons, C. Binding, K. Lantz, and C. Schmandt. The VOX Audio … IEEE Communications Society, …
Arons 1991a: B. Arons. Hyperspeech: Navigating in Speech-Only Hypermedia. In … ACM, New …
Arons 1991b: B. Arons. Authoring and Transcription Tools for Speech-Based … American Voice I/O Society, Sep. 1991, pp. 15–20.
Arons 1992a: B. Arons. Techniques, Perception, and Applications of Time- … American Voice I/O Society, Sep. 1992, pp. 169–177.
Arons 1992b: B. Arons. A Review of the Cocktail Party Effect. …
Arons 1992c: B. Arons. Tools for Building Asynchronous Servers to Support Speech … ACM SIGGRAPH and ACM …
Arons 1993a: B. Arons. SpeechSkimmer: Interactively Skimming Recorded Speech. In … ACM SIGGRAPH and ACM SIGCHI, ACM Press, …
Arons 1993b: B. Arons. Hyperspeech (videotape). …
Atal 1976: B. S. Atal and L. R. Rabiner. A Pattern Recognition Approach to Voiced- …
Backer 1982: D. S. Backer and S. Gano. Dynamically Alterable Videodisc Displays. In …
Ballou 1987: G. Ballou. …
Beasley 1976: D. S. Beasley and J. E. Maki. Time- and Frequency-Altered Speech. Ch. …
Birkerts 1993: S. Birkerts. Close Listening. …
Blattner 1989: M. M. Blattner, D. A. Sumikawa, and R. M. Greenberg. Earcons and …
Bly 1982: S. Bly. Presenting Information in Sound. In … ACM, New York, …
Brady 1965: P. T. Brady. A Technique for Investigating On-Off Patterns of Speech. …
Brady 1968: P. T. Brady. A Statistical Analysis of On-Off Patterns in 16 …
Brady 1969: P. T. Brady. A Model for Generating On-Off Speech Patterns in Two- …
Bregman 1990: A. S. Bregman. …
Bush 1945: V. Bush. As We May Think. …
Butterworth 1977: B. Butterworth, R. R. Hine, and K. D. Brady. Speech and Interaction in …
Buxton 1991: W. Buxton, B. Gaver, and S. Bly. … ACM SIGGCHI. Tutorial Notes. 1991.
Campanella 1976: S. J. Campanella. Digital Speech Interpolation. …
Card 1991: S. K. Card, J. D. Mackinlay, and G. G. Robertson. A Morphological …
Chalfonte 1991: B. L. Chalfonte, R. S. Fish, and R. E. Kraut. Expressive Richness: A … ACM, New York, 1991, pp. …
Chen 1992: F. R. C…
Cherry 1954: E. C. Cherry and W. K. Taylor. Some Further Experiments on the …
Cohen 1991: M. Cohen and L. F. Ludwig. Multidimensional Window Management. …
Cohen 1993: M. Cohen. Integrating Graphic and Audio Windows. …
Compernolle 1990: D. van Compernolle, W. Ma, F. Xie, and M. van Diest. Speech …
Condray 1987: R. Condray. Speed Listening: Comprehension of Time-Compressed …
Conklin 1987: J. Conklin. Hypertext: An Introduction and Survey. …
Davenport 1991: G. Davenport, T. A. Smith, and N. Pincever. Cinematic Primitives for … (Jul. 1991), 67– …
David 1956: E. E. David and H. S. McDonald. Note on Pitch-Synchronous Processing …
Davis 1993: M. Davis. Media Streams: An Iconic Visual Language for Video …
de Souza 1983: P. de Souza. A Statistical Approach to the Design of an Adaptive Self- …
Degen 1992: L. Degen, R. Mander, and G. Salomon. Working with Audio: Integrating … ACM, New York, 1992, pp. 413–418.
Dolson 1986: M. Dolson. The Phase Vocoder: A Tutorial. …
Drago 1978: P. G. Drago, A. M. Molinari, and F. C. Vagliani. Digital Dynamic …
Duker 1974: S. Duker. Summary of Research on Time-Compressed Speech. In …
Durlach 1992: N. I. Durlach, A. Rigopulos, X. D. Pang, W. S. Woods, A. Kulkarni, H. …
Edwards 1993: A. D. N. Edwards and R. D. Stevens. Mathematical Representations: …
Elliott 1993: E. L. Elliott. Watch-Grab-Arrange-See: Thinking with Motion Images …
Engelbart 1984: D. Engelbart. Authorship Provisions in AUGMENT. In …
Ericsson 1984: K. A. Ericsson and H. A. Simon. …
Fairbanks 1954: G. Fairbanks, W. L. Everitt, and R. P. Jaeger. Method for Time or …
Fairbanks 1957: G. Fairbanks and F. Kodman. Word Intelligibility as a Function of Time …
Flanagan 1985: J. L. Flanagan, J. D. Johnson, R. Zahn, and G. W. Elko. Computer- …
Foulke 1969: W. Foulke and T. G. Sticht. Review of Research on the Intelligibility and …
Foulke 1971: E. Foulke. The Perception of Time Compressed Speech. Ch. 4 in …
Furnas 1986: G. W. Furnas. Generalized Fisheye Views. In … ACM, New York, 1986, pp. 16–23.
Gan 1988: C. K. Gan and R. W. Donaldson. Adaptive Silence Deletion for Speech …
Garvey 1953a: W. D. Garvey. The Intelligibility of Abbreviated Speech Patterns. …
Garvey 1953b: W. D. Garvey. The Intelligibility of Speeded Speech. …
Gaver 1989a: W. W. Gaver. Auditory Icons: Using Sound in Computer Interfaces. …
Gaver 1989b: W. W. Gaver. The SonicFinder: An Interface that Uses Auditory Icons. …
Gaver 1993: W. W. Gaver. Synthesizing Auditory Icons. In … ACM, New York, 1993, pp. 228–235.
Gerber 1974: S. E. Gerber. Limits of Speech Time Compression. In …
Gerber 1977: S. E. Gerber and B. H. Wulfeck. The Limiting Effect of Discard Interval …
Glavitsch 1992: U. Glavitsch and P. Schäuble. A System for Retrieving Speech … ACM, New York, …
Grice 1975: H. P. Grice. Logic and Conversation. In …
Griffin 1984: D. W. Griffin and J. S. Lim. Signal Estimation from Modified Short- …
Gruber 1982: J. G. Gruber. A Comparison of Measured and Calculated Speech …
Gruber 1983: J. G. Gruber and N. H. Le. Performance Requirements for Integrated …
Grudin 1988: J. Grudin. Why CSCW Applications Fail: Problems in the Design and … ACM, New York, 1988, pp. 85–93.
Hardam 1990: E. Hardam. High Quality Time-Scale Modification of Speech Signals …
Hawley 1993: M. Hawley. Structure out of Sound. Ph.D. dissertation, MIT, Sep. 1993.
Hayes 1983: P. J. Hayes and D. R. Reddy. Steps Towards Graceful Interaction in …
Heiman 1986: G. W. Heiman, R. J. Leo, G. Leighbody, and K. Bowler. Word …
Hejna 1990: D. J. Hejna Jr. Real-Time Time-Scale Modification of Speech via the …
Hess 1976: W. J. Hess. A Pitch-Synchronous Digital Feature Extraction System for …
Hess 1983: W. Hess. …
Hindus 1993: D. Hindus, C. Schmandt, and C. Horner. Capturing, Structuring, and …
Hirschberg 1986: J. Hirschberg and J. Pierrehumbert. The Intonational Structuring of …
Hirschberg 1987: J. Hirschberg and Diane Litman. Now Let's Talk About Now: … Association for Computational Linguistics, …
Hirschberg 1992: J. Hirschberg and B. Grosz. Intonational Features of Local and Global … Defense Advanced Research …
Houle 1988: G. R. Houle, A. T. Maksymowicz, and H. M. Penafiel. Back-End … American Voice I/O Society, 1988.
Hu 1987: A. Hu. …
Jankowski 1976: J. A. Jankowski. A New Digital Voice-Activated Switch. …
Jeffries 1991: R. Jeffries, J. R. Miller, C. Wharton, and K. M. Uyeda. User Interface … ACM, New …
Kato 1992: Y. Kato and K. Hosoya. Fast Message Searching Method for Voice Mail … American Voice I/O Society, 1992, pp. 215–222.
Kato 1993: Y. Kato and K. Hosoya. Message Browsing Facility for Voice Bulletin …
Keller 1992: E. Keller. … InfoSignal Inc., Lausanne, Switzerland. 1992.
Keller 1993: E. Keller. …
Kernighan 1976: B. W. Kernighan and P. J. Plauger. …
Klatt 1987: D. H. Klatt. Review of Text-To-Speech Conversion for English. …
Knuth 1984: D. E. Knuth. …
Kobatake 1989: H. Kobatake, K. Tawa, and A. Ishida. Speech/Nonspeech Discrimination …
Lamel 1981: L. F. Lamel, L. R. Rabiner, A. E. Rosenberg, and J. G. Wilpon. An …
Lamming 1991: M. G. Lamming. Towards a Human Memory Prosthesis. Xerox …
Lamport 1986: L. Lamport. …
Lass 1977: N. J. Lass and H. A. Leeper. Listening Rate Preference: Comparison of …
Lee 1972: F. F. Lee. Time Compression and Expansion of Speech by the Sampling …
Lee 1986: H. H. Lee and C. K. Un. A Study of On-Off Characteristics of … COM- …
Levelt 1989: W. J. M. Levelt. …
Lim 1983: J. S. Lim. …
Lipscomb 1993: J. S. Lipscomb and M. E. Pique. Analog Input Device Physical …
Ludwig 1990: L. Ludwig, N. Pincever, and M. Cohen. Extending the Notion of a …
Lynch 1987: J. F. Lynch Jr., J. G. Josenhans, and R. E. Crochiere. Speech/Silence …
Mackinlay 1991: J. D. Mackinlay, G. G. Robertson, and S. K. Card. The Perspective Wall: … ACM, New York, 1991, pp. 173–179.
Makhoul 1986: J. Makhoul and A. El-Jaroudi. Time-Scale Modification in Medium to …
Maksymowicz 1990: A. T. Maksymowicz. Automatic Gisting for Voice Communications. In …
Malah 1979: D. Malah. Time-Domain Algorithms for Harmonic Bandwidth Reduction …
Malone 1988: T. W. Malone, K. R. Grant, K-Y. Lai, R. Rao, and D. Rosenblitt. Semi- …
Manandhar 1991: S. Manandhar. Activity Server: You Can Run but You Can't Hide. In …
Mauldin 1989: M. L. Mauldin. Information Retrieval by Text Skimming. Carnegie …
Maxemchuk 1980: N. Maxemchuk. An Experimental Speech Storage and Editing Facility. …
McAulay 1986: R. J. McAulay and T. F. Quatieri. Speech Analysis/Synthesis Based on a …
Mermelstein 1975: P. Mermelstein. Automatic Segmentation of Speech into Syllabic Units. …
Microtouch 1992: Microtouch Systems Inc. …
Miedema 1962: H. Miedema and M. G. Schachtman. TASI Quality: Effect of Speech …
Miller 1950: G. A. Miller and J. C. R. Licklider. The Intelligibility of Interrupted …
Mills 1992: M. Mills, J. Cohen, and Y. Y. Wong. A Magnifier Tool for Video Data. … ACM, New York, Apr. …
Minifie 1974: F. D. Minifie. Durational Aspects of Connected Speech Samples. In …
Muller 1990: M. J. Muller and J. E. Daniel. Toward a Definition of Voice Documents. … Conference on Office Information Systems (Cambridge, MA, Apr. 25– …), ACM SIGOIS and IEEE CS TC-OA, ACM Press, 1990, pp. 174–183.
Mullins 1993: A. Mullins. …
Multimedia 1989: Multimedia Lab. …
Natural 1988: Natural MicroSystems Corporation. …
Negroponte 1991: N. Negroponte. Beyond the Desktop Metaphor. Ch. 9 in …
Nelson 1974: T. Nelson. …
Neuburg 1978: E. P. Neuburg. Simple Pitch-Dependent Algorithm for High Quality …
Nielsen 1990: J. Nielsen and R. Molich. Heuristic Evaluation of User Interfaces. In … ACM, New York, 1990.
Nielsen 1991: J. Nielsen. Finding Usability Problems through Heuristic Evaluation. In … ACM, New …
Nielsen 1993a: J. Nielsen. …
Nielsen 1993b: J. Nielsen. Is Usability Engineering Really Worth It? …
Noll 1993: P. Noll. Wideband Speech and Audio Coding. …
Orr 1965: D. B. Orr, H. L. Friedman, and J. C. Williams. Trainability of Listening …
Orr 1971: D. B. Orr. A Perspective on the Perception of Time Compressed Speech. … and J. J. Jenkins. Charles E. Merrill Publishing Company, 1971. pp. 108– …
O'Shaughnessy 1987: D. O'Shaughnessy. …
O'Shaughnessy 1992: D. O'Shaughnessy. Recognition of Hesitations in Spontaneous Speech. …
Parunak 1989: H. V. D. Parunak. Hypermedia Topologies and User Navigation. In … ACM, New York, …
Pincever 1991: N. C. Pincever. If You Could See What I Hear: Editing Assistance …
Pitman 1985: K. M. Pitman. CREF: An Editing Facility for Managing Structured Text. …
Portnoff 1978: M. R. Portnoff. Time-Scale Modification of Speech Based on Short-Time …
Portnoff 1981: M. R. Portnoff. Time-Scale Modification of Speech Based on Short-Time …
Quatieri 1986: T. F. Quatieri and R. J. McAulay. Speech Transformations Based on a …
Quereshi 1974: S. U. H. Quereshi. Speech Compression by Computer. In …
Rabiner 1975: L. R. Rabiner and M. R. Sambur. An Algorithm for Determining the …
Rabiner 1989: L. R. Rabiner. A Tutorial on Hidden Markov Models and Selected …
Raman 1992a: T. V. Raman. An Audio View of (LA)TEX Documents. In …
Raman 1992b: T. V. Raman. Documents Are Not Just for Printing. In …
Reich 1980: S. S. Reich. Significance of Pauses for Speech Perception. …
Resnick 1992a: P. Resnick and R. A. Virzi. Skip and Scan: Cleaning Up Telephone … ACM, New …
Resnick 1992b: P. Resnick. HyperVoice: Groupware by Telephone. Ph.D. dissertation, …
Resnick 1992c: P. Resnick. HyperVoice: A Phone-Based CSCW Platform. In … SIGCHI and SIGOIS, ACM …
Reynolds 1993: D. A. Reynolds. A Gaussian Mixture Modeling Approach to Text- …
Richaume 1988: A. Richaume, F. Steenkeste, P. Lecocq, and Y. Moschetto. Intelligibility …
Rippey 1975: R. F. Rippey. Speech Compressors for Lecture Review. …
Roe 1993: D. B. Roe and J. G. Wilpon. Whither Speech Recognition: The Next 25 …
Rose 1991: R. C. Rose. Techniques for Information Retrieval from Speech …
Roucos 1985: S. Roucos and A. M. Wilgus. High Quality Time-Scale Modification for …
Salthouse 1984: T. A. Salthouse. The Skill of Typing. …
Savoji 1989: M. H. Savoji. A Robust Algorithm for Accurate Endpointing of Speech …
Schmandt 1984: C. Schmandt and B. Arons. A Conversational Telephone Messaging …
Schmandt 1985: C. Schmandt, B. Arons, and C. Simmons. Voice Interaction in an … American Voice I/O Society, 1985.
Schmandt 1986: C. Schmandt and B. Arons. A Robust Parser and Dialog Generator for a … American Voice I/O Society, 1986, pp. 355–365.
Schmandt 1987: C. Schmandt and B. Arons. Conversational Desktop (videotape). …
Schmandt 1988: C. Schmandt and M. McKenna. An Audio and Telephone Server for … IEEE Computer Society, Mar. 1988, pp. 150– …
Schmandt 1989: C. Schmandt and B. Arons. Getting the Word (Desktop Audio). …
Schmandt 1993: C. Schmandt. From Desktop Audio to Mobile Access: Opportunities for …
Scott 1967: R. J. Scott. Time Adjustment in Speech Synthesis. …
Scott 1972: R. J. Scott and S. E. Gerber. Pitch-Synchronous Time-Compression of …
Sheridan 1992a: T. B. Sheridan. Defining Our Terms. …
Sheridan 1992b: T. B. Sheridan. …
Silverman 1987: K. E. A. Silverman. The Structure and Processing of Fundamental …
Smith 1970: S. L. Smith and N. C. Goodwin. Computer-Generated Speech and Man- …
Sony 1993: Sony Corporation. …
Stallman 1979: R. M. Stallman. EMACS: The Extensible, Customizable, Self- …
Stevens 1993: R. D. Stevens. …
Sticht 1969: T. G. Sticht. Comprehension of Repeated Time-Compressed Recordings. …
Stifelman 1991: L. J. Stifelman. Not Just Another Voice Mail System. In … American Voice I/O Society, 1991, pp. 21–26.
Stifelman 1992a: L. J. Stifelman. VoiceNotes: An Application for a Voice Controlled …
Stifelman 1992b: L. Stifelman. …
Stifelman 1993: L. J. Stifelman, B. Arons, C. Schmandt, and E. A. Hulteen. VoiceNotes: … ACM, New …
Thomas 1990: G. S. Thomas. …
Toong 1974: H. D. Toong. A Study of Time-Compressed Speech. Ph.D. dissertation, …
Tucker 1991: P. Tucker and D. M. Jones. Voice as Interface: An Overview. …
Tufte 1990: E. Tufte. …
Voor 1965: J. B. Voor and J. M. Miller. The Effect of Practice Upon the …
Wallace 1983: W. P. Wallace. Speed Listening: Exploring an Analogue of Speed Reading. University of Nevada-Reno, technical report no. NIE-G-81- …
Want 1992: R. Want, A. Hopper, V. Falcao, and J. Gibbons. The Active Badge …
Watanabe 1990: T. Watanabe. The Adaptation of Machine Conversational Speed to …
Watanabe 1992: T. Watanabe and T. Kimura. In Review of Acoustical Patents: …
Wayman 1988: J. L. Wayman and D. L. Wilson. Some Improvements on the …
Wayman 1989: J. L. Wayman, R. E. Reinke, and D. L. Wilson. High Quality Speech …
Webster 1971: Webster. …
Weiser 1991: M. Weiser. The Computer for the 21st Century. …
Wenzel 1988: E. M. Wenzel, F. L. Wightman, and S. H. Foster. A Virtual Display …
Wenzel 1992: E. M. Wenzel. Localization in Virtual Acoustic Displays. …
Wightman 1992: C. W. Wightman and M. Ostendorf. Automatic Recognition of … vol. I, IEEE, 1992, pp. I221– …
Wilcox 1991: L. Wilcox and M. Bush. HMM-Based Wordspotting for Voice Editing …
Wilcox 1992a: L. Wilcox and M. Bush. Training and Search Algorithms for an …
Wilcox 1992b: L. Wilcox, I. Smith, and M. Bush. Wordspotting for Voice Editing and … ACM, New York, 1992, pp. 655–656.
Wilpon 1984: J. G. Wilpon, L. R. Rabiner, and T. Martin. An Improved Word- …
Wilpon 1990: J. G. Wilpon, L. R. Rabiner, C. Lee, and E. R. Goldman. Automatic …
Wingfield 1980: A. Wingfield and K. A. Nolan. Spontaneous Segmentation in Normal and …
Wingfield 1984: A. Wingfield, L. Lombardi, and S. Sokol. Prosodic Features and the …
Wolf 1992: C. G. Wolf and J. R. Rhyne. Facilitating Review of Meeting Information …
Yatsuzuka 1982: Y. Yatsuzuka. Highly Sensitive Speech Detector and High-Speed …
Zellweger 1989: P. T. Zellweger. Scripted Documents: A Hypermedia Path Mechanism. … ACM, New …
Zemlin 1968: W. R. Zemlin, R. G. Daniloff, and T. H. Shriner. The Difficulty of …