Slide 1
Predicting Voice Elicited Emotions
Nishant Pandey

Slide 2
Synopsis
Problem statement and motivation
Previous work and background
System
Intuition and Overview
Pre-processing of audio signals
Building feature space
Finding patterns in unlabelled data and labelling of samples
Regression Results
Deployed System
Market Research

Slide 3
Motivation
Automate the screening process in service-based industries
Hourly job workers (two-thirds of the U.S. labour force, or ~50 million job seekers every year)
Problem Statement
To analyse voice and predict the listener emotions elicited by the paralinguistic elements of the voice.

Slide 4
Previous work
Current work focuses on predicting the elicited emotions of voice clips.
Two sets of goals, which include recognizing:
the types of personality traits intrinsically possessed by the speaker, e.g. speaker trait and speaker state
the types of emotions carried within the speech clip, e.g. acoustic affect (cheerful, trustworthy, deceitful, etc.)

Slide 5
Background – Emotion Taxonomy
The framework articulated by “FEELTRACE” includes all the emotion responses we want to predict.
Emotions are described by finite, quantifiable dimensions.

Slide 6
Features – Paralinguistic Features of Voices

Concept | Definition | Data Representation
Amplitude | measurement of the variations over time of the acoustic signal | quantified values of a sound wave’s oscillation
Energy | acoustic signal energy representation in decibels | 20*log10(abs(FFT))
Formants | the resonance frequencies of the vocal tract | maxima detected using linear prediction on audio windows with high tonal content
Perceived pitch | perceived fundamental frequency and harmonics | formants
Fundamental frequency | the reciprocal of the time duration of one glottal cycle – a strict definition of “pitch” | first formant

Slide 7
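The Energy row above is a direct formula. A minimal NumPy sketch of it, where the 440 Hz test tone, 16 kHz sample rate, and 20 ms frame length are illustrative assumptions rather than values from the slides:

```python
import numpy as np

def frame_energy_db(frame):
    """Energy representation of one audio frame in decibels: 20*log10(abs(FFT))."""
    spectrum = np.fft.rfft(frame)
    # Small epsilon avoids log(0) on silent bins.
    return 20.0 * np.log10(np.abs(spectrum) + 1e-12)

# Illustrative 20 ms frame of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(int(0.02 * sr)) / sr
frame = np.sin(2 * np.pi * 440 * t)
db = frame_energy_db(frame)
print(db.shape)  # one dB value per FFT bin
```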
System – Intuition
Spectrogram of two job applicants responding to “Greet me as if I am a customer”

Slide 8
System – Overview

Slide 9
System – Pre-Processing of Audio Signals
Pre-processing tasks involve:
Removing voice clips shorter than 2 seconds and clips containing noise
Converting the audio signal to data in the time and frequency domains
Short-term Fast Fourier Transform per frame
Energy measures in the frequency domain per frame
Linear prediction coefficients in the frequency domain per frame

Slide 10
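A minimal sketch of these pre-processing steps, assuming clips arrive as mono NumPy arrays; the 16 kHz sample rate and 25 ms / 10 ms framing are common speech-processing defaults assumed here, not taken from the slides:

```python
import numpy as np
from scipy.signal import stft

SR = 16000  # assumed sample rate

def preprocess(clips):
    """Drop clips shorter than 2 s, then compute a short-term FFT per frame."""
    kept = [c for c in clips if len(c) >= 2 * SR]
    features = []
    for clip in kept:
        # 25 ms windows with a 10 ms hop (assumed settings).
        f, t, Z = stft(clip, fs=SR, nperseg=400, noverlap=240)
        mag_db = 20.0 * np.log10(np.abs(Z) + 1e-12)  # per-frame energy in dB
        features.append(mag_db)
    return features

clips = [np.random.randn(3 * SR), np.random.randn(SR)]  # 3 s and 1 s clips
feats = preprocess(clips)
print(len(feats))  # the 1 s clip is dropped
```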
System - Feature Space Construction
We experimented with feature construction based on the following dimensions and combinations:
Signal measurements such as energy and amplitude
Statistics such as min, max, mean, and standard deviation on signal measurements
Measurement window in the time domain: different time sizes and the entire time window
Measurement window in the frequency domain: all frequencies, optimal audible frequencies, and selected frequency ranges

Slide 11
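The statistics-over-windows idea can be sketched as follows; the 0.1 s window (1600 samples at an assumed 16 kHz) is illustrative, not a value from the slides:

```python
import numpy as np

def window_stats(signal, win=1600):
    """min/max/mean/std of a signal measurement over fixed-size windows,
    plus the same four statistics over the entire time window."""
    n = len(signal) // win
    windows = signal[: n * win].reshape(n, win)
    per_window = np.column_stack([
        windows.min(axis=1), windows.max(axis=1),
        windows.mean(axis=1), windows.std(axis=1),
    ])
    whole = np.array([signal.min(), signal.max(), signal.mean(), signal.std()])
    return per_window, whole

sig = np.random.randn(16000)  # stand-in for one second of a signal measurement
per_window, whole = window_stats(sig)
print(per_window.shape, whole.shape)  # (10, 4) (4,)
```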
System – Labels and Right set of Features?
Conventional approach – getting voice samples rated by experts
Unsupervised learning – analyse features and their effectiveness
Process:
Unsupervised learning is used to find patterns in unlabelled data.
Training data sets are then constructed based on the clustering results and manual labelling.

Slide 12
System – How do we get the labels? Contd.
Parameters
Cost functions:
Connectivity
Dunn index
Silhouette
Clustering results
Technique: hierarchical clustering
Number of clusters: 5
Manual validation of clusters was also done.

Slide 13
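A hedged sketch of this step with scikit-learn, using synthetic blobs in place of the real acoustic feature vectors; of the three cost functions listed, only the silhouette has a scikit-learn implementation, so it alone is shown:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy stand-in for the feature space: five well-separated Gaussian blobs.
X = np.vstack([rng.normal(loc=5 * k, scale=0.5, size=(40, 3)) for k in range(5)])

# Hierarchical (agglomerative) clustering with 5 clusters, as on the slide.
labels = AgglomerativeClustering(n_clusters=5).fit_predict(X)
score = silhouette_score(X, labels)  # closer to 1.0 means tighter clusters
print(sorted(set(labels)), round(score, 2))
```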
System – Visualization of Clusters

Slide 14
System – Modelling
Supervised learning algorithms:
Logistic Regression
Support Vector Machine
Random Forest
Semi-supervised learning algorithm:
KODAMA
Output:
Binary outcome (positive or negative)
Numerical scores

Slide 15
Case Study – Modelling
Prediction – Positive vs Negative Response
A positive response could be one or multiple perceptions of a “pleasant voice”, “makes me feel good”, “cares about me”, “makes me feel comfortable”, or “makes me feel engaged”.
System: V1 uses SVM; V2 uses Random Forest
Interview prompt: “Greet me as if I am a customer”

Slide 16
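A minimal sketch of the V1 (SVM) versus V2 (Random Forest) comparison; the features and binary labels below are synthetic stand-ins, not the interview recordings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy stand-in for acoustic feature vectors with positive/negative labels.
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for name, model in [("SVM (V1)", SVC(probability=True)),
                    ("Random Forest (V2)", RandomForestClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    # predict gives the binary outcome; predict_proba gives numerical scores.
    acc = model.score(X_te, y_te)
    print(name, round(acc, 2))
```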
System - Prediction Results
Accuracy: 0.86
95% CI: (0.76, 0.92)
P-value [Acc > NIR]: 5.76e-07
Sensitivity: 0.81
Specificity: 0.88
Pos Pred Value: 0.81
Neg Pred Value: 0.88

Slide 17
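All of these metrics derive from a 2x2 confusion matrix. A sketch of the formulas, using hypothetical counts since the deck does not show the underlying confusion matrix:

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity, specificity, and positive/negative
    predictive values from the four cells of a confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),   # true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }

# Hypothetical counts chosen only to illustrate the formulas.
m = binary_metrics(tp=30, fp=7, fn=7, tn=50)
print({k: round(v, 2) for k, v in m.items()})
```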
System – Prediction Results (KODAMA)
KODAMA performs feature extraction from noisy and high-dimensional data.
The output of KODAMA includes a dissimilarity matrix, from which we can perform clustering and classification.

Slide 18
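Clustering from a precomputed dissimilarity matrix can be sketched with SciPy; the Euclidean matrix below is a stand-in for KODAMA's actual output:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
# Two synthetic groups standing in for real samples.
X = np.vstack([rng.normal(0, 0.3, (20, 4)), rng.normal(3, 0.3, (20, 4))])

# Stand-in for the dissimilarity matrix a KODAMA run would produce.
dissimilarity = squareform(pdist(X))

# Hierarchical clustering directly from the matrix (condensed form for SciPy).
Z = linkage(squareform(dissimilarity), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(sorted(set(labels)))  # [1, 2]
```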
Deployed System

Slide 19
Market Research
Demographics matter:
Young listeners (18–29 years old) and listeners with income less than $29,000/year have stricter criteria for how they sense engagement.
No correlation between elicited emotion and age, ethnicity, or education level.
Bias towards female voices.

Slide 20
Thanks

Slide 21
Time and Frequency Domain
Time domain:
https://en.wikipedia.org/wiki/Time_domain#/media/File:Fourier_transform_time_and_frequency_domains_(small).gif
Frequency domain:
https://en.wikipedia.org/wiki/Frequency_domain#/media/File:Fourier_transform_time_and_frequency_domains_(small).gif

Slide 22
Learnings – Difference in Voice Characteristics
The result improves by 10% when a decision tree based on voice-characteristic features is layered on top of the Random Forest.

Slide 23
Prediction Results – SVM vs Random Forest