Comparing Audio Signals
What makes it difficult?
  Phase misalignment
  Deeper peaks and valleys
  Pitch misalignment
  Energy misalignment
  Embedded noise
  Length of vowels
  Phoneme variance

Review: Minimum Distance Algorithm
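       E  X  E  C  U  T  I  O  N
    0  1  2  3  4  5  6  7  8  9
 I  1  1  2  3  4  5  6  6  7  8
 N  2  2  2  3  4  5  6  7  7  7
 T  3  3  3  3  4  5  5  6  7  8
 E  4  3  4  3  4  5  6  6  7  8
 N  5  4  4  4  4  5  6  7  7  7
 T  6  5  5  5  5  5  5  6  7  8
 I  7  6  6  6  6  6  6  5  6  7
 O  8  7  7  7  7  7  7  6  5  6
 N  9  8  8  8  8  8  8  7  6  5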
Array[i,j] = min{1 + Array[i-1,j], cost(i,j) + Array[i-1,j-1], 1 + Array[i,j-1]}

Pseudo Code (minDistance(target, source))
n = number of characters in source
m = number of characters in target
(characters are indexed from 1)
Create array, distance, with dimensions n+1, m+1
FOR r=0 TO n
    distance[r,0] = r
FOR c=0 TO m
    distance[0,c] = c
FOR each row r from 1 TO n
    FOR each column c from 1 TO m
        IF source[r]=target[c] cost = 0 ELSE cost = 1
        distance[r,c] = minimum of (
            distance[r-1,c] + 1,       //insertion
            distance[r,c-1] + 1,       //deletion
            distance[r-1,c-1] + cost)  //substitution
Result is in distance[n,m]

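The pseudocode above maps almost line for line onto Python. This is a minimal sketch, not the presenter's own code: the name `min_distance` and its argument order simply mirror minDistance(target, source), and because Python strings are indexed from 0 the comparison uses r - 1 and c - 1.

```python
def min_distance(target: str, source: str) -> int:
    """Unit-cost edit distance between source and target."""
    n, m = len(source), len(target)
    # distance[r][c] = cost of aligning the first r source characters
    # with the first c target characters
    distance = [[0] * (m + 1) for _ in range(n + 1)]
    for r in range(n + 1):
        distance[r][0] = r
    for c in range(m + 1):
        distance[0][c] = c
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            cost = 0 if source[r - 1] == target[c - 1] else 1
            distance[r][c] = min(
                distance[r - 1][c] + 1,         # step down one row
                distance[r][c - 1] + 1,         # step across one column
                distance[r - 1][c - 1] + cost,  # diagonal step (substitution or match)
            )
    return distance[n][m]


print(min_distance("execution", "intention"))  # 5
```

The result for EXECUTION and INTENTION is 5, the value in the bottom-right cell of the review table.
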
Is Minimum Distance Applicable?
Maybe?
  The optimal distance at indices [a,b] is a function of the costs at smaller indices.
  This suggests that a dynamic programming approach may work.
Problems
  The cost function is more complex. A binary equal-or-not-equal test doesn’t work.
  Need to define a distance metric. But what should that metric be? Answer: It depends on which audio features we use.
  Longer vowels may still represent the same speech. The classical solution is not to apply a cost when going from index [i-1,j] or [i,j-1] to [i,j]. Unfortunately, this assumption can lead to singularities, which result in incorrect comparisons.

Complexity of Minimum Distance
The basic algorithm is O(m*n), where m is the length (in samples) of one audio signal and n is the length of the other.
If m=n, the algorithm is O(n^2). Why? Count the number of cells that need to be filled in.
O(n^2) may be too slow. Alternate solutions have been devised:
  Don’t fill in all of the cells.
  Use a multi-level approach.
Question: Are the faster approaches needed for our purposes? Perhaps not!

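To make the cell count concrete, the sketch below compares an array indexed by raw samples with one indexed by analysis frames. The 16,000 Hz sample rate and the 100 frames per second are assumptions chosen for illustration, not values from the slides, and `cells_to_fill` is a hypothetical helper.

```python
def cells_to_fill(n: int, m: int) -> int:
    """Number of cells in an (n + 1) x (m + 1) distance array."""
    return (n + 1) * (m + 1)


# Assumed: two one-second signals sampled at 16,000 Hz
print(cells_to_fill(16_000, 16_000))  # 256032001

# Assumed: the same signals reduced to 100 feature frames each
print(cells_to_fill(100, 100))        # 10201
```

Under these assumptions, comparing per-frame features instead of individual samples shrinks the array by a factor of roughly 25,000, which is one reason the faster variants may not be required.
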
Don’t Fill in all of the Cells
Problem: May miss the optimal minimum distance path

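One common way to skip cells is to fill only a band around the diagonal (in the dynamic time warping literature this is often called a Sakoe-Chiba band). The slide does not say which restriction it has in mind, so the band is an assumption, and `banded_min_distance` is a hypothetical name. The sketch reuses the unit-cost edit distance to show how a narrow band can miss the optimal path.

```python
import math


def banded_min_distance(target: str, source: str, band: int) -> float:
    """Edit distance that only fills cells with |r - c| <= band."""
    n, m = len(source), len(target)
    distance = [[math.inf] * (m + 1) for _ in range(n + 1)]
    for r in range(n + 1):
        for c in range(max(0, r - band), min(m, r + band) + 1):
            if r == 0:
                distance[r][c] = c
            elif c == 0:
                distance[r][c] = r
            else:
                cost = 0 if source[r - 1] == target[c - 1] else 1
                distance[r][c] = min(
                    distance[r - 1][c] + 1,
                    distance[r][c - 1] + 1,
                    distance[r - 1][c - 1] + cost,
                )
    # Cells outside the band stay at infinity and are never used
    return distance[n][m]


print(banded_min_distance("abcxx", "xxabc", band=5))  # 4 (every cell filled)
print(banded_min_distance("abcxx", "xxabc", band=0))  # 5 (diagonal only)
```

With the band set to 0 only the diagonal is filled, so the result is 5 instead of the true minimum of 4: the optimal path lies outside the cells that were computed.
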
The Multilevel Approach
Concept
  1. Down sample to coarsen the array
  2. Run the algorithm
  3. Refine the array (up sample)
  4. Adjust the solution
  5. Repeat steps 3-4 till the original sample rate is restored
Notes
  The multilevel approach is a common technique for reducing the complexity of many algorithms from O(n^2) to O(n lg n).
  An example is partitioning a graph to balance work loads among threads or processors.

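The coarsen, solve, refine, adjust loop can be sketched for one-dimensional sequences as follows. This is an illustrative reading of the steps, similar in spirit to FastDTW, and not necessarily the variant the slide intends. The helper names, the pairwise-averaging down sampler, the absolute-difference cost, and the `radius` and `min_size` parameters are all assumptions.

```python
import math


def dtw(a, b, window=None):
    """Return (cost, path); window optionally limits the cells that are filled."""
    n, m = len(a), len(b)
    if window is None:
        window = {(i, j) for i in range(n) for j in range(m)}
    best = {}  # cell -> (accumulated cost, predecessor cell)
    for i, j in sorted(window):
        d = abs(a[i] - b[j])
        if (i, j) == (0, 0):
            best[(i, j)] = (d, None)
            continue
        candidates = [(best[p][0], p)
                      for p in ((i - 1, j - 1), (i - 1, j), (i, j - 1))
                      if p in best]
        prev_cost, prev = min(candidates, default=(math.inf, None))
        best[(i, j)] = (d + prev_cost, prev)
    path, cell = [], (n - 1, m - 1)
    while cell is not None:  # walk predecessors back to (0, 0)
        path.append(cell)
        cell = best[cell][1]
    return best[(n - 1, m - 1)][0], path[::-1]


def coarsen(x):
    """Step 1: down sample by averaging adjacent pairs."""
    return [sum(x[i:i + 2]) / len(x[i:i + 2]) for i in range(0, len(x), 2)]


def expand_window(path, n, m, radius):
    """Step 3: project a coarse path onto the finer grid, padded by radius."""
    window = set()
    for ci, cj in path:
        for i in range(2 * ci - radius, 2 * ci + 2 + radius):
            for j in range(2 * cj - radius, 2 * cj + 2 + radius):
                if 0 <= i < n and 0 <= j < m:
                    window.add((i, j))
    return window


def multilevel_dtw(a, b, radius=1, min_size=4):
    if len(a) <= min_size or len(b) <= min_size:
        return dtw(a, b)  # step 2: run the full algorithm on the coarsest level
    _, coarse_path = multilevel_dtw(coarsen(a), coarsen(b), radius, min_size)
    window = expand_window(coarse_path, len(a), len(b), radius)
    return dtw(a, b, window)  # step 4: adjust the solution inside the window


a = [0, 0, 1, 2, 3, 3, 2, 1, 0, 0]
b = [0, 1, 2, 3, 2, 1, 0, 0, 0, 0]
print(multilevel_dtw(a, b)[0], dtw(a, b)[0])
```

Because each refinement only searches near the coarser path, the multilevel cost can never be lower than the full-array cost, and it may be higher when the true optimum falls outside the window.
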
Singularities
Assumption
  The minimum distance comparing two signals only depends on the previous adjacent entries.
  The cost function accounts for the varied length of a particular phoneme, which causes the cost in particular array indices to no longer be well-defined.
Problem: The algorithm can compute incorrectly due to mismatched alignments.
Possible solutions:
  Compare based on the change of feature values between windows instead of the values themselves.
  Pre-process to eliminate the causes of the mismatches.

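The first suggestion, comparing changes rather than values, can be illustrated with a plain first difference between consecutive windows. This is only a sketch: `delta` is a hypothetical helper, the numbers are made up, and other definitions of the change (for example a slope estimated over several windows) are possible.

```python
def delta(features):
    """Change between consecutive windows: features[t + 1] - features[t]."""
    return [nxt - cur for cur, nxt in zip(features, features[1:])]


quiet = [1, 2, 4, 4, 3]
loud = [11, 12, 14, 14, 13]  # same contour, shifted by a constant offset

raw_gap = sum(abs(x - y) for x, y in zip(quiet, loud))
delta_gap = sum(abs(x - y) for x, y in zip(delta(quiet), delta(loud)))
print(raw_gap, delta_gap)  # 50 0
```

The two sequences have the same shape, so their deltas agree exactly even though the raw values are far apart.
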
Possible Preprocessing
Remove the phase from the audio:
  Compute the Fourier transform
  Perform discrete cosine transform on the amplitudes
Normalize the energy of voiced audio:
  Compute the energy of both signals
  Multiply the larger by the percentage difference
Remove the DC offset: Subtract the average amplitude from all samples
Brick wall normalize the peaks and valleys:
  Find the average peak and valley value
  Set values larger than the average equal to the average
Normalize the pitch: Use PSOLA to align the pitch of the two signals
Remove duplicate frames: Auto correlate frames at pitch points
Remove noise from the signal: Implement a noise removal algorithm
Normalize the speed of the speech:

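Two of the simpler steps above, removing the DC offset and normalizing energy, might look like this in Python. The function names are hypothetical, and scaling the higher-energy signal by the square root of the energy ratio is one interpretation of "multiply the larger by the percentage difference", not something the slide spells out.

```python
def remove_dc_offset(samples):
    """Subtract the average amplitude from every sample."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def energy(samples):
    """Sum of squared amplitudes."""
    return sum(s * s for s in samples)


def match_energy(a, b):
    """Scale the higher-energy signal down so both energies are equal."""
    ea, eb = energy(a), energy(b)
    if ea == 0 or eb == 0:
        return a, b  # a silent signal cannot be matched by scaling
    if ea > eb:
        gain = (eb / ea) ** 0.5  # amplitude gain is the root of the energy ratio
        return [s * gain for s in a], b
    gain = (ea / eb) ** 0.5
    return a, [s * gain for s in b]


centered = remove_dc_offset([3.0, 5.0, 7.0])
print(centered)  # [-2.0, 0.0, 2.0]

scaled, reference = match_energy(centered, [1.0, -1.0, 0.0])
print(energy(scaled), energy(reference))  # 2.0 2.0
```
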
Which Audio Features?
Cepstrals: They are statistically independent and phase differences are removed
ΔCepstrals, or ΔΔCepstrals: Reflect how the signal is changing from one frame to the next
Energy: Distinguishes the frames that are voiced versus those that are unvoiced
Normalized LPC Coefficients: Represent the shape of the vocal tract, normalized by vocal tract length for different speakers
These are the popular features used for speech recognition.

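Of these features, energy is the easiest to compute directly from the samples. The sketch below computes energy per non-overlapping frame and flags frames above a threshold. The frame size, the threshold, the sample values, and the function names are illustrative assumptions; a practical voiced/unvoiced decision would likely need more than a fixed energy threshold.

```python
def frame_energies(samples, frame_size):
    """Energy (sum of squares) of each full, non-overlapping frame."""
    return [
        sum(s * s for s in samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, frame_size)
    ]


def high_energy_frames(samples, frame_size, threshold):
    """True for frames whose energy exceeds the (assumed) threshold."""
    return [e > threshold for e in frame_energies(samples, frame_size)]


signal = [0.0, 0.1, -0.1, 0.0, 0.9, -0.8, 0.7, -0.9]
print(high_energy_frames(signal, frame_size=4, threshold=0.5))  # [False, True]
```
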
Which Distance Metric?
General Formula: array[i,j] = distance(i,j) + min{array[i-1,j], array[i-1,j-1], array[i,j-1]}
Assumption: There is no cost assessed for duplicate or eliminated frames.
Distance Formula:
  Euclidean: sum the squares of the differences between corresponding features
  Linear: sum the absolute values of the differences between corresponding features
  Weighting the features: Multiply each feature’s difference by a weighting factor to give greater or lesser emphasis to certain features
Example of a distance metric using linear distance:
  ∑ w[i] |fa[i] - fb[i]|, where fa[i] and fb[i] are the values of a particular audio feature for signals a and b, and w[i] is that feature’s weight
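The general formula and the weighted linear distance can be combined in a short sketch. The function names, the two-feature frames, and the weights are made up for illustration. The array is indexed from 0 here, so cell [0][0] holds just the distance between the first frames.

```python
def weighted_linear_distance(fa, fb, weights):
    """Sum of w[i] * |fa[i] - fb[i]| over the features of two frames."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, fa, fb))


def alignment_cost(frames_a, frames_b, weights):
    """array[i][j] = distance(i, j) + min of the three adjacent earlier cells."""
    n, m = len(frames_a), len(frames_b)
    inf = float("inf")
    array = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = weighted_linear_distance(frames_a[i], frames_b[j], weights)
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(
                    array[i - 1][j] if i > 0 else inf,
                    array[i - 1][j - 1] if i > 0 and j > 0 else inf,
                    array[i][j - 1] if j > 0 else inf,
                )
            array[i][j] = d + best  # no extra charge for repeating a frame
    return array[n - 1][m - 1]


weights = [1.0, 0.5]
short_vowel = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
long_vowel = [[1.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]  # first frame held longer

print(weighted_linear_distance([1.0, 2.0], [2.0, 4.0], weights))  # 2.0
print(alignment_cost(short_vowel, long_vowel, weights))           # 0.0
```

Because repeated frames carry no penalty, the sequence with the longer first frame aligns to the shorter one at zero cost, which is the behavior the assumption above describes.
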