Finding Musically Meaningful Words by Sparse CCA

David A. Torres*, Douglas Turnbull
Department of Computer Science and Engineering
University of California, San Diego, La Jolla, CA 92093
{datorres@cs.ucsd.edu, dturnbul@cs.ucsd.edu}

Bharath K. Sriperumbudur, Luke Barrington, Gert R. G. Lanckriet
Department of Electrical and Computer Engineering
University of California, San Diego, La Jolla, CA 92093
{bharathsv@ucsd.edu, lbarrington@ucsd.edu, gert@ece.ucsd.edu}

* Corresponding author

Abstract

A musically meaningful vocabulary is one of the keystones in building a computer audition system that can model the semantics of audio content. If a word in the vocabulary is not clearly represented by the underlying acoustic representation, the word can be considered noisy and should be removed from the vocabulary. This paper proposes an approach to construct a vocabulary of predictive semantic concepts based on sparse canonical component analysis (sparse CCA). The goal is to find words that are highly correlated with the underlying audio feature representation, with the expectation that these words can be modeled more accurately. Experimental results illustrate that, by identifying these musically meaningful words, we can improve the performance of a previously proposed computer audition system for music annotation and retrieval.

1 Introduction

Over the past two years we have been developing a computer audition system that can annotate songs with semantically meaningful words and retrieve relevant songs based on a text query. This system learns a joint probabilistic model between a vocabulary of words and acoustic feature vectors using a heterogeneous data set of songs and song annotations. However, if a specific word is not well represented by the acoustic features, the system will not be able to model this noisy word accurately. In this paper, we explore the problem of vocabulary selection, in which our goal is to discover words that can be modeled accurately and discard those words that cannot.

We propose the concept of acoustic correlation as an indicator for picking candidate words. The goal is to find words in a vocabulary that have a high correlation with the audio feature representation. The expectation is that this pattern between words and audio will be more readily captured by our models.

Previously, we collected annotations of music using various methods: text-mining song reviews [16], conducting a human survey [17], and exploring the use of a human computation game [18, 20]. In all cases, we were forced to choose a vocabulary using ad-hoc methods. For example, text-mining the song reviews resulted in a list of over 1,000 candidate words, which the authors manually pruned if there was a general consensus that a word was not 'musically relevant'. To collect the survey and game data, we built, a priori, a two-level hierarchical vocabulary by first considering a set of high-level semantic categories ('Instrumentation', 'Emotion', 'Vocal Characteristic', 'Genre') and then listing low-level words ('Electric Guitar', 'Happy', 'Breathy', 'Bebop') for each semantic category. In both cases, a vocabulary required manual construction and included some noisy words that degraded the performance of our computer audition system.

Presumably, one reason that certain words cause problems for our system is that the acoustic representation of these words is hard to model. This relates to the expressive power of our chosen audio feature representation. For example, if we are interested in words related to long-term music structure (e.g., '12-bar blues') and we only represent the audio using short-term (1 sec) audio feature vectors, we may be unable to model such concepts. Another example is words that relate to a geographical association (e.g., 'British Invasion', 'Woodstock'), which may have strong cultural roots but are poorly represented in the audio content.

In this paper and in our latest research [15], given an audio feature representation, we wish to identify words that are represented well by the audio content before we try to model them. To do this, we propose the use of an unsupervised method based on canonical correlation analysis (CCA) to measure acoustic correlation. CCA is a method of exploring correlations between different representations of some underlying data, where these representations exist in two different feature spaces (vector spaces). For example, in this paper the information we wish to model, the semantics of music, is represented as text labels and audio features. CCA has also been used in applications dealing with multi-language text analysis [19], learning semantic representations between images and text [2], and localizing pixels which are correlated with audio in a video stream [6].

Similar to the way principal component analysis (PCA) finds informative vectors in one feature space by maximizing the variance of projected data, CCA finds pairs of vectors within two feature spaces that maximize correlation. Put another way, CCA finds a one-dimensional subspace within each feature space such that the projection of data points onto their respective subspaces maximizes the correlation between these one-dimensional projections.

Given music data represented in both a semantic feature space and an acoustic feature space, we propose that these vectors of maximal correlation can be used to find words that are strongly characterized by an audio representation. Specifically, the CCA solution vector, or canonical component, that corresponds to the semantic representation is a mapping of vocabulary words to weights, where a high weight implies that a given word is highly correlated with the audio feature representation. We interpret high weights as singling out words that are "musically meaningful". In other words, the underlying relationship (a correlation) between word and audio feature representation hints that a word may be more accurately modeled than a word for which no such relationship exists.

Generally, CCA returns a mapping of words to weights that is non-sparse, meaning that each word will be mapped to a non-zero weight. Hence, selecting good words from this vocabulary would be done by thresholding the weights or by some other similar heuristic. It is desirable to remove such an arbitrary step from the vocabulary selection process. This is done by imposing sparsity on the solution weights; that is, we modify the CCA problem so that it tends to give solutions with few non-zero weights (sparse CCA [13]). This leads to a very natural criterion for selecting a vocabulary: throw away words with weights equal to zero.

2 Acoustic Correlation with CCA

Canonical correlation analysis, or CCA, is a method of exploring dependencies between data which are represented in two different, but related, feature spaces. For example, consider a set of songs where each song is represented by both a semantic annotation vector and an audio feature vector. An annotation vector for a song is a real-valued (or binary) vector where each element represents the strength of association between the song and a word from our vocabulary. An audio feature vector is a real-valued vector of statistics calculated from the digital audio signal. It is assumed that the two spaces share some joint information which can be captured in the form of correlations between the data in the two feature spaces. CCA finds a one-dimensional projection of the data in each space such that the correlation between the projections is maximized.

More formally, consider two data matrices, A and S, from two different feature spaces. The rows of A contain music data represented in the audio feature space A. The corresponding rows of S contain the music data represented in the semantic annotation space S (e.g., annotation vectors). CCA seeks to optimize

  max_{w_a ∈ A, w_s ∈ S} { w_aᵀ Σ_as w_s : w_aᵀ Σ_aa w_a = 1, w_sᵀ Σ_ss w_s = 1 }    (1)

where Σ = [Σ_aa Σ_as; Σ_sa Σ_ss] is the covariance matrix of [A|S]. By formulating the Lagrangian dual of Eq. (1), it can be shown that solving Eq. (1) is equivalent to solving a pair of maximum eigenvalue problems,

  Σ_aa⁻¹ Σ_as Σ_ss⁻¹ Σ_sa w_a = ρ² w_a,    (2)
  Σ_ss⁻¹ Σ_sa Σ_aa⁻¹ Σ_as w_s = ρ² w_s,    (3)

with ρ being the maximum value of Eq. (1).

The solution vectors, w_a and w_s, in Eq. (1) are generally non-sparse, meaning that the coefficients in the vectors tend to be non-zero. In many applications it may be of interest to limit the number of non-zero elements in the solution vector, as this aids in the interpretability of the results. For example, in this application the solution vector w_s can be interpreted as a mapping of words to weights, where a high weight implies that a given word is highly correlated with the audio feature representation. In the next section we impose sparsity on w_s, thereby turning CCA into an explicit vocabulary selection mechanism where words with zero weight (i.e., negligible correlation) are thrown away.

2.1 Sparse CCA

While PCA and its two-view extension, CCA, are well studied and understood, to our knowledge the sparse version of CCA is not. Different algorithms [5, 23, 1, 13], both convex and non-convex, have been proposed for sparse PCA. Recently, in [13], we proposed a general framework for solving sparse eigenvalue problems using d.c. programming (difference of convex functions) [3], which we extend here to derive a sparse CCA algorithm.

Consider the variational formulation of CCA in Eq. (1), which can be written as a generalized eigenvalue problem, max_x { xᵀPx : xᵀQx = 1 }, where

  P = [ 0     Σ_as ]      Q = [ Σ_aa  0    ]      x = [ w_a ]
      [ Σ_sa  0    ],         [ 0     Σ_ss ],         [ w_s ].

Therefore, the variational formulation for sparse CCA is given by

  max_x { xᵀPx : xᵀQx = 1, ‖x‖₀ ≤ k },    (4)

where 1 ≤ k ≤ n and n = dim(A) + dim(S). Eq. (4) is non-convex, NP-hard and therefore intractable (see [13] for a detailed discussion). A related problem to Eq. (4) is given by

  max_x { xᵀPx − ρ̃ ‖x‖₀ : xᵀQx ≤ 1 },    (5)

where ρ̃ > 0 is the penalization parameter that controls sparsity. Usually, ‖x‖₀ is replaced by ‖x‖₁ to achieve convexity. However, in our setting, since P is indefinite, the ℓ₁ approximation does not yield any computational advantage. So, we use a better approximation than ℓ₁, given by ∑ᵢ₌₁ⁿ log|xᵢ|, leading to the following program,

  max_x { xᵀPx − ρ̃ ∑ᵢ₌₁ⁿ log|xᵢ| : xᵀQx ≤ 1 }.    (6)

This approximation has shown superior performance in sparse PCA experiments [13] and SVM feature selection experiments [21]. With P being indefinite, Eq. (6) can be reduced to a d.c. program as

  min_x { τ‖x‖₂² − xᵀ[P + τI]x + ρ̃ ∑ᵢ₌₁ⁿ log|xᵢ| : xᵀQx ≤ 1 },    (7)

where τ ≥ λ_max(P). Using the d.c. minimization algorithm (DCA) [14], we get the sparse CCA algorithm given by Algorithm 1, which is a sequence of quadratic programs. In our experiments, we impose a sparsity constraint only on the solution in the semantic space, S; therefore we need a constraint on ‖w_s‖₀ instead of on the overall vector, ‖x‖₀. In such a case we have

  x* = argmin_x { τ xᵀD²(x_l)x − 2 x_lᵀ[P + τI]D(x_l)x + ‖ρ̃ ∘ x‖₁ : xᵀD(x_l)QD(x_l)x ≤ 1 }    (8)

where D(x) = diag(x), (p ∘ q)ᵢ = pᵢqᵢ denotes the elementwise product, and ρ̃ = [0, …, 0, ρ̃, …, ρ̃]ᵀ, with dim(A) zeros followed by dim(S) copies of ρ̃.

Algorithm 1 Sparse CCA
Require: symmetric P, Q ≻ 0, and ρ̃ > 0
1: Choose x₀ ∈ {x : xᵀQx ≤ 1} arbitrarily
2: repeat
3:   x* = argmin_x { τ xᵀD²(x_l)x − 2 x_lᵀ[P + τI]D(x_l)x + ‖ρ̃ ∘ x‖₁ : xᵀD(x_l)QD(x_l)x ≤ 1 }
4:   x_{l+1} = x_l ∘ x*
5: until x_{l+1} = x_l
6: return x_l, x*
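As a concrete illustration of the non-sparse case, the pair of maximum-eigenvalue problems in Eqs. (2)–(3) can be solved directly with numpy. The sketch below uses synthetic two-view data standing in for the audio and semantic matrices A and S; the dimensions, the shared-latent-signal construction, and all variable names are illustrative assumptions, not the paper's actual CAL500 setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim_a, dim_s = 2000, 5, 4

# Synthetic stand-ins for the audio (A) and semantic (S) views:
# a shared latent signal z induces correlation between the two views.
z = rng.normal(size=(n, 1))
A = np.hstack([z, rng.normal(size=(n, dim_a - 1))])
S = np.hstack([0.8 * z + 0.2 * rng.normal(size=(n, 1)),
               rng.normal(size=(n, dim_s - 1))])

# Center both views and form the covariance blocks of [A|S].
A -= A.mean(axis=0)
S -= S.mean(axis=0)
Saa, Sss, Sas = A.T @ A / n, S.T @ S / n, A.T @ S / n

# Eq. (2): Saa^{-1} Sas Sss^{-1} Ssa w_a = rho^2 w_a
M = np.linalg.solve(Saa, Sas) @ np.linalg.solve(Sss, Sas.T)
vals, vecs = np.linalg.eig(M)
i = int(np.argmax(vals.real))
rho = float(np.sqrt(vals.real[i]))        # maximal canonical correlation
w_a = vecs[:, i].real
w_s = np.linalg.solve(Sss, Sas.T) @ w_a   # Eq. (3) solution, up to scale

# The correlation of the two one-dimensional projections recovers rho.
corr = float(np.corrcoef(A @ w_a, S @ w_s)[0, 1])
```

Note that the w_s produced this way is dense, so one would still have to threshold it heuristically; sparse CCA replaces this eigendecomposition with the sequence of quadratic programs in Algorithm 1 so that many entries of w_s come out exactly zero.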
For our purposes, Eq. (8) becomes a vocabulary selection mechanism which takes as its input the semantic and audio feature representations for a set of songs, S and A, and a penalization parameter ρ̃ that controls sparsity. The method returns a set of sparse weights w_s associated with each word in the vocabulary. Words with a weight of zero are removed from our modeling process. The non-zero elements of the sparse solution vector w_s can be interpreted as those words which have a high correlation with the audio representation. Thus, in the experiments that follow, setting values of ρ̃ and solving Eq. (8) reduces to a vocabulary selection technique.

3 Representing Audio and Semantic Data

In this section we describe the audio and semantic representations, as well as the CAL500 [17] and Web2131 [16] annotated music corpora that are used in our experiments. In both cases, the semantic information will be represented using a single annotation vector s with dimension equal to the size of the vocabulary. The audio content will be represented as multiple feature vectors {a₁, …, a_T}, where T depends on the length of the song. The construction of the matrices A and S to solve the sparse CCA follows: each feature vector in the music corpus is associated with the label for its song. For example, for a given song, we duplicate its annotation vector a total of T times, so that the song-label pairs may be represented as {(s, a₁), …, (s, a_T)}. To construct A we stack the feature vectors for all songs in the corpus into one matrix. S is constructed by stacking all the corresponding annotation vectors into one matrix. If each song has approximately 600 feature vectors and we have 500 songs, then both A and S will have about 300,000 rows.

3.1 Audio Representation

Each song is represented as a bag of feature vectors: we extract an unordered set of feature vectors for every song, by extracting one feature vector for each short-time segment of audio data. Specifically, we compute dynamic Mel-frequency cepstral coefficients (dMFCC) from each half-overlapping, medium-time (743 ms) segment of audio content [9].

Mel-frequency cepstral coefficients (MFCC) describe the spectral shape of a short-time audio frame and are popular features for speech recognition and music classification (e.g., [10, 7, 12]). We calculate 13 MFCC coefficients for each short-time (23 ms) frame of audio. For each of the 13 MFCCs, we take a discrete Fourier transform (DFT) over a time series of 64 frames, normalize by the DC value (to remove the effect of volume) and summarize the resulting spectrum by integrating across 4 modulation frequency bands: (unnormalized) DC, 1–2 Hz, 3–15 Hz and 20–43 Hz. Thus, we create a 52-dimensional feature vector (4 features for each of the 13 MFCCs) for every 3/4-second segment of audio content. For a five-minute song, this results in about 800 52-dimensional feature vectors.

3.2 Semantic Representation

The CAL500 is an annotated music corpus of 500 western popular songs by 500 unique artists. Each song has been annotated by a minimum of 3 individuals using a vocabulary of 174 words. We paid 66 undergraduate music students to annotate our music corpus with semantic concepts. We collected a set of semantic labels created specifically for a music annotation task. We considered

Top 3 words by semantic category (acoustic correlation)
  overall:     rapping, at a party, hip-hop/rap
  emotion:     arousing/awakening, exciting/thrilling, sad
  genre:       hip-hop/rap, electronica, funk
  instrument:  drum machine, samples, synthesizer
  general:     heavy beat, very danceable, synthesized texture
  usage:       at a party, exercising, getting ready to go out
  vocals:      rapping, strong, altered with effects

Bottom 3 words by semantic category (acoustic correlation)
  overall:     not weird, not arousing, not angry/aggressive
  emotion:     not weird, not arousing, not angry/aggressive
  genre:       classic rock, bebop, alternative folk
  instrument:  female lead vocals, drum set, acoustic guitar
  general:     constant energy level, changing energy level, not catchy
  usage:       going to sleep, cleaning the house, at work
  vocals:      high pitches, falsetto, emotional
Table 1: Top and bottom 3 words by semantic category found by sparse CCA.

135 musically-relevant concepts spanning six semantic categories: 29 instruments were annotated as present in the song or not; 22 vocal characteristics were annotated as relevant to the singer or not; 36 genres, a subset of the Codaich genre list [8], were annotated as relevant to the song or not; 18 emotions, found by Skowronek et al. [11] to be both important and easy to identify, were rated on a scale from one to three (e.g., "not happy", "neutral", "happy"); 15 song concepts describing the acoustic qualities of the song, artist and recording (e.g., tempo, energy, sound quality); and 15 usage terms from [4] (e.g., "I would listen to this song while driving, sleeping, etc.").

The 135 concepts are converted to the 174-word vocabulary by first mapping bi-polar concepts to multiple word labels ('Energy Level' maps to 'low energy' and 'high energy'). Then we prune all words that are represented in five or fewer songs, to remove under-represented words. Lastly, we construct a real-valued 174-dimensional annotation vector by averaging the label frequencies of the individual annotators. Details of the summarization process can be found in [17]. In general, each element in the annotation vector contains a real-valued scalar indicating the strength of association.

The Web2131 is an annotated collection of 2131 songs and accompanying expert song reviews mined from a web-accessible music database¹ [16]. Exactly 363 songs from Web2131 overlap with the CAL500 songs. The vocabulary consists of 317 words that were hand-picked from a list of the common words found in the corpus of song reviews. Common stop words are removed and the resulting words are preprocessed with a custom stemming algorithm. We represent a song review as a binary 317-dimensional annotation vector. The element of a vector is 1 if the corresponding word appears in the song review and 0 otherwise.

¹ AMG All Music Guide, www.allmusic.com

4 Experiments

4.1 Qualitative Results

We solve a series of sparse CCA problems, each time increasing the sparsity parameter corresponding to the semantic space. This generates a series of vocabularies ranging in size, which effectively gives us a "ranking" of words. The first three and last three words to be eliminated (i.e., the bottom three and top three words, respectively) can be found in Table 1. The top three words were the last words to be eliminated in the sequence and are words that are highly correlated with audio content. The bottom three words were those removed in the very first step of the sequence; these words presumably have very low correlation with the audio content.

  vocab. size       488   249   203   149   103    50
  # CAL500 words    173   118   101    85    65    39
  # Web2131 words   315   131   102    64    38    11
  % Web2131         .64   .52   .50   .42   .36   .22

Table 2: The fraction of noisy web-mined words in a vocabulary as vocabulary size is reduced: as the size shrinks, sparse CCA prunes noisy words and the web-mined words are eliminated over higher-quality CAL500 words.

4.2 Quantitative Results

4.2.1 Vocabulary Pruning using Sparse CCA

Sparse CCA can be used to perform vocabulary selection where the goal is to prune noisy words from a large vocabulary. To test this hypothesis we combined the vocabularies from the CAL500 and Web2131 data sets and consider the subset of 363 songs that are found in both data sets.

Based on our own informal user study, we found that the Web2131 annotations are noisy when compared to the CAL500 annotations. We showed subjects 10 words from each data set and asked them which set of words were relevant to a song. The Web2131 annotations were not much better than selecting words randomly from the vocabulary, whereas CAL500 words were mostly considered relevant. Because Web2131 was found to be noisier than CAL500, we expect sparse CCA to filter out more of the Web2131 vocabulary.

Table 2 shows the results of this experiment. In the experiment we select a value for the sparsity parameter and obtain a sparse vocabulary. Then we record how many noisy Web2131 words comprise the resulting vocabulary. The first column in Table 2 reflects the vocabulary with no sparsity constraints. Because the starting vocabularies are of different sizes, Web2131 initially comprises .64 of the combined vocabulary. Subsequent columns in the table show the resulting vocabulary sizes as we increase ρ̃ and, consequently, reduce the vocabulary size. Noisy Web2131 words are discarded by our vocabulary selection process at a faster rate than the cleaner CAL500 vocabulary, upholding our prediction that vocabulary selection by acoustic correlation should tend to remove noisy words. The results imply that Web2131 contains more words that are uncorrelated with the audio representation. The different ways that these data sets were collected reinforce this fact. Web2131 was mined from a collection of music reviews; the words used in a music review are not explicitly chosen by the writer because they describe a song. This is the exact opposite condition under which the CAL500 data set was collected, in which human subjects were specifically asked to label songs in a controlled environment.

4.2.2 Vocabulary Selection for Music Retrieval

In this experiment we apply our vocabulary selection technique to a semantic music annotation and retrieval system. In brief, our system estimates the conditional probabilities of audio feature vectors given words in the vocabulary, P(song|word). These conditional probabilities are modeled as Gaussian mixture models. With these probability distributions in hand, our system can annotate a novel song with words from its vocabulary, or it can retrieve an ordered list of (novel) songs based on a keyword query. A full description of this system can be found in [17].

One useful evaluation metric for this system is the area under the receiver operating characteristic (AROC) curve. (A ROC curve is a plot of the true positive rate as a function of the false positive rate as we move down a ranked list of songs given a keyword input.) For each word, its AROC ranges between 0.5 for a random ranking and 1.0 for a perfect ranking. Average AROC is used to describe the performance of our entire system and is found by averaging AROC over all words in the vocabulary.

This final performance metric is brought down by words that are difficult to model: such words have a low AROC, so they pull the average AROC down. We propose to use sparse CCA to discover these difficult words prior to modeling. (Alternatively, one can say that we are using sparse CCA to keep "easy" words that exhibit high correlation, as opposed to discarding difficult words.)
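Since AROC drives all of the quantitative comparisons that follow, a small self-contained sketch may help. This uses the standard pairwise (Wilcoxon) equivalence of the area under the ROC curve; the scores, labels, and function name are made up for illustration and are not code from the paper:

```python
import numpy as np

def aroc(scores, relevant):
    """Area under the ROC curve for one word's retrieval ranking:
    the probability that a relevant song is scored above an
    irrelevant one, counting ties as one half."""
    s = np.asarray(scores, dtype=float)
    r = np.asarray(relevant, dtype=bool)
    pos, neg = s[r], s[~r]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

# A perfect ranking scores 1.0; a reversed ranking scores 0.0;
# a random ranking hovers around 0.5.
perfect = aroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])  # 1.0

# The system-level metric is the mean AROC over all vocabulary words.
word_arocs = [perfect, aroc([0.3, 0.9, 0.4, 0.6], [1, 0, 1, 0])]
mean_aroc = float(np.mean(word_arocs))
```

This makes concrete why pruning low-AROC words raises the mean: each word contributes its own AROC, and removing words near 0.5 from the vocabulary lifts the average without changing any individual word model.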
Figure 1: Comparison of vocabulary selection techniques (mean average AROC vs. vocabulary size): we compare retrieval performance using vocabulary selection by human agreement and by acoustic correlation. The expected performance of randomly selecting words is also shown.

In this experiment, we use sparse CCA to generate sparse vocabularies from full size (about 180) down to 20. This is done by generating a sequence of vocabularies of progressively smaller sizes, as described in Section 4.1. For each vocabulary, the conditional probability densities associated with these words, or word models, are trained, and the average AROC of the resulting word models is calculated on a test set of 50 songs held out separately from a training set of 450 songs from the CAL500 data set.

Figure 1 shows that the average AROC of our system improves as sparse CCA selects vocabularies of smaller size. Any increase from the leftmost point of the figure implies that the technique is removing words which would have had a low AROC (thus bringing the average down). To reiterate, it is exactly the words with low AROC that we presume to be noisy and difficult to model.

Also shown in Figure 1 are the performance results of training based on two alternative vocabulary selection techniques. One is a random baseline in which a vocabulary is created by randomly selecting words from the full vocabulary; the expected (mean) performance of this method is shown. The other technique is based on a heuristic we used in [15], in which we assigned to words a score based on the notion of human agreement and then selected a vocabulary based on high agreement scores. (Details can be found in [15].) Briefly, because we have access to more than one human annotation per song, we count how many times people agreed on the use of a particular word to describe a song. For example, more subjects tended to agree on objective instrumentation terms like "this song has a drum" than on subjective usage terms like "I would listen to this song while waking up in the morning." In this case "drum" would get a higher human agreement score than "waking up", and so we posit that it would be easier to model "drum" (it would have a higher AROC) than it would be to model "waking up". Our results show that vocabulary selection using acoustic correlation outperforms this very logical heuristic of human agreement.

It should be noted that the goal here is not to raise the performance of some arbitrary system; rather, raising the performance of this system suggests that our vocabulary selection method is removing words that are difficult to model well and is selecting words that can be accurately modeled. Also, as a practical matter, our results show that this technique could be used, as a preprocessing step, to inform which words to spend computational resources on when modeling.

5 Discussion

We have presented acoustic correlation via sparse CCA as a method by which we can automatically, and in an unsupervised fashion, discover words that are highly correlated with their audio feature representations. Our results suggest that this technique can be used to remove "noisy" words that are difficult to model accurately. This ability to filter poorly represented words also provides a means to construct a musically meaningful vocabulary prior to investing further computational resources in modeling or analyzing the semantics of music, as is done in our music annotation and retrieval system. Those interested in the application of vocabulary selection and music analysis are encouraged to view the work of Whitman and Ellis, who have previously looked at vocabulary selection by training binary classifiers (e.g., support vector machines) on a heterogeneous data set of web documents related to artists and the audio content produced by these artists [22].

References

[1] A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. Accepted for publication in SIAM Review, 2006.
[2] David R. Hardoon, Sandor Szedmak, and John Shawe-Taylor. Canonical correlation analysis: an overview with application to learning methods. Neural Computation, 2004.
[3] R. Horst and N. V. Thoai. D.c. programming: Overview. Journal of Optimization Theory and Applications, 103:1–43, 1999.
[4] Xiao Hu, J. Stephen Downie, and Andreas F. Ehmann. Exploiting recommended usage metadata: Exploratory analyses. ISMIR, 2006.
[5] I. T. Jolliffe, N. T. Trendafilov, and M. Uddin. A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12:531–547, 2003.
[6] Einat Kidron, Yoav Y. Schechner, and Michael Elad. Pixels that sound. In IEEE Computer Vision and Pattern Recognition, 2005.
[7] Beth Logan. Mel frequency cepstral coefficients for music modeling. ISMIR, 2000.
[8] Cory McKay, Daniel McEnnis, and Ichiro Fujinaga. A large publicly accessible prototype audio database for music research. ISMIR, 2006.
[9] M. F. McKinney and J. Breebaart. Features for audio and music classification. ISMIR, 2003.
[10] L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Prentice Hall, 1993.
[11] Janto Skowronek, Martin McKinney, and Steven van de Par. Ground-truth for automatic music mood classification. ISMIR, 2006.
[12] M. Slaney. Semantic-audio retrieval. IEEE ICASSP, 2002.
[13] Bharath K. Sriperumbudur, David A. Torres, and Gert R. G. Lanckriet. Sparse eigen methods by d.c. programming. In International Conference on Machine Learning, 2007.
[14] Pham Dinh Tao and Le Thi Hoai An. D.c. optimization algorithms for solving the trust region subproblem. SIAM Journal of Optimization, 8:476–505, 1998.
[15] D. Torres, D. Turnbull, L. Barrington, and G. Lanckriet. Identifying words that are musically meaningful. In ISMIR, 2007.
[16] D. Turnbull, L. Barrington, and G. Lanckriet. Modeling music and words using a multi-class naïve Bayes approach. ISMIR, 2006.
[17] D. Turnbull, L. Barrington, D. Torres, and G. Lanckriet. Towards musical query-by-semantic-description using the CAL500 data set. In SIGIR, 2007.
[18] D. Turnbull, R. Liu, L. Barrington, D. Torres, and Gert Lanckriet. UCSD CAL technical report: Using games to collect semantic information about music. Technical report, 2007.
[19] Alexei Vinokourov, John Shawe-Taylor, and Nello Cristianini. Inferring a semantic representation of text via cross-language correlation analysis. Advances in Neural Information Processing Systems, 2003.
[20] Luis von Ahn. Games with a purpose. IEEE Computer Magazine, 39(6):92–94, 2006.
[21] J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research, 2003.
[22] B. Whitman and D. Ellis. Automatic record reviews. ISMIR, 2004.
[23] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 2004.
