/
Journal of Machine Learning Research    Su bmitted  Revised  Published  Learning From Journal of Machine Learning Research    Su bmitted  Revised  Published  Learning From

Journal of Machine Learning Research Su bmitted Revised Published Learning From - PDF document

stefany-barnette
stefany-barnette . @stefany-barnette
Follow
534 views
Uploaded On 2014-11-16

Journal of Machine Learning Research Su bmitted Revised Published Learning From - PPT Presentation

Raykar VIKAS RAYKAR SIEMENS COM Shipeng Yu SHIPENG YU SIEMENS COM CAD and Knowledge Solutions IKM CKS Siemens Healthcare Malvern PA 19355 USA Linda H Zhao LZHAO WHARTON UPENN EDU Department of Statistics University of Pennsylvania Philadelphia PA 19 ID: 13236

Raykar VIKAS RAYKAR SIEMENS

Share:

Link:

Embed:

Download Presentation from below link

Download Pdf The PPT/PDF document "Journal of Machine Learning Research ..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

JournalofMachineLearningResearch11(2010)1297-1322Submitted9/09;Revised2/10;Published4/10LearningFromCrowdsVikasC.RaykarVIKAS.RAYKAR@SIEMENS.COMShipengYuSHIPENG.YU@SIEMENS.COMCADandKnowledgeSolutions(IKMCKS)SiemensHealthcareMalvern,PA19355USALindaH.ZhaoLZHAO@WHARTON.UPENN.EDUDepartmentofStatisticsUniversityofPennsylvaniaPhiladelphia,PA19104USAGerardoHermosilloValadezGERARDO.HERMOSILLOVALADEZ@SIEMENS.COMCharlesFlorinCHARLES.FLORIN@SIEMENS.COMLucaBogoniLUCA.BOGONI@SIEMENS.COMCADandKnowledgeSolutions(IKMCKS)SiemensHealthcareMalvern,PA19355USALindaMoyLINDA.MOY@NYUMC.ORGDepartmentofRadiologyNewYorkUniversitySchoolofMedicineNewYork,NY10016USAEditor:DavidBleiAbstractFormanysupervisedlearningtasksitmaybeinfeasible(orveryexpensive)toobtainobjectiveandreliablelabels.Instead,wecancollectsubjective(possiblynoisy)labelsfrommultipleexpertsorannotators.Inpractice,thereisasubstantialamountofdisagreementamongtheannotators,andhenceitisofgreatpracticalinteresttoaddressconventionalsupervisedlearningproblemsinthisscenario.Inthispaperwedescribeaprobabilisticapproachforsupervisedlearningwhenwehavemultipleannotatorsproviding(possiblynoisy)labelsbutnoabsolutegoldstandard.Theproposedalgorithmevaluatesthedifferentexpertsandalsogivesanestimateoftheactualhiddenlabels.Experimentalresultsindicatethattheproposedmethodissuperiortothecommonlyusedmajorityvotingbaseline.Keywords:multipleannotators,multipleexperts,multipleteachers,crowdsourcing1.SupervisedLearningFromMultipleAnnotators/ExpertsAtypicalsupervisedlearningscenarioconsistsofatrainingsetD=f(xi;yi)gNi=1containingNinstances,wherexi2isaninstance(typicallyad-dimensionalfeaturevector)andyi2isthecorrespondingknownlabel.Thetaskistolearnafunction!whichgeneralizeswellonunseendata.Specicallyforbinaryclassicationthesupervisionisfromtheset=f0;1g,formulti-classclassication=f1;:::;Kg,forordinalregression=f1;:::;Kg(withanordering1:::K),and=Rforregression.c\r2010VikasC.Raykar,ShipengYu,LindaH.Zhao,GerardoH.Valadez,CharlesFlorin,LucaBogoniandLindaMoy. RAYKAR,YU,ZHAO,VALADEZ,FLORIN,BOGONIANDMOYHowever,formanyreallifetasks,itmaynotbepossible,ormaybetooexpensive(ortedious)toacquiretheactuallabelyifortraining—whichwerefertoasthegoldstandardortheobjec-tivegroundtruth.Instead,wemayhavemultiple(possiblynoisy)labelsy1i;:::;yRiprovidedbyRdifferentexpertsorannotators.Inpractice,thereisasubstantialamountofdisagreementamongtheexperts,andhenceitisofgreatpracticalinteresttoaddressconventionalsupervisedlearningalgorithmsinthisscenario.Ourmotivationforthisworkcomesfromtheareaofcomputer-aideddiagnosis1(CAD),wherethetaskistobuildaclassiertopredictwhetherasuspiciousregiononamedicalimage(likeaX-ray,CTscan,orMRI)ismalignant(cancerous)orbenign.Inordertotrainsuchaclassier,asetofimagesiscollectedfromhospitals.Theactualgoldstandard(whetheritiscancerornot)canonlybeobtainedfromabiopsyofthetissue.Sinceitisanexpensive,invasive,andpotentiallydangerousprocess,oftenCADsystemsarebuiltfromlabelsassignedbymultipleradiologistswhoidentifythelocationsofmalignantlesions.Eachradiologistvisuallyexaminesthemedicalimagesandprovidesasubjective(possiblynoisy)versionofthegoldstandard.2Theradiologistalsoannotatesvariousdescriptorsofthepotentiallymalignantlesion,likethesize(aregressionproblem),shape(amulti-classclassicationproblem),andalsodegreeofmalignancy(anordinalregressionproblem).Theradiologistscomefromadiversepoolincludingluminaries,experts,residents,andnovices.Veryoftenthereislotofdisagreementamongtheannotations.Foralotoftasksthelabelsprovidedbytheannotatorsareinherentlysubjectiveandtherewillbesubstantialvariationamongdifferentannotators.Thedomainoftextclassicationofferssuchascenario.Inthiscontextthetaskistopredictthecategoryforatokenoftext.Thelabelsfortrainingareassignedbyhumanannotatorswhoreadthetextandattributetheirsubjectivecategory.Withtheadventofcrowdsourcing(Howe,2008)serviceslikeAmazon'sMechanicalTurk,3GameswithaPurpose,4andreCAPTCHA5itisquiteinexpensivetoacquirelabelsfromalargenumberofannotators(possiblythousands)inashorttime(Shengetal.,2008;Snowetal.,2008;SorokinandForsyth,2008).WebsitessuchasGalaxyZoo6allowthepublictolabelastronomicalimagesovertheinternet.Insituationslikethese,theperformanceofdifferentannotatorscanvarywidely(somemayevenbemalicious),andwithouttheactualgoldstandard,itmaynotbepossibletoevaluatetheannotators.Inthiswork,weprovideprincipledprobabilisticsolutionstothefollowingquestions:1.Howtoadaptconventionalsupervisedlearningalgorithmswhenwehavemultipleannotatorsprovidingsubjectivelabelsbutnoobjectivegoldstandard?2.Howtoevaluatesystemswhenwedonothaveabsolutegold-standard?3.Acloselyrelatedproblem—particularlyrelevantwhentherearealargenumberofannotators—istoestimatehowreliable/trustworthyiseachannotator. 1.SeeFungetal.(2009)foranoverviewofthedataminingissuesinthisarea.2.Sometimesevenabiopsycannotconrmwhetheritiscancerornotandhenceallwecanhopetogetissubjectivegroundtruth.3.MechanicalTurkfoundathttps://www.mturk.com.4.GameswithaPurposefoundathttp://www.gwap.com.5.reCAPTCHAfoundathttp://recaptcha.net/.6.GalaxyZoofoundathttp://galaxyzoo.org.1298 LEARNINGFROMCROWDS1.1TheProblemWithMajorityVotingWhenwehavemultiplelabelsacommonlyusedstrategyistousethelabelsonwhichthemajorityofthemagree(oraverageforregressionproblem)asanestimateoftheactualgoldstandard.Forbinaryclassicationproblemsthisamountstousingthemajoritylabel,7thatis,ˆyi=(1if(1=R)åRj=1yji�0:50if(1=R)åRj=1yji0:5;asanestimateofthehiddentruelabelandusethisestimatetolearnandevaluateclassiers/annotators.Anotherstrategyisthatofconsideringeverypair(instance,label)providedbyeachexpertasasep-arateexample.Notethatthisamountstousingasoftprobabilisticestimateoftheactualgroundtruthtolearntheclassier,thatis,Pr[yi=1jy1i;:::;yRi]=(1=R)Råj=1yji:Majorityvotingassumesallexpertsareequallygood.However,forexample,ifthereisonlyonetrueexpertandthemajorityarenovices,andifnovicesgivethesameincorrectlabeltoaspecicinstance,thenthemajorityvotingmethodwouldfavorthenovicessincetheyareinamajority.Onecouldaddressthisproblembyintroducingaweightcapturinghowgoodeachexpertis.Buthowwouldonemeasuretheperformanceofanexpertwhenthereisnogoldstandardavailable?1.2ProposedApproachandOrganizationToaddresstheapparentchicken-and-eggproblem,wepresentamaximum-likelihoodestimatorthatjointlylearnstheclassier/regressor,theannotatoraccuracy,andtheactualtruelabel.Foreaseofexpositionwestartwithbinaryclassicationproblemin§2.Theperformanceofeachannotatorismeasuredintermsofthesensitivityandspecicitywithrespecttotheunknowngoldstandard(§2.1).Theproposedalgorithmautomaticallydiscoversthebestexpertsandassignsahigherweighttothem.Inordertoincorporatepriorknowledgeabouteachannotator,weimposeabetaprioronthesensitivityandspecicityandderivethemaximum-a-posterioriestimate(§2.6).ThenalestimationisperformedbyanExpectationMaximization(EM)algorithmthatiterativelyestablishesaparticulargoldstandard,measurestheperformanceoftheexpertsgiventhatgoldstandard,andrenesthegoldstandardbasedontheperformancemeasures.Whiletheproposedapproachisdescribedusinglogisticregressionasthebaseclassier(§2.2),itisquitegeneral,andcanbeusedwithanyblack-boxclassier(§2.7),andcanalsohandlemissinglabels(thatis,eachexpertisnotrequiredtolabelalltheinstances).Furthermore,weextendtheproposedalgorithmtohandlecategorical(§3),ordinal(§4),andregressionproblems(§5).In§6sectionweextensivelyvalidateourapproachusingbothsimulateddataandrealdatafromdifferentdomains.1.3RelatedWorkandNovelContributionsWerstsummarizethenovelcontributionsofthisworkincontextofotherrelatedworkinthisemergingnewarea.Therehasbeenalonglineofworkinthebiostatisticsandepidemiologylitera-tureonlatentvariablemodelswherethetaskistogetanestimateoftheobservererrorratesbased 7.Whenthereisnoclearmajorityamongthemultipleexperts(thatis,ˆyi=0:5)inCADdomainthenaldecisionisoftenmadebyanadjudicatororasuper-expert.Whenthereisnoadjudicatorafaircointossisused.1299 RAYKAR,YU,ZHAO,VALADEZ,FLORIN,BOGONIANDMOYontheresultsfrommultiplediagnostictestswithoutagoldstandard(seeDawidandSkene,1979,HuiandWalter,1980,HuiandZhou,1998,AlbertandDodd,2004andreferencestherein).InthemachinelearningcommunitySmythetal.(1995)rstaddressedthesameprobleminthecontextoflabelingvolcanoesinsatelliteimagesofVenus.Wedifferfromthispreviousbodyofworkinthefollowingaspects:1.UnlikeDawidandSkene(1979)andSmythetal.(1995)whichjustfocusedonestimatingthegroundtruthfrommultiplenoisylabels,wespecicallyaddresstheissueoflearningaclassier.Estimatingthegroundtruthandtheannotator/classiferperformanceisabyproductofourproposedalgorithm.2.InordertolearnaclassierSmyth(1995)proposedtorstestimatethegroundtruth(withoutusingthefeatures)andthenusetheprobabilisticgroundtruthtolearnaclassier.Incontrast,ourproposedalgorithmlearnstheclassierandthegroundtruthjointly.Ourexperiments(§6.1.1)showthattheclassierlearntandgroundtruthobtainedbytheproposedalgorithmissuperiortothatobtainedbyotherprocedureswhichrstestimatesthegroundtruthandthenlearnstheclassier.3.Oursolutionismoregeneralandcanbeeasilyextendedtocategorical(§3),ordinal(§4),andcontinuousdata(§5).Itcanalsobeusedinconjunctionwithanysupervisedlearningalgorithm.Apreliminaryversionofthispaper(Raykaretal.,2009)mainlydiscussedthebinaryclassicationproblem.4.OurproposedalgorithmisalsoBayesian—weimposeapriorontheexperts.Thepriorscanpotentialcapturetheskillofdifferentannotators.InthispaperwerefrainfromdoingafullBayesianinferenceandusethemodeoftheposteriorasapointestimate.ArecentcompleteBayesiangeneralizationofthesekindofmodelshasbeendevelopedbyCarpenter(2008).5.TheEMapproachusedinthispaperissimilartothatproposedbyJinandGhahramani(2003).Howevertheirmotivationissomewhatdifferent.Intheirsetting,eachtrainingexampleisannotatedwithasetofpossiblelabels,onlyoneofwhichiscorrect.Therehasbeenrecentinterestinthenaturallanguageprocessing(Shengetal.,2008;Snowetal.,2008)andcomputervision(SorokinandForsyth,2008)communitieswheretheyuseAmazon'sMechanicalTurktocollectannotationsfrommanypeople.Theyshowthatitcanbepotentiallyasgoodasthatprovidedbyanexpert.Shengetal.(2008)analyzedwhenitisworthwhiletoacquirenewlabelsforsomeofthetrainingexamples.Thereisalsosometheoreticalwork(seeLugosi,1992andDekelandShamir,2009a)dealingwithmultipleexperts.RecentlyDekelandShamir(2009b)presentedanalgorithmwhichdoesnotresorttorepeatedlabeling,thatis,eachexampledoesnothavetobelabeledbymultipleteachers.Donmezetal.(2009)addresstheissueofactivelearninginthisscenario—Howtojointlylearntheaccuracyoflabelingsourcesandobtainthemostinformativelabelsfortheactivelearningtask?Therehasalsobeensomeworkinthemedicalimagingcommunity(Wareldetal.,2004;Cholletietal.,2008).2.BinaryClassicationWerstdescribeourproposednoisemodelfortheannotators.Theperformanceofeachannotatorismeasuredintermsofthesensitivityandspecicitywithrespecttotheunknowngoldstandard.1300 LEARNINGFROMCROWDS2.1ATwo-coinModelforAnnotatorsLetyj2f0;1gbethelabelassignedtotheinstancexbythethannotator/expert.Letybetheactual(unobserved)labelforthisinstance.Eachannotatorprovidesaversionofthishiddentruelabelbasedontwobiasedcoins.Ifthetruelabelisone,sheipsacoinwithbiasaj(sensitivity).Ifthetruelabeliszero,sheipsacoinwithbiasbj(specicity).Ineachcase,ifshegetsheadsshekeepstheoriginallabel,otherwisesheipsthelabel.Ifthetruelabelisone,thesensitivity(truepositiverate)forthethannotatorisdenedastheprobabilitythatshelabelsitasone.aj=Pr[yj=1jy=1]:(1)Ontheotherhand,ifthetruelabeliszero,thespecicity(1falsepositiverate)isdenedastheprobabilitythatshelabelsitaszero.bj=Pr[yj=0jy=0]:(2)Theassumptionintroducedisthatajandbjdonotdependontheinstancex.Forexample,intheCADdomain,thismeansthattheradiologist'sperformanceisconsistentacrossdifferentsub-groupsofdata.82.2ClassicationModelWhiletheproposedmethodcanbeusedforanyclassier,foreaseofexposition,weconsiderthefamilyoflineardiscriminatingfunctions:F=fwg,whereforanyx;w2Rdw(x)=w�xThenalclassiercanbewritteninthefollowingform:ˆy=1ifw�xgand0otherwise.Thethresholdgdeterminestheoperatingpointoftheclassier.TheReceiverOperatingCharacteristic(ROC)curveisobtainedasgissweptfrom¥to¥.Theprobabilityforthepositiveclassismodeledasalogisticsigmoidactingonw,thatis,Pr[y=1jx;w]=s(w�x);wherethelogisticsigmoidfunctionisdenedass(z)=1=(1+ez).Thisclassicationmodelisknownaslogisticregression2.3Estimation/LearningProblemGiventhetrainingdataDconsistingofNinstanceswithannotationsfromRannotators,thatis,D=fxi;y1i;:::;yRigNi=1,thetaskistoestimatetheweightvectorwandalsothesensitivity=[a1;:::;aR]andthespecicity =[b1;:::;bR]oftheRannotators.Itisalsoofinteresttogetanestimateoftheunknowngoldstandardy1;:::;yN2.4MaximumLikelihoodEstimatorAssumingthetraininginstancesareindependentlysampled,thelikelihoodfunctionoftheparame-tersq=fw;; ggiventheobservationsDcanbefactoredasPr[Djq]=NÕi=1Pr[y1i;:::;yRijxi;q]: 8.Whilethisisareasonableassumption,itisnotentirelytrue.Itisknownthatsomeradiologistsaregoodatdetectingcertainkindsofmalignantlesionsbasedontheirtrainingandexperience.1301 RAYKAR,YU,ZHAO,VALADEZ,FLORIN,BOGONIANDMOYConditioningonthetruelabelyi,andalsousingtheassumptionyjiisconditionallyindependent(ofeverythingelse)givenajbjandyi,thelikelihoodcanbedecomposedasPr[Djq]=NÕi=1Pr[y1i;:::;yRijyi=1;]Pr[yi=1jxi;w]+Pr[y1i;:::;yRijyi=0; ]Pr[yi=0jxi;w] :Giventhetruelabelyi,weassumethaty1i;:::;yRiareindependent,thatis,theannotatorsmaketheirdecisionsindependently.9Hence,Pr[y1i;:::;yRijyi=1;]=RÕj=1Pr[yjijyi=1;aj]=RÕj=1[aj]yji[1aj]1yji:Similarly,wehavePr[y1i;:::;yRijyi=0; ]=RÕj=1[bj]1yji[1bj]yji:HencethelikelihoodcanbewrittenasPr[Djq]=NÕi=1aipi+bi(1pi);wherewehavedenedpi=s(w�xi):ai=RÕj=1[aj]yji[1aj]1yji:bi=RÕj=1[bj]1yji[1bj]yji:Themaximum-likelihoodestimatorisfoundbymaximizingthelog-likelihood,thatis,ˆqML=fˆ;ˆ ;ˆwg=argmaxqflnPr[Djq]g:2.5TheEMAlgorithmThismaximizationproblemcanbesimpliedalotifweusetheExpectation-Maximization(EM)algorithm(Dempsteretal.,1977).TheEMalgorithmisanefcientiterativeproceduretocomputethemaximum-likelihoodsolutioninpresenceofmissing/hiddendata.Wewillusetheunknownhiddentruelabelyiasthemissingdata.Ifweknowthemissingdatay=[y1;:::;yN]thenthecompletelikelihoodcanbewrittenaslnPr[D;yjq]=Nåi=1yilnpiai+(1yi)ln(1pi)bi: 9.Thisassumptionisnottrueingeneralandthereissomecorrelationsamongthelabelsassignedbymultipleannotators.ForexampleintheCADdomainifthecancerisinadvancedstage(whichisveryeasytodetect)almostalltheradiologistsassignthesamelabel.1302 LEARNINGFROMCROWDSEachiterationoftheEMalgorithmconsistsoftwosteps:anExpectation(E)-stepandaMaximization(M)-step.TheM-stepinvolvesmaximizationofalowerboundonthelog-likelihoodthatisrenedineachiterationbytheE-step.1.E-step.GiventheobservationDandthecurrentestimateofthemodelparametersq,theconditionalexpectation(whichisalowerboundonthetruelikelihood)iscomputedasEflnPr[D;yjq]g=Nåi=1µilnpiai+(1µi)ln(1pi)bi;(3)wheretheexpectationiswithrespecttoPr[yjD;q],andµi=Pr[yi=1jy1i;:::;yRi;xi;q].UsingBayes'theoremwecancomputeµiµPr[y1i;:::;yRijyi=1;q]Pr[yi=1jxi;q]=aipi aipi+bi(1pi):2.M-step.BasedonthecurrentestimateµiandtheobservationsD,themodelparametersqarethenestimatedbymaximizingtheconditionalexpectation.Byequatingthegradientof(3)tozeroweobtainthefollowingestimatesforthesensitivityandspecicity:aj=åNi=1µiyji åNi=1µi;bj=åNi=1(1µi)(1yji) åNi=1(1µi):Duetothenon-linearityofthesigmoid,wedonothaveaclosedformsolutionforwandwehavetousegradientascentbasedoptimizationmethods.WeusetheNewton-Raphsonupdategivenbywt+1=wthH1g,wheregisthegradientvector,HistheHessianmatrix,andhisthesteplength.Thegradientvectorisgivenbyg(w)=Nåi=1hµis(w�xi)ixi:TheHessianmatrixisgivenbyH(w)=Nåi=1hs(w�xi)ih1s(w�xi)ixix�i:Essentially,weareestimatingalogisticregressionmodelwithprobabilisticlabelsµiThesetwosteps(theE-andtheM-step)canbeiteratedtillconvergence.Thelog-likelihoodin-creasesmonotonicallyaftereveryiteration,whichinpracticeimpliesconvergencetoalocalmaxi-mum.TheEMalgorithmisonlyguaranteedtoconvergetoalocalmaximum.Inpracticemultiplerestartswithdifferentinitializationscanpotentiallymitigatethelocalmaximumproblem.Inthispaperweusemajorityvotingµi=1=RåRj=1yjiastheinitializationforµitostarttheEM-algorithm.1303 RAYKAR,YU,ZHAO,VALADEZ,FLORIN,BOGONIANDMOY2.6ABayesianApproachInsomeapplicationswemaywanttotrustaparticularexpertmorethantheothers.Onewaytoachievethisisbyimposingpriorsonthesensitivityandspecicityoftheexperts.Sinceajandbjrepresenttheprobabilityofabinaryevent,anaturalchoiceofprioristhebetaprior.Thebetapriorisalsoconjugatetothebinomialdistribution.Foranya�0,b�0,andd2[0;1]thebetadistributionisgivenbyBeta(dja;b)=da1(1d)b1 B(a;b);whereB(a;b)=R10da1(1d)b1ddisthebetafunction.Weassumeabetaprior10forboththesensitivityandthespecicityasPr[ajjaj1;aj2]=Beta(ajjaj1;aj2):Pr[bjjbj1;bj2]=Beta(bjjbj1;bj2):ForsakeofcompletenesswealsoassumeazeromeanGaussianpriorontheweightswwithin-versecovariancematrix,thatis,Pr[w]=N(wj0;1).Assumingthatfajgfbjg,andwhaveindependentpriors,themaximum-a-posteriori(MAP)estimatorisfoundbymaximizingthelog-posterior,thatis,ˆqMAP=argmaxqflnPr[Djq]+lnPr[q]g:AnEMalgorithmcanbederivedinasimilarfashionforMAPestimationbyrelyingontheinter-pretationofNealandHinton(1998).Thenalalgorithmissummarizedbelow1.Initializeµi=(1=R)åRj=1yjibasedonmajorityvoting.2.Givenµi,estimatethesensitivityandspecicityofeachannotator/expertasfollows.aj=aj11+åNi=1µiyji aj1+aj22+åNi=1µi:bj=bj11+åNi=1(1µi)(1yji) bj1+bj22+åNi=1(1µi):(4)TheNewton-Raphsonupdateforoptimizingwisgivenbywt+1=wthH1g,withsteplengthh,gradientvectorg(w)=Nåi=1hµis(w�xi)ixiw;andHessianmatrixH(w)=Nåi=1s(w�xi)h1s(w�xi)ixix�i: 10.Itmaybeconvenienttospecifyapriorintermsofthemeanµandvariances2.Themeanandthevarianceforabetaprioraregivenbyµ=a=(a+b)ands2=ab=((a+b)2(a+b+1)).Solvingforaandbwegeta=(µ3+µ2µs2)=s2andb=a(1µ)=µ.1304 LEARNINGFROMCROWDS3.Giventhesensitivityandspecicityofeachannotatorandthemodelparameters,updateµiasµi=aipi aipi+bi(1pi);(5)wherepi=s(w�xi):ai=RÕj=1[aj]yji[1aj]1yji:bi=RÕj=1[bj]1yji[1bj]yji:(6)Iterate(2)and(3)tillconvergence.2.7Discussions1.EstimateofthegoldstandardThevalueoftheposteriorprobabilityµiisasoftprobabilis-ticestimateoftheactualgroundtruthyi,thatis,µi=Pr[yi=1jy1i;:::;yRi;xi;q].Theactualhiddenlabelyicanbeestimatedbyapplyingathresholdonµi,thatis,yi=1ifµigandzerootherwise.Wecanuseg=0:5asthethreshold.Byvaryinggwecanchangethemisclas-sicationcostsandobtainagroundtruthwithlargesensitivityorlargespecicity.BecauseofthisinourexperimentalvalidationwecanactuallydrawanROCcurvefortheestimatedgroundtruth.2.Log-oddsofAparticularlyrevealinginsightcanbeobtainedintermsofthelog-oddsorthelogitoftheposteriorprobabilityµi.From(5)thelogitofµicanbewrittenaslogit(µi)=lnµi 1µi=lnPr[yi=1jy1i;:::;yRi;xi;q] Pr[yi=0jy1i;:::;yRi;xi;q]=w�xi+c+Råj=1yji[logit(aj)+logit(bj)]:wherec=åRj=1log1aj bjisaconstanttermwhichdoesnotdependon.Thisindicatesthattheestimatedgroundtruth(inthelogitformoftheposteriorprobability)isaweightedlinearcombinationofthelabelsfromalltheexperts.Theweightofeachexpertisthesumofthelogitofthesensitivityandspecicity.3.UsinganyotherclassierForeaseofexpositionweusedlogisticregression.However,theproposedalgorithmcanbeusedwithanygeneralizedlinearmodelorinfactwithanyclassierthatcanbetrainedwithsoftprobabilisticlabels.IneachstepoftheEM-algorithm,theclassieristrainedwithinstancessampledfromµi.Thismodicationiseasyformostprobabilisticclassiers.Forgeneralblack-boxclassierswherewecannottweakthetrainingalgorithmanalternateapproachistoreplicatethetrainingexamplesaccordingtothesoftlabel.Forexampleaprobabilisticlabelµi=0:8canbeeffectivelysimulatedbyadding8trainingexampleswithdeterministiclabel1and2exampleswithlabel0.1305 RAYKAR,YU,ZHAO,VALADEZ,FLORIN,BOGONIANDMOY4.ObtaininggroundtruthwithnofeaturesInsomescenarioswemaynothavefeaturesxiandwewishtoobtainanestimateoftheactualgroundtruthbasedonlyonthelabelsfrommultipleannotators.Hereinsteadoflearningaclassierweestimatepwhichistheprevalenceofthepositiveclass,thatis,p=Pr[yi=1].Wefurtherassumeabetapriorfortheprevalence,thatis,Beta(pjp1;p2).Thealgorithmsimpliesasfollows.(a)Initializeµi=(1=R)åRj=1yjibasedonmajorityvoting.(b)Givenµi,estimatethesensitivityandspecicityofeachannotatorusing(4).Thepreva-lenceofthepositiveclassisestimatedasfollows.p=p11+åNi=1µi p1+p22+N:(c)Giventhesensitivityandspecicityofeachannotatorandprevalence,reneµiasfol-lows.µi=aip aip+bi(1p):Iterate(2)and(3)tillconvergence.ThisalgorithmissimilartotheoneproposedbyDawidandSkene(1979)andSmythetal.(1995).5.HandlingmissinglabelsTheproposedapproachcaneasilyhandlemissinglabels,thatis,whenthelabelsfromsomeexpertsaremissingforsomeinstances.LetRibethenumberofradiologistslabelingthethinstance,andletNjbethenumberofinstanceslabeledbythethradiologist.ThenintheEMalgorithm,wejustneedtoreplaceNbyNjforestimatingthesensitivityandspecicityin(4),andreplaceRbyRiforupdatingµiin(6).6.EvaluatingaclassierWecanusetheprobabilityscoresµidirectlytoevaluateclassiers.Ifziarethelabelsobtainedfromanyotherclassier,thensensitivityandspecicitycanbeestimatedasa=åNi=1µizi åNi=1µi;b=åNi=1(1µi)(1zi) åNi=1(1µi):7.PosteriorapproximationAttheendofeachEMiterationacrudeapproximationtotheposteriorisobtainedasajBeta ajjaj1+Nåi=1µiyji;aj2+Nåi=1µi(1yji)!;bjBeta bjjbj1+Nåi=1(1µi)(1yji);bj2+Nåi=1(1µi)yji!:3.Multi-classClassicationInthissectionwedescribehowtheproposedapproachforbinaryclassicationcanbeextendedtocategoricaldata.SupposethereareK2categories.AnexampleforcategoricaldatafromtheCADdomainisinLungCAD,wheretheradiologistneedstolabelwhetheranodule(knowntobeprecursorsofcancer)isasolid,apart-solid,oragroundglassopacity—whicharethree1306 LEARNINGFROMCROWDSdifferentkindsonnodules.Wecanextendthepreviousmodelandintroduceavectorofmultinomialparametersjc=(ajc1;:::;ajcK)foreachannotator,whereajck=Pr[yj=kjy=c]andåKk=1ajck=1.Hereajckdenotestheprobabilitythattheannotatorassignsclassktoaninstancegiventhetrueclassisc.WhenK=2,aj11andaj00aresensitivityandspecicity,respectively.AsimilarEMalgorithmcanbederived.IntheE-step,weestimatePr[yi=cj;]µPr[yi=cjxi]RÕj=1KÕk=1(ajck)d(yjik);whered(u;v)=1ifu=vand0otherwiseandintheM-stepwelearnamulti-classclassierandupdatethemultinomialparameterasajck=åNi=1Pr[yi=cj;]d(yji;k) åNi=1Pr[yi=cj;]:OnecanalsoassignaDirichletpriorforthemultinomialparameters,andthisresultsinasmoothingtermintheaboveupdatesintheMAPestimate.4.OrdinalRegressionWenowconsiderthesituationwheretheoutputsarecategoricalandhaveanorderingamongthelabels.IntheCADdomaintheradiologistoftengivesascore(forexample,1to5fromlowesttohighest)toindicatehowlikelyshethinksitismalignant.Thisisdifferentfromamulti-classsettinginwhichwedonothaveanypreferenceamongthemultipleclasslabels.Letyji2f1;:::;Kgbethelabelassignedtothethinstancebythethexpert.Notethatthereisanorderinginthelabels1:::K.Asimpleapproachistoconverttheordinaldataintoaseriesofbinarydata(FrankandHall,2001).SpecicallytheKclassordinallabelsaretransformedintoK1binaryclasslabelsasfollows:yjci=1ifyji&#x-0.1;隔c0otherwisec=1;:::;K1:ApplyingthesameprocedureusedforbinarylabelswecanestimatePr[yi&#x-0.1;隔c]forc=1;:::;K1.TheprobabilityoftheactualclassvaluescanthenbeobtainedasPr[yi=c]=Pr[yi&#x-0.1;隔c1andyic]=Pr[yi&#x-0.1;隔c1]Pr[yi&#x-0.1;隔c]:Theclasswiththemaximumprobabilityisassignedtotheinstance.5.RegressionInthissectionwedevelopasimilaralgorithmtolearnaregressionfunctionusingannotationsfrommultipleexperts.IntheCADdomainasapartoftheannotationprocessacommontaskforaradiologististomeasurethediameterofasuspiciouslesion.1307 RAYKAR,YU,ZHAO,VALADEZ,FLORIN,BOGONIANDMOY5.1ModelforAnnotatorsLetyji2Rbethecontinuoustargetvalueassignedtothethinstancebythethannotator.Ourmodelisthattheannotatorprovidesanoisyversionoftheactualtruevalueyi.ForthethannotatorwewillassumeaGaussiannoisemodelwithmeanyi(thetrueunknownvalue)andinverse-variance(precision)tj,thatis,Pr[yjijyi;tj]=N(yjijyi;1=tj);(7)wheretheGaussiandistributionisdenedasN(zjm;s2)=(2ps2)1=2exp((zm)2=2s2).Theunknowninverse-variancetjmeasurestheaccuracyofeachannotator—thelargerthevalueoftjthemoreaccuratetheannotator.Wehaveassumedthattjdoesnotdependontheinstancexi.Forexample,intheCADdomain,thismeansthattheradiologist'saccuracydoesnotdependonthenodulesheismeasuring.Whilethisapracticalassumption,itisnotentirelytrue.Itisknownthatsomenodulesarehardertomeasurethanothers.5.2LinearRegressionModelforFeaturesAsbeforeweconsiderthefamilyoflinearregressionfunctions:F=fwg,whereforanyx;w2Rdw(x)=w�x.WeassumethattheactualtargetresponseyiisgivenbythedeterministicregressionfunctionwwithadditiveGaussiannoise,thatis,yi=w�xi+e;whereeisazero-meanGaussianrandomvariablewithinverse-variance(precision)g.HencePr[yijxi;w;g]=N(yijw�xi;1=g):(8)5.3CombinedModelCombiningboththeannotator(7)andtheregressor(8)modelwehavePr[yjijxi;w;tj;g]=ZPr[yjijyi;tj]Pr[yijxi;w;g]dyi=N(yjijw�xi;1=g+1=tj):Sincethetwoprecisionterms(gandtj)aregroupedtogethertheyarenotuniquelyidentiable.Hencewewilldeneanewprecisiontermljas1 lj=1 g+1 tj:SowehavethefollowingmodelPr[yjijxi;w;lj]=N(yjijw�xi;1=lj):(9)5.4Estimation/LearningProblemGiventhetrainingdataDconsistingofNinstanceswithannotationsfromRexperts,thatis,D=fxi;y1i;:::;yRigNi=1,thetaskistoestimatetheweightvectorwandtheprecision=[l1;:::;lR]ofalltheannotators.1308 LEARNINGFROMCROWDS5.5Maximum-likelihoodEstimatorAssumingtheinstancesareindependentthelikelihoodoftheparametersq=fw;ggiventheobservationsDcanbefactoredasPr[Djq]=NÕi=1Pr[y1i;:::;yRijxi;q]:Conditionalontheinstancexiweassumethaty1i;:::;yRiareindependent,thatis,theannotatorsprovidetheirresponsesindependently.Hencefrom(9)thelikelihoodcanbewrittenasPr[Djq]=NÕi=1RÕj=1N(yjijw�xi;1=lj):Themaximum-likelihoodestimatorisfoundbymaximizingthelog-likelihoodbqML=fb;bwg=argmaxqflnPr[Djq]g:Byequatingthegradientofthelog-likelihoodtozeroweobtainthefollowingupdateequationsfortheprecisionandtheweightvector.1 blj=1 NNåi=1yjibw�xi2:(10)bw= Nåi=1xix�i!1Nåi=1xi åRj=1bljyji åRj=1blj!:(11)Astheparametersbwandbarecoupledtogetherweiteratethesetwostepstillconvergence.5.6Discussions1.Isthisstandardleast-squares?DenethedesignmatrixX=[x1;:::;xN]�andthere-sponsevectorforeachannotatorasyj=[yj1;:::;yjN]�.UsingmatrixnotationEquation11canbewrittenasbw=(X�X)1X�bywhereby=åRj=1bljyj åRj=1blj:(12)Equation12isessentiallythesolutiontoastandardlinearregressionmodel,exceptthatwearetrainingalinearregressionmodelwithbyasthegroundtruth,whichisaprecisionweightedmeanoftheresponsevectorsfromalltheannotators.Thevarianceofeachannotatorisestimatedusing(10).Thenalalgorithmiterativelyestablishesaparticulargoldstandard(by),measurestheperformanceoftheannotatorsandlearnsaregressorgiventhatgoldstandard,andrenesthegoldstandardbasedontheperformancemeasures.2.Arewebetterthanthebestannotator?Ifweassumebisxed(i.e.,weignorethevari-abilityandassumethatitiswellestimated)thenˆwisanunbiasedestimatorofwandthecovariancematrixisgivenbyCov(ˆw)=Cov(by)X�X1=1 åRj=1bljX�X1:1309 RAYKAR,YU,ZHAO,VALADEZ,FLORIN,BOGONIANDMOYSinceåRj=1blj�maxj(blj)theproposedmethodhasalowervariancethantheregressorlearntwiththebestannotator(i.e.,theonewiththeminimumvariance).3.Arewebetterthantheaverage?ForaxedXtheerrorinbwdependsonlyonthevarianceofbyj.Ifweknowthetrueljthenbyiisthebestlinearunbiasedestimatorforyiwhichmini-mizesthevariance.Toseethisconsideranylinearestimatoroftheformbyi=åjaj(yjibj)ThevarianceisgivenbyVar[byi]=åj(aj)2=lj.SinceE[byi]=yiåjaj,forthebiasofthises-timatortobezerowerequirethatåjaj=1.Solvingtheconstrainedminimizationproblemweseethataj=lj=åjljminimizesthevariance.4.ObtainingaconsensuswithoutfeaturesWhennofeaturesareavailablethesamealgorithmcanbesimpliedtogetaconsensusestimateoftheactualgroundtruthandalsoevaluatetheannotators.Essentiallywehavetoiteratethefollowingtwoupdatestillconvergencebyi=åRj=1bljyji åRj=1blj1 blj=1 NNåi=1yjibyi2:6.ExperimentalValidationWenowexperimentallyvalidatetheproposedalgorithmsonbothsimulatedandrealdata.6.1ClassicationExperimentsWeusetwoCADandonetextdatasetinourexperiments.TheCADdatasetsincludeadigitalmammographydatasetandabreastMRIdataset,bothofwhicharebiopsyproven,thatis,thegoldstandardisavailable.Forthedigitalmammographydatasetwesimulatetheradiologistsinordertovalidateourmethods.ThebreastMRIdatahasannotationsfromfourradiologists.WealsoreportresultsonaRecognizingTextualEntailmentdatacollectedbySnowetal.(2008)usingtheAmazon'sMechanicalTurkwhichhasannotationsfrom164annotators.6.1.1DIGITALMAMMOGRAPHYWITHSIMULATEDRADIOLOGISTSMammogramsareusedasascreeningtooltodetectearlybreastcancer.CADsystemssearchforabnormalareas(lesions)inadigitizedmammographicimage.Theselesionsgenerallyindicatethepresenceofmalignantcancer.TheCADsystemthenhighlightstheseareasontheimages,alertingtheradiologisttotheneedforafurtherdiagnosticmammogramorabiopsy.Inclassicationterms,givenasetofdescriptivemorphologicalfeaturesforaregiononaimage,thetaskistopredictwhetheritispotentiallymalignant(1)ornot(0).Inordertotrainsuchaclassier,asetofmammogramsiscollectedfromhospitals.Thegroundtruth(whetheritiscancerornot)isobtainedfrombiopsy.Sincebiopsyisanexpensive,tedious,andaninvasiveprocess,veryoftenCADsystemsarebuiltfromlabelscollectedfrommultipleexpertradiologistswhovisuallyexaminethemammogramsandmarkthelesionlocations—thisconstitutesourgroundtruth(multiplelabels)forlearning.Inthisexperimentweuseaproprietarybiopsy-provendataset(Krishnapurametal.,2008)containing497positiveand1618negativeexamples.Eachinstanceisdescribedbyasetof27mor-phologicalfeatures.Inordertovalidateourproposedalgorithm,wesimulatemultipleradiologistsaccordingtothetwo-coinmodeldescribedin§2.1.Basedonthelabelsfrommultipleradiologists,1310 LEARNINGFROMCROWDSwecansimultaneously(1)learnalogistic-regressionclassier,(2)estimatethesensitivityandspeci-cityofeachradiologist,and(3)estimatethegoldengroundtruth.Wecomparetheresultswiththeclassiertrainedusingthebiopsyprovedgroundtruthaswellasthemajority-votingbaseline.Fortherstsetofexperimentsweuse5radiologistswithsensitivity=[0:900:800:570:600:55]andspecicity =[0:950:850:620:650:58].Thiscorrespondstoascenariowherethersttworadiologistsareexpertsandthelastthreearenovices.Figure1summarizestheresults.Wecompareonthreedifferentaspects:(1)Howgoodisthelearntclassier?(2)Howwellcanweestimatethesensitivityandspecicityofeachradiologist?(3)Howgoodistheestimatedgroundtruth?Thefollowingobservationscanbemade.1.ClassierperformanceFigure1(a)plotstheROCcurveofthelearntclassieronthetrainingset.Thedotted(black)lineistheROCcurvefortheclassierlearntusingtheactualgroundtruth.Thesolid(red)lineistheROCcurvefortheproposedalgorithmandthedashed(blue)lineisfortheclassierlearntusingthemajority-votingscheme.Theclassierlearntusingtheproposedmethodisasgoodastheonelearntusingthegoldengroundtruth.TheareaundertheROCcurve(AUC)fortheproposedalgorithmisaround3:5%greaterthanthatlearntusingthemajority-votingscheme.2.RadiologistperformanceTheactualsensitivityandspecicityofeachradiologistismarkedasablackinFigure1(b).Theendofthesolidredlineshowstheestimatesofthesensitivityandspecicityfromtheproposedmethod.Weusedauniformprioronalltheparameters.Theellipseplotsthecontourofonestandarddeviationasobtainedfromthebetaposteriorestimates.Theendofthedashedbluelineshowstheestimateobtainedfromthemajority-votingalgorithm.Weseethattheproposedmethodismuchclosertotheactualvaluesofsensitivityandspecicity.3.ActualgroundtruthSincetheestimatesoftheactualgroundtruthareprobabilisticscores,wecanalsoplottheROCcurvesoftheestimatedgroundtruth.FromFigure1(b)wecanseethattheROCcurvefortheproposedmethoddominatesthemajorityvotingROCcurve.Furthermore,theareaundertheROCcurve(AUC)isaround3%higher.Theestimateob-tainedbymajorityvotingisclosertothenovicessincetheyformamajority(3/5).Itdoesnothaveanideaofwhoisanexpertandwhoisanovice.Theproposedalgorithmappropriatelyweightseachradiologistbasedontheirestimatedsensitivityandspecicity.Theimprove-mentobtainedisquitelargeinFigure2whichcorrespondsasituationwherewehaveonlyoneexpertand7novices.4.JointEstimationTolearnaclassier,Smythetal.(1995)proposedtorstestimatethegoldengroundtruthandthenusetheprobabilisticgroundtruthtolearnaclassier.Incontrast,ourproposedalgorithmlearnstheclassierandthegroundtruthjointlyasapartoftheEMalgorithm.Figure3showsthattheclassierandthegroundtruthlearntobtainedbytheproposedalgorithmissuperiorthanthatobtainedbyotherprocedureswhichrstestimatesthegroundtruthandthenlearnstheclassier.6.1.2BREASTMRIInthisexample,eachradiologistreviewsthebreastMRIdataandassessesthemalignancyofeachlesiononaBIRADSscaleof1to5.TheBIRADSscaleisdenedasfollows:1Negative,2Benign,1311 RAYKAR,YU,ZHAO,VALADEZ,FLORIN,BOGONIANDMOY MajorityVotingTrue1True2True3True4True5 Estimated1x0.02170x0.0000Estimated2x0.58690x0.1785Estimated3x0.23910x0.1071Estimated4x0.15211x0.2500Estimated5x0.00000x0.4642 EMalgorithmTrue1True2True3True4True5 Estimated1x0.00000x0.0000Estimated2x0.69570x0.1428Estimated3x0.13040x0.0000Estimated4x0.17391x0.3214Estimated5x0.00000x0.5357 Table1:TheconfusionmatrixfortheestimateobtainedusingmajorityvotingandtheproposedEMalgorithm.Thexindicatesthattherewasnosuchcategoryinthetruelabels(thegoldstandard).Thegold-standardisobtainedbythebiopsywhichcanconrmwhetheritisbenign(BIRADS=2)ormalignant(BIRADS=5).3ProbablyBenign,4Suspiciousabnormality,and5Highlysuggestiveofmalignancy.Ourdatasetcomprisesof75lesionswithannotationsfromfourradiologists,andthetruelabelsfrombiopsy.Basedoneightmorphologicalfeatures,wehavetopredictwhetheralesionismalignantornot.FortherstexperimentwereducetheBIRADSscaletoabinaryone:anylesionwithaBIRADS�3isconsideredmalignantandbenignotherwise.Thesetincluded28malignantand47benignlesions.Figure4summarizestheresults.Weshowtheleave-one-outcrossvalidatedROCfortheclassier.Thecross-validatedAUCoftheproposedmethodisapproximately6%betterthanthemajorityvotingbaseline.WealsoconsidertheBIRADSlabelsasasetofordinalmeasurementssincethereisanorder-ingamongtheBIRADSlabel.TheconfusionmatrixinTable1showsthattheEMalgorithmissignicantlysuperiorthanthemajorityvotinginestimatingthetrueBIRADS.6.1.3RECOGNIZINGTEXTUALENTAILMENTFinallywereportresultsonRecognizingTextualEntailmentdatacollectedbySnowetal.(2008)usingtheAmazon'sMechanicalTurk.Inthistask,theannotatorispresentedwithtwosentencesandgivenachoiceofwhetherthesecondsentencecanbeinferredfromtherst.Thedatahas800tasksand164distinctreaders,with10annotationspertaskalongwiththegoldengroundtruth.Themajorityoftheentries(94%)inthe800x164matrixaremissing.Thereisoneannotatorwhohaslabeledallthetasks.Weusethisdatasettoobtainanestimateoftheactualgroundtruth.Figure5plotstheaccuracyoftheestimatedgroundtruthasafunctionofthenumberofannotators.TheproposedEMalgorithmachievesahigheraccuracythanmajorityvoting.Inotherwordstoachieveadesiredaccuracytheproposedalgorithmneedsfewerannotatorsthanthemajorityvotingscheme.1312 LEARNINGFROMCROWDS 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 False Positive Rate (1-specifcity)True Positive Rate (sensitivity)ROC Curve for the classifier Golden ground truth AUC=0.915 Proposed EM algorithm AUC=0.913 Majority voting baseline AUC=0.882 (a) 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 False Positive Rate (1-specifcity)True Positive Rate (sensitivity)ROC Curve for the estimated true labels Proposed EM algorithm AUC=0.991 Majority voting baseline AUC=0.962 (b)Figure1:Resultsforthedigitalmammographydatasetwithannotationsfrom5simulatedradiol-ogists.(a)TheROCcurveofthelearntclassierusingthegoldengroundtruth(dottedblackline),themajorityvotingscheme(dashedblueline),andtheproposedEMalgo-rithm(solidredline).(b)TheROCcurvefortheestimatedgroundtruth.Theactualsensitivityandspecicityofeachoftheradiologistsismarkedasa.Theendofthedashedbluelineshowstheestimatesofthesensitivityandspecicityobtainedfromthemajorityvotingalgorithm.Theendofthesolidredlineshowstheestimatesfromtheproposedmethod.Theellipseplotsthecontourofonestandarddeviation.1313 RAYKAR,YU,ZHAO,VALADEZ,FLORIN,BOGONIANDMOY 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 False Positive Rate (1-specifcity)True Positive Rate (sensitivity)ROC Curve for the classifier Golden ground truth AUC=0.915 Proposed EM algorithm AUC=0.906 Majority voting baseline AUC=0.884 (a) 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 False Positive Rate (1-specifcity)True Positive Rate (sensitivity)ROC Curve for the estimated true labels Proposed EM algorithm AUC=0.967 Majority voting baseline AUC=0.872 (b)Figure2:SameasFigure1exceptwith8differentradiologistannotations1314 LEARNINGFROMCROWDS 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 False Positive Rate (1-specifcity)True Positive Rate (sensitivity)ROC Curve for the classifier Proposed EM algorithm [Joint Estimation] AUC=0.905 Decoupled Estimation AUC=0.884 (a) 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 False Positive Rate (1-specifcity)True Positive Rate (sensitivity)ROC Curve for the estimated true labels Proposed EM algorithm [Joint Estimation] AUC=0.972 Decoupled Estimation AUC=0.921 (b)Figure3:ROCcurvescomparingtheproposedalgorithm(solidredline)withtheDecoupledEsti-mationprocedure(dottedblueline),whichreferstothealgorithmwherethegroundtruthisrstestimatedusingjustthelabelsfromtheveradiologistsandthenalogisticregres-sionclassieristrainedusingthesoftprobabilisticlabels.IncontrasttheproposedEMalgorithmestimatesthegroundtruthandlearnstheclassiersimultaneouslyduringtheEMalgorithm.1315 RAYKAR,YU,ZHAO,VALADEZ,FLORIN,BOGONIANDMOY 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 False Positive Rate (1-specifcity)True Positive Rate (sensitivity)Leave-One-Out ROC Curve for the classifier Golden ground truth AUC=0.909 Majority voting baseline AUC=0.828 Proposed EM algorithm AUC=0.879 (a) 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 False Positive Rate (1-specifcity)True Positive Rate (sensitivity)ROC Curve for the estimated true labels Proposed EM algorithm AUC=0.944 Majority voting baseline AUC=0.937 (b)Figure4:BreastMRIresults.(a)Theleave-one-outcrossvalidatedROC.(b)ROCfortheestimatedgroundtruth.1316 LEARNINGFROMCROWDS 20 40 60 80 100 120 140 160 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 Number of AnnotatorsAccuracy Majority Voting EM Algorithm Figure5:ThemeanandtheonestandarddeviationerrorbarsfortheaccuracyoftheestimatedgroundtruthfortheRecognizingTextualEntailmenttaskasafunctionofthenumberofannotators.Theplotwasgeneratedbyrandomlysamplingtheannotators100times.6.2RegressionExperimentsWerstillustratethealgorithmonatoydatasetandthenpresentacasestudyforautomatedpolypmeasurements.6.2.1ILLUSTRATIONFigure6illustratesthetheproposedalgorithmforregressiononaone-dimensionaltoydatasetwiththreeannotators.Theactualregressionmodel(shownasabluedottedline)isgivenbyy=5x2.Wesimulate20samplesfromthreeannotatorswithprecisions0.01,0.1,and1.0.Thedataareshownbytheannotators'snumber.Whilewecantaregressionmodelusingeachannotators'sresponse,weseethatonlythemodelforannotatorthree(withhighestprecision)isclosetothetrueregressionmodel.Thegreendashedlineshowsthemodellearntusingtheaverageresponsefromallthethreeannotators.Theredlineshowsthemodellearntbytheproposedalgorithm.6.2.2AUTOMATEDPOLYPMEASUREMENTSColorectalpolypsaresmallcolonicndingsthatmaydevelopintocanceratalaterstage.Thediameterofthepolypisoneofthekeyfactorswhichdecidesthemalignancyofasuspiciouspolyp.Henceaccuratesizeestimationiscrucialtodecidetheactiontobetakenonapolyp.Wehavedevelopedvariousalgorithmstosegmentapolyp.Multiplesegmentationalgorithmsgiverisetoasetoffeatureswhicharecorrelatedwiththediameterofthepolyp.Wewanttolearnaregressionfunctionwhichcanpredictthediameterofapolypasafunctionofthesefeatures.Inordertolearnaregressionfunctionwecollectourgroundtruthbyaskingmanyradiologiststomanuallymeasurethethediameterofthepolypsfromthethree-dimensionalimages.Inpracticethereisalotofdisagreementamongtheradiologistsastotheactualsizeofthepolyp.1317 RAYKAR,YU,ZHAO,VALADEZ,FLORIN,BOGONIANDMOY 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 -5 -4 -3 -2 -1 0 1 2 3 4 5 x (feature)y (response)N=50 examples R=3 annotators 11111111111111111111111 22222222222222222222222222222222222222222222222222 33333333333333333333333333333333333333333333333333 Actual regression model Model with annotator 1 (Precision=0.01) Model with annotator 2 (Precision=0.10) Model with annotator 3 (Precision=1.00) Model with average response Proposed algorithm Figure6:Illustrationoftheproposedalgorithmonaone-dimensionaltoydataset.Theactualregressionmodel(shownasabluedottedline)isgivenbyy=5x2.Wesimulate50samplesfromthreeannotatorswithprecisions0.01,0.1,and1.0.Thedataareshownbytheannotators'snumber.Whilewecantaregressionmodelusingeachannotators'sresponse,weseethatonlythemodelforannotatorthree(withhighestprecision)isclosetothetrueregressionmodel.Thegreendashedlineshowsthemodellearntusingtheaverageresponsefromallthethreeannotators.Theredlineshowsthemodellearntbytheproposedalgorithm. 0 5 10 15 0 5 10 15 Actual Polyp Diameter (mm)Estimated Polyp Diameter (mm)Pearson Correlation Coefficient=0.714720 RMSE=1.970815 (a)Goldstandardmodel 0 5 10 15 0 5 10 15 Actual Polyp Diameter (mm)Estimated Polyp Diameter (mm)Pearson Correlation Coefficient=0.706558 RMSE=1.991576 (b)Proposedmodel 0 5 10 15 0 5 10 15 Actual Polyp Diameter (mm)Estimated Polyp Diameter (mm)Pearson Correlation Coefficient=0.554966 RMSE=2.887740 (c)AveragemodelFigure7:Scatterplotoftheactualpolypdiametervsthediameterpredictedbythemodelslearntusing(a)theactualgoldstandard,(b)theproposedalgorithmwithannotationsfromveradiologists,and(c)theaverageoftheradiologist'sannotations.(See§6.2.2forade-scriptionoftheexperimentalsetup.)1318 LEARNINGFROMCROWDSWeuseaproprietarydatasetcontaining393examples(whichpointto285distinctpolyps—thesegmentationalgorithmsgenerallyreturnmultiplemarksonthesamepolyp.)alongwiththemeasureddiameter(rangingfrom2mmto15mm)asourtrainingset.Eachexampleisdescribedbyasetof60morphologicalfeatureswhicharecorrelatedtothediameterofthepolyp.Inordertovalidatethefeasibilityofourproposedalgorithm,wesimulateveradiologistsaccordingtothenoisymodeldescribedin§5.1witht=[0:0010:010:1110].Thiscorrespondstoasituationwheretherstthreeradiologistsareextremelynoisyandthelasttwoarequiteaccurate.Basedonthemeasurementsfrommultipleradiologists,wecansimultaneously(1)learnalinearregressorand(2)estimatetheprecisionofeachradiologist.Wecomparetheresultswiththeclassiertrainedusingtheactualgoldengroundtruthaswellastheregressorlearntusingtheaverageoftheradiologistsmeasurements.Theresultsarevalidatedonanindependenttestsetcontaining397examples(whichpointto298distinctpolyps).Figure7showsthescatterplotoftheactualpolypdiametervsthediameterpredictedbythethreedifferentmodels.Wecomparetheperformancebasedontherootmeansquarederror(RMSE)andalsothePearson'scorrelationcoefcient.Theregressorlearntusingtheproposediterativealgorithm(Figure7(b))isalmostasgoodastheonelearntusingthegoldengroundtruth(Figure7(a)).Thecorrelationcoefcientfortheproposedalgorithmissignicantlylargerthanthatlearntusingtheaverageoftheradiologistsresponse.Theestimateobtainedbyaveragingisclosertothenovicessincetheyformamajority(3/5).Theproposedalgorithmappropriatelyweightseachradiologistbasedontheirestimatedprecisions.7.ConclusionsandFutureWorkInthispaperweproposedaprobabilisticframeworkforsupervisedlearningwithmultipleannota-torsprovidinglabelsbutnoabsolutegoldstandard.Theproposedalgorithmiterativelyestablishesaparticulargoldstandard,measurestheperformanceoftheannotatorsgiventhatgoldstandard,andthenrenesthegoldstandardbasedontheperformancemeasures.Wespecicallydiscussedbinary/categorical/ordinalclassicationandregressionproblems.Wemadetwokeyassumptions:(1)theperformanceofeachannotatordoesnotdependonthefeaturevectorforagiveninstanceand(2)conditionalonthetruththeexpertsareindependent,thatis,theymaketheirerrorsindependently.Aswepointedoutearliertheseassumptionsarenottrueinpractice.Theannotatorperformancedependsontheinstanceheislabelingandthereissomedegreeofcorrelationamongtheannotators.Webrieydiscusssomestrategiestorelaxthesetwoassumptions.7.1InstanceDifcultyOnedrawbackofthecurrentmodelisthatitdoesn'testimatedifcultyofitems.Itisoftenobservedthatfortheeasyinstancesalltheannotatorsagreeonthelabels—thusviolatingourconditionalindependenceassumption.Thedifcultyofannotatinganitemcanbecapturedbyanotherlatentvariablegiforeachinstance—whichmodulatestheannotatorsperformance.Modelsforthishavebeendevelopedintheareaofitem-responsetheory(BakerandKim,2004)andalsoinepidemiol-ogy(UebersaxandGrove,1993)—seealsoWhitehilletal.(2009)forarecentpaperinthemachinelearningcommunity.Whilethesemodelsdonottakeintoaccounttheavailablefeaturesourpro-1319 RAYKAR,YU,ZHAO,VALADEZ,FLORIN,BOGONIANDMOYposedmodelforsensitivityandspecicitycanbeextendedasfollows(inplaceof(1)and(2)):aj(gi)=Pr[yji=1jyi=1;gi]=s(aj1+bj1gi):bj(gi)=Pr[yji=0jyi=0;gi]=s(aj0+bj0gi):Heretheparametersaj1andaj0arerelatedtothesensitivityandspecicityofthethannotator,whilethelatenttermgicapturesthedifcultyoftheinstance.Thekeyassumptionhereisthattheannotatorsareindependentconditionalonbothyiandgi.Variousassumptionscanbemadeontwoparametersbj1andbj0tosimplifythesemodelsfurther—forexamplewecouldsetbj1=b1andbj0=b0foralltheannotators.7.2AnnotatorsActuallyLookattheDataInourmodelwemadetheassumptionthatthesensitivityajandthespecicitybjofthethannota-tordoesnotdependonthefeaturevectorxi.Forexample,intheCADdomain,thismeantthattheradiologist'sperformanceisconsistentacrossdifferentsub-groupsofdata—whichisnotentirelytrue.Itisknownthatsomeradiologistsaregoodatdetectingcertainkindsofmalignantlesionsbasedontheirtrainingandexperience.Wecanextendthepreviousmodelsuchthatthesensitivityandthespecicitydependsonthefeaturevectorxiexplicitlyasfollowsaj(gi;xi)=Pr[yji=1jyi=1;gi;xi]=s(aj1+bj1gi+wja�xi):aj(gi;xi)=Pr[yji=0jyi=0;gi;xi]=s(aj0+bj0gi+wjb�xi):Howeverthischangeincreasesthenumberofparameterstobelearned.ReferencesP.S.AlbertandL.E.Dodd.Acautionarynoteontherobustnessoflatentclassmodelsforestimatingdiagnosticerrorwithoutagoldstandard.Biometrics,60:427–435,2004.F.B.BakerandS.Kim.ItemResponseTheory:ParameterEstimationTechniques.CRCPress,2edition,2004.B.Carpenter.Multilevelbayesianmodelsofcategoricaldataannotation.TechnicalReportavailableathttp://lingpipe-blog.com/lingpipe-white-papers/,2008.S.R.Cholleti,S.A.Goldman,A.Blum,D.G.Politte,andS.Don.Veritas:Combiningexpertopinionswithoutlabeleddata.InProceedingsofthe200820thIEEEinternationalConferenceonToolswithArticialintelligence,2008.A.P.DawidandA.M.Skene.Maximumlikeihoodestimationofobservererror-ratesusingtheEMalgorithm.AppliedStatistics,28(1):20–28,1979.O.DekelandO.Shamir.VoxPopuli:Collectinghigh-qualitylabelsfromacrowd.InCOLT2009:Proceedingsofthe22ndAnnualConferenceonLearningTheory,2009a.O.DekelandO.Shamir.Goodlearnersforevilteachers.InICML2009:Proceedingsofthe26thInternationalConferenceonMachineLearning,pages233–240,2009b.1320 LEARNINGFROMCROWDSA.P.Dempster,N.M.Laird,andD.B.Rubin.MaximumlikelihoodfromincompletedataviatheEMalgorithm.JournaloftheRoyalStatisticalSociety:SeriesB,39(1):1–38,1977.P.Donmez,J.G.Carbonell,andJ.Schneider.Efcientlylearningtheaccuracyoflabelingsourcesforselectivesampling.InKDD2009:Proceedingsofthe15thACMSIGKDDinternationalconferenceonKnowledgediscoveryanddatamining,pages259–268,2009.E.FrankandM.Hall.Asimpleapproachtoordinalclassication.LectureNotesinComputerScience,pages145–156,2001.G.Fung,B.Krishnapuram,J.Bi,M.Dundar,V.C.Raykar,S.Yu,R.Rosales,S.Krishnan,andR.B.Rao.Miningmedicalimages.InFifteenthAnnualSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining:ThirdWorkshoponDataMiningCaseStudiesandPracticePrize,2009.J.Howe.Crowdsourcing:WhythePoweroftheCrowdIsDrivingtheFutureofBusiness.2008.S.L.HuiandS.D.Walter.Estimatingtheerrorratesofdiagnostictests.Biometrics,36:167–171,1980.S.L.HuiandX.H.Zhou.Evaluationofdiagnostictestswithoutgoldstandards.StatisticalMethodsinMedicalResearch,7:354–370,1998.R.JinandZ.Ghahramani.Learningwithmultiplelabels.InAdvancesinNeuralInformationProcessingSystems15,pages897–904.2003.B.Krishnapuram,J.Stoeckel,V.C.Raykar,R.B.Rao,P.Bamberger,E.Ratner,N.Merlet,I.Stain-vas,M.Abramov,andA.Manevitch.Multiple-instancelearningimprovesCADdetectionofmassesindigitalmammography.InIWDM2008:Proceedingsofthe9thinternationalworkshoponDigitalMammography,pages350–357.2008.G.Lugosi.Learningwithanunreliableteacher.PatternRecognition,25(1):79–87,1992.R.M.NealandG.E.Hinton.AviewoftheEMalgorithmthatjustiesincremental,sparse,andothervariants.InLearninginGraphicalModels,pages355–368.KluwerAcademicPublishers,1998.V.C.Raykar,S.Yu,L.H.Zhao,A.Jerebko,C.Florin,G.H.Valadez,L.Bogoni,andL.Moy.Supervisedlearningfrommultipleexperts:Whomtotrustwheneveryoneliesabit.InICML2009:Proceedingsofthe26thInternationalConferenceonMachineLearning,pages889–896,2009.V.S.Sheng,F.Provost,andP.G.Ipeirotis.Getanotherlabel?improvingdataqualityanddataminingusingmultiple,noisylabelers.InProceedingsofthe14thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages614–622,2008.P.Smyth.Learningwithprobabilisticsupervision.InComputationalLearningTheoryandNaturalLearningSystems3,pages163–182.MITPress,1995.1321 RAYKAR,YU,ZHAO,VALADEZ,FLORIN,BOGONIANDMOYP.Smyth,U.Fayyad,M.Burl,P.Perona,andP.Baldi.Inferringgroundtruthfromsubjectivelabellingofvenusimages.InAdvancesinNeuralInformationProcessingSystems7,pages1085–1092.1995.R.Snow,B.O'Connor,D.Jurafsky,andA.Ng.Cheapandfast-Butisitgood?Evaluatingnon-expertannotationsfornaturallanguagetasks.InProceedingsoftheConferenceonEmpiricalMethodsinNaturalLanguageProcessing,pages254–263,2008.A.SorokinandD.Forsyth.UtilitydataannotationwithAmazonMechanicalTurk.InProceedingsoftheFirstIEEEWorkshoponInternetVisionatCVPR08,pages1–8,2008.J.S.UebersaxandW.M.Grove.Alatenttraitnitemixturemodelfortheanalysisofratingagreement.Biometrics,49:823–835,1993.S.K.Wareld,K.H.Zou,andW.M.Wells.Simultaneoustruthandperformancelevelestimation(STAPLE):analgorithmforthevalidationofimagesegmentation.IEEETransactionsonMedicalImaging,23(7):903–921,2004.J.Whitehill,P.Ruvolo,T.Wu,J.Bergsma,andJ.Movellan.Whosevoteshouldcountmore:Opti-malintegrationoflabelsfromlabelersofunknownexpertise.InAdvancesinNeuralInformationProcessingSystems22,pages2035–2043.2009.1322