Tuning Causal Discovery Algorithms

Konstantina Biza KONBIZA@GMAIL.COM
Computer Science Department, University of Crete

Ioannis Tsamardinos TSAMARD.IT@GMAIL.COM
Computer Science Department, University of Crete

Sofia Triantafillou SOT16@PITT.EDU
Department of Biomedical Informatics, University of Pittsburgh

Abstract

There are numerous algorithms proposed in the literature for learning causal graphical probabilistic models. Each one of them is typically equipped with one or more tuning hyper-parameters. The choice of optimal algorithm and hyper-parameter values is not universal; it depends on the size of the network, the density of the true causal structure, the sample size, as well as the metric of quality of learning a causal structure. Thus, the challenge to a practitioner is how to "tune" these choices, given that the true graph is unknown and the learning task is unsupervised. In the paper, we evaluate two previously proposed methods for tuning, one based on stability of the learned structure under perturbations of the input data and the other based on balancing the in-sample fitting of the model with the model complexity. We propose and comparatively evaluate a new method that treats a causal model as a set of predictive models: one for each node given its Markov Blanket. It then tunes the choices using out-of-sample protocols for supervised methods such as cross-validation. The proposed method performs on par or better than the previous methods for most metrics.

Keywords: Causal Discovery, Tuning, Bayesian Networks.

1. Introduction

A wide palette of algorithms for learning causal probabilistic graphical models from observational data has been proposed in the literature. The algorithms differ in terms of their distributional assumptions, theoretical properties, search heuristics for the optimal structure, approximations, or other characteristics that make them more or less appropriate and effective for a given learning task. In addition, most algorithms are "tunable" by employing a set of hyper-parameters that determines their behavior, such as their sensitivity to identifying correlations. The choice of algorithm and corresponding hyper-parameter values (hereafter called a configuration) can have a sizable impact on the learning quality. Practitioners are faced with optimizing the configuration for the task at hand. Unfortunately, given that the problem is unsupervised, standard out-of-sample estimation methods used for supervised problems, such as cross-validation, cannot be directly applied.

In this work, we propose an optimization method over a set of configurations, called Out-of-sample Causal Tuning or OCT. It is based on the premise that a causal network G induces a set of predictive models, namely a model for each node V, using as predictors the variables in its Markov Blanket, denoted as MB_G(V). The MB_G(V) is the minimal set of nodes that leads to an optimally predictive model for V (under some conditions, see (Tsamardinos and Aliferis, 2003)). Thus, a configuration that outputs a causal network with all the correct Markov Blankets will exhibit the optimal average (over all nodes) out-of-sample prediction power and be selected over all other configurations. Hence, one could optimize the configuration by employing out-of-sample estimation protocols devised for supervised learning problems, such as k-fold cross-validation, holdout, and repeated holdout. Whether optimizing causal algorithms with respect to predictive performance will also result in optimizing learning with respect to the causal structure is the major research question of the paper.

The proposed method is comparatively evaluated against three other methods in the literature: StARS, and selecting the configuration that minimizes the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC). StARS selects the configuration that exhibits the highest stability of causal induction to perturbations of the input data, while BIC and AIC select the configurations that fit the data best, according to some parametric assumptions, while penalizing for complexity. In a set of computational experiments over several network sizes and densities, as well as sample sizes, we show that OCT performs on par or better than the prior suggestions. However, we also indicate that there is still much room for improvement for novel tuning methodologies.

OCT is an approach exploiting the connection between causal models and predictive models: a causal model implies a set of predictive models. It tries to leverage results in supervised, predictive modeling to improve causal discovery, in this case the use of out-of-sample estimation protocols. Conceivably, other techniques for tuning predictive models, such as the ones in the field of automated machine learning (AutoML) for intelligently searching the space of hyper-parameters, could improve causal tuning and causal learning.

2. Problem Definition

In an application of causal discovery algorithms to real data, a practitioner is faced with selecting the appropriate algorithm to use. In addition, each algorithm requires the choice of the values of a certain number of hyper-parameters that determine its behavior, such as the sensitivity of identifying patterns. Hyper-parameters differ from model parameters in the sense that the former are set by the user, while the latter are estimated from the data. The impact of the choice of the hyper-parameter values for a given algorithm has been noted in several papers (Raghu et al., 2018; Ramsey and Andrews, 2017). Optimizing over both algorithm and hyper-parameters has been coined the Combined Algorithm Selection and Hyper-parameter optimization problem in the supervised learning literature (Thornton et al., 2012), CASH for short, or "tuning". We adopt the same nomenclature in this paper. Notice that we can represent the choice of the learning algorithm with a new hyper-parameter. An instantiation of all hyper-parameter values (including the algorithm) is called a configuration.
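
For concreteness, a configuration can be encoded as an algorithm name together with a value for each of its hyper-parameters. The minimal sketch below is illustrative only; the algorithm names and grid values echo those used later in Section 5, but the encoding itself is an assumption of this sketch, not the paper's code.

```python
from itertools import product

# A configuration = an algorithm plus one value per hyper-parameter.
# Section 5 varies, e.g., CI-test significance levels and conditioning-set sizes.
algorithms = ["PC", "PC-stable", "CPC"]      # constraint-based choices
alphas = [0.001, 0.005, 0.01, 0.05, 0.1]     # significance levels for CI tests
max_cond_sets = [4, None]                    # None = unlimited conditioning set

configurations = [
    {"algorithm": alg, "alpha": a, "max_cond_set": m}
    for alg, a, m in product(algorithms, alphas, max_cond_sets)
]
print(len(configurations))  # 3 * 5 * 2 = 30 candidate configurations
```

Tuning, in the CASH sense, means selecting one element of such a grid using the data alone.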

A related problem in statistics is the problem of model selection. In both cases, one optimizes among a set of possible choices. However, there are conceptual differences of perspective on the problem. Historically in statistics, different models are fit and then the final model is selected among all the ones fit. The choice is often manual, by visualizing the model's fit and residuals. Principled methods for model selection typically score the trade-off between model fitting and model complexity. Such a model selection criterion is the Bayesian Information Criterion (or, similarly, the Akaike Information Criterion), scoring fitting using the in-sample data likelihood and penalizing for the model's degrees of freedom. A main observation in model selection is that all models are trained on all training data; selection is based on the in-sample data fit.

In contrast, the CASH perspective (or arguably, the Machine Learning perspective) focuses on the learning algorithm, not the specific models (model instances, to be precise). It is not the model that is selected, but the algorithm and its hyper-parameters (the configuration) to be applied on all data. For example, during cross-validation of an algorithm several models are produced. None of them is the final model. They only serve to estimate how accurate the models produced by the algorithm are on average. The final model to return is the model trained on all data using the learning algorithm; all other models serve only for estimating performance purposes. Thus, CASH selects algorithms, not models, typically using out-of-sample estimation protocols (e.g., cross-validation).

It is not straightforward to apply the above techniques to the CASH problem in causal discovery. A first reason is that the task is inherently unsupervised. The true causal network is unknown. Thus, there is no direct way of estimating how well a model approximates the true causalities in the data. A second problem has to do with the performance metric to optimize. Performance metrics for typical supervised learning tasks, such as binary classification and regression, have reached maturity. However, for causal discovery there is a range of metrics, some considering only the causal structure, parts of the causal structure (e.g., edge presence or absence), and others that consider the network's parameters (e.g., effect sizes). Despite its obvious importance to practitioners, the problem of tuning has not been extensively studied in the context of causal discovery.

3. Preliminaries

Causal Bayesian Networks (CBN) consist of a Directed Acyclic Graph (DAG) G and a set of probabilities P (Pearl, 2009). Nodes in the DAG represent variables (we use the terms node and variable interchangeably) and directed edges represent direct causality (in the context of the nodes in the graph). Each node can represent a continuous or discrete variable. If X → Y in a DAG G, we say that X is a parent of Y and Y is a child of X in G. Two nodes that share a common child are called spouses. The graph and the distribution are connected through the Causal Markov Condition (CMC). The CMC states that every variable is independent of its non-effects given its direct causes. Given the CMC, P(V) can be factorized as

P(X_1, ..., X_n) = ∏_i P(X_i | Pa_G(X_i)),

where Pa_G(X_i) denotes the parents of X_i in G.
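
To make the factorization concrete, consider a toy three-node chain X1 → X2 → X3 with binary variables; the probability tables below are invented for illustration.

```python
# Joint probability under the CMC factorization for the chain X1 -> X2 -> X3:
# P(X1, X2, X3) = P(X1) * P(X2 | X1) * P(X3 | X2).
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # indexed [x1][x2]
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # indexed [x2][x3]

def joint(x1, x2, x3):
    """P(X1=x1, X2=x2, X3=x3) as the product of the local conditionals."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# The local tables compose into a proper joint distribution:
print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))  # 1.0
```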

Equivalently, the CMC entails a set of conditional independencies expected to hold in the joint probability distribution of the variables in the graph. CBNs can only model causally sufficient systems, meaning that no pair of variables shares an unmeasured common cause (confounder). Extensions of CBNs such as Maximal Ancestral Graphs (MAGs) (Richardson and Spirtes, 2002) model causal relationships in causally insufficient systems.

There are two major types of methods for learning causal structure: constraint-based and score-based. Constraint-based methods apply tests of conditional independence to a dataset, and then try to identify all causal graphs that are consistent with these (in)dependencies. Hyper-parameters of these methods include the type of conditional independence test, the conditioning set size, and the significance threshold for rejecting the null hypothesis. Score-based methods try to identify the graphical model that leads to a factorization that is closest to the one estimated by the observational data (Cooper and Herskovits, 1992). Hyper-parameters of score-based methods include sampling and structure priors. In the CASH perspective, the choice of algorithm is also a hyper-parameter.

In most cases, the causal structure cannot be uniquely identified from the data. Instead, a set of DAGs will entail the same independence relationships (or equivalent factorizations). These graphs are called Markov equivalent, and share the same edges and some orientations. Both constraint-based and score-based algorithms typically return a Partially Directed Acyclic Graph (PDAG) that summarizes the invariant features of the Markov equivalent graphs.
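
Since the methods below repeatedly read Markov Blankets off an estimated graph, a minimal sketch of MB extraction from a DAG may be useful: in a DAG, MB(X) is the union of X's parents, children, and spouses. The adjacency-dict encoding is an assumption of this sketch, not anything prescribed by the paper.

```python
def markov_blanket(dag, x):
    """Markov Blanket of x in a DAG encoded as {node: set_of_children}:
    parents(x) | children(x) | spouses(x)."""
    parents = {v for v, kids in dag.items() if x in kids}
    children = set(dag.get(x, set()))
    spouses = {v for v, kids in dag.items()
               for c in children if c in kids and v != x}
    return parents | children | spouses

# Example: A -> X, X -> C, B -> C; then MB(X) = {A, C, B}.
dag = {"A": {"X"}, "X": {"C"}, "B": {"C"}, "C": set()}
print(markov_blanket(dag, "X"))  # {'A', 'B', 'C'}
```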

4. Approaches to Tuning for Causal Network Discovery

Two prior and one novel approach to tuning for causal discovery are presented. They can be viewed as representatives of three distinct methodologies to the problem, following different principles.

4.1 Network stability and the StARS algorithm

One principle for selecting a configuration is based on the stability of the networks output by the configuration w.r.t. small changes to the input data. A specific instantiation of this principle is the StARS method (Liu et al., 2010), which was initially introduced specifically for tuning the lambda penalty hyper-parameter of the graphical lasso (Friedman et al., 2008); it was later applied as a tuning method in (Raghu et al., 2018). The basic idea is to select a configuration that minimizes the instability of the network over perturbations of the input data. For a given configuration a, the network instability N(a) is computed as the average edge instability. In turn, for each edge (X, Y) the instability ξ_XY is computed as follows: the probability p_XY of its presence in the network is estimated using subsampling (learning multiple networks using the same hyper-parameters on resamples of the data without replacement). The instability of the edge is defined as ξ_XY = 2 p_XY (1 − p_XY), i.e., it is twice the variance of a Bernoulli distribution with parameter p_XY. It is low when p_XY is close to 0 or 1, and high when it is close to 0.5.

Selecting the configuration with the minimum instability (maximum stability) seems reasonable at first glance, but it could lead to configurations that consistently select the empty or the full graph. To avoid this situation, the authors of StARS propose the following: first, order the configurations by increasing density, and then "monotonize" their instability metric, i.e., define N̄(a_j) = max_{i ≤ j} N(a_i), where N(a) is the average instability for configuration a over all edges. Subsequently, select the configuration a* = argmax_{a_i} { N̄(a_i) | N̄(a_i) ≤ β }, where β is a hyper-parameter of the StARS method. Intuitively, this strategy selects the configuration that produces the densest network with an accepted value of instability (below the threshold β). The pseudo-code is presented in Algorithm 1. The value of β is suggested to be 0.05 (Liu et al., 2010; Raghu et al., 2018).

Liu et al. (2010) compare StARS to BIC and cross-validation, and find that StARS outperforms the alternative methods for high-dimensional settings. However, we note that StARS was evaluated specifically for tuning the graphical Lasso lambda parameter. It was not evaluated as a method to also tune the algorithm choice or other types of hyper-parameters. This is a research question explored in the experimental section.

StARS is based on the intuition that configurations whose output is very sensitive to the specific samples included in the dataset are undesirable. An alternative metric of a sort of instability appeared in (Roumpelaki et al., 2016). The intuition in this latter work is that marginalizing out some variables should produce networks whose causal relations do not conflict with the network learned over all variables. An algorithm is proposed to measure consistency under marginalizations. It could also be employed for tuning, similar to StARS. The StARS approach measures sensitivity of a configuration to the specific sampling, while the latter measures sensitivity of the output to the specific choices of variables to measure. Another stability-based approach is in (Meinshausen and Bühlmann, 2010), specifically aiming to control the false positive rate of network edges.

A theoretical advantage of StARS is that it will select configurations that are robust to a few outliers in the data. A disadvantage is that it measures stability with respect to edge presence, i.e., the network skeleton; causal directionality is ignored. This could be ameliorated by considering edge-arrows in the stability calculations. Another obvious disadvantage is that it does not evaluate the model fit to the data. Thus, an algorithm that makes the same systematic error will be favored. This is not a problem when using a single algorithm (graphical Lasso), hence it was not identified as a problem in the original StARS paper. It is, however, a serious disadvantage that requires more fundamental changes to the basic principles of the StARS algorithm to make it successful in a broader tuning context, one that includes other network algorithms and hyper-parameter types.
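
A small sketch of the instability computation and the monotonized selection rule just described. It assumes each subsample network is given as a set of frozenset edges; producing those networks (running one configuration per subsample) is left abstract.

```python
import numpy as np
from itertools import combinations

def avg_instability(edge_sets, n_nodes):
    """Average edge instability 2*p*(1-p) over all node pairs, where p is the
    frequency of the edge across the networks learned on subsamples."""
    pairs = list(combinations(range(n_nodes), 2))
    freqs = [np.mean([frozenset(pair) in edges for edges in edge_sets])
             for pair in pairs]
    return sum(2 * p * (1 - p) for p in freqs) / len(pairs)

def stars_select(N, densities, beta=0.05):
    """Order configurations by increasing density, monotonize N, and return the
    index of the densest configuration with monotonized instability <= beta."""
    order = np.argsort(densities)
    N_bar = np.maximum.accumulate(np.asarray(N)[order])  # monotonized N
    admissible = np.where(N_bar <= beta)[0]
    # Fall back to the sparsest configuration if none is admissible.
    return order[admissible[-1]] if len(admissible) else order[0]
```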

Algorithm 1: StARS
  Input: Dataset D over nodes V, Configurations A, Subsamples S, threshold β
  Output: Configuration a*
   1  for a ∈ A do
   2    for s ∈ S do
   3      G_{a,s} ← causalAlg_a(D_s)
   4      D_{a,s} ← density of G_{a,s}
   5    for each pair of variables X, Y do
   6      p_{a,XY} ← frequency of an edge (X, Y) in {G_{a,s}}_{s ∈ S}
   7      ξ_{a,XY} ← 2 p_{a,XY} (1 − p_{a,XY})
   8    N(a) ← average ξ_{a,XY} over all edges
   9  rank the N(a) by increasing density
  10  N̄(a_j) ← max_{i ≤ j} N(a_i)
  11  a* ← argmax_{a_i} { N̄(a_i) | N̄(a_i) ≤ β }

Algorithm 2: BIC selection
  Input: Dataset D over nodes V, Configurations A
  Output: Configuration a*
   1  for a ∈ A do
   2    G_a ← causalAlg_a(D)
   3    G′_a ← pdagToDag(G_a)
   4    LL_a ← 2 log P(D | G′_a)
   5    BIC_a ← log(n)·k − LL_a
   6  a* ← argmin_{a ∈ A} BIC_a

4.2 Balancing Fitting with Model Complexity

Another principle for selecting a model and corresponding configuration is to select based on the best tradeoff between the in-sample fitting of the data and the model complexity. This approach has also been used for other unsupervised problems, like clustering (Hofmeyr, 2018). A specific instantiation of the principle for causal discovery tuning appeared in (Maathuis et al., 2009) and is based on the Bayesian Information Criterion (BIC). BIC scoring was also compared with StARS (Liu et al., 2010) for learning Markov Networks using the graphical Lasso. BIC scores a causal model based on the likelihood of the data given the causal model and penalizes with the degrees of freedom of the model (Schwarz, 1978). Alternatively, one could use the AIC, which has been shown to be asymptotically equivalent to leave-one-out cross-validation (Stone, 1977). When the model is a CBN, the likelihood of the data is computed based on the corresponding factorization:

P(D | G) = ∏_{ij} P(x_ij | G) = ∏_{ij} P(x_ij | Pa(i))    (1)

where P is the probability or probability density function, x_ij is the value of the i-th variable in the j-th sample, and Pa(i) the parents of variable i in G. In order to compute P(x_ij | Pa(i)), a parametric, statistical model needs to be fit with each variable i as the outcome, given its parents Pa(i). The BIC of a graph is the likelihood, penalized for the number of parameters. Lower values of BIC and AIC are better. Any alternative principle for scoring the tradeoff of model fitting with complexity could also be employed. Examples include the Minimum Message Length, the Minimum Description Length, PAC error bounds on the model, or even the Kolmogorov complexity.
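
For the linear-Gaussian case used later in the experiments, the per-node fits in Eq. 1 are ordinary least-squares regressions. The sketch below shows one way the resulting BIC could be computed; it is an illustration under that assumption, not the Tetrad or bnlearn implementation.

```python
import numpy as np

def gaussian_bic(data, parents):
    """BIC = log(n)*k - 2*logL for a linear-Gaussian CBN.
    data: (n, n_vars) array; parents: {i: list of parent column indices}."""
    n = data.shape[0]
    loglik, k = 0.0, 0
    for i, pa in parents.items():
        X = np.column_stack([np.ones(n)] + [data[:, j] for j in pa])
        beta, *_ = np.linalg.lstsq(X, data[:, i], rcond=None)
        sigma2 = np.mean((data[:, i] - X @ beta) ** 2)   # MLE residual variance
        loglik += -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        k += len(pa) + 2                 # coefficients + intercept + variance
    return np.log(n) * k - 2 * loglik    # lower is better

# Toy usage on a simulated chain 0 -> 1 -> 2:
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = 0.8 * x0 + rng.normal(size=500)
x2 = 0.5 * x1 + rng.normal(size=500)
D = np.column_stack([x0, x1, x2])
print(gaussian_bic(D, {0: [], 1: [0], 2: [1]}))  # true structure
print(gaussian_bic(D, {0: [], 1: [], 2: []}))    # empty graph scores worse
```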

There are several advantages to BIC and AIC scoring for model selection. First, several algorithms use BIC internally to search and score the best possible causal model, proving its effectiveness. Using Eq. 1, one needs to fit models for each node X_i from only its parents in the graph, Pa(X_i). In comparison, the proposed method below employs models for each node given its Markov Blanket. The latter is a superset of the parents and thus requires more samples to be fit accurately. There are also some disadvantages, a major one being that it requires the computation of the likelihood and the degrees of freedom of the causal model. This is typically possible only with statistical, parametric models such as the Gaussian and multinomial employed in the current paper. Computing the BIC for Decision Trees, Random Forests and other types of non-parametric machine learning models is not possible. Thus, using BIC does not allow the full use of the ML arsenal in predictive modeling. In addition, BIC and AIC cannot be computed for Maximal Ancestral Graphs with discrete variables.

4.3 Tuning based on Predictive Performance

The main principle of our proposed approach is to treat a causal model as a set of predictive models. Subsequently, we can evaluate the configurations producing causal models using out-of-sample performance estimation protocols such as cross-validation. A similar approach has been suggested for other unsupervised learning tasks, such as dimensionality reduction with the PCA algorithm (Perry, 2009) and for clustering (Fu and Perry, 2017). In the framework of potential outcomes, out-of-sample protocols have also been used to improve the estimation of conditional average treatment effects (Saito and Yasui, 2019).

Specifically, a causal model G induces a Markov Blanket MB(X) for each node of the graph. The MB(X) is the minimal set that renders X conditionally independent of any other node. It is unique for distributions faithful to the graph and it is invariant among all graphs in the same Markov equivalence class. Moreover, under some conditions it is the minimal set of nodes that is necessary and sufficient for optimal prediction of X (Tsamardinos et al., 2003). Hence, assuming that we model the functional relationship between MB(X) and X correctly (e.g., use a learning algorithm that correctly represents the distribution), a best-performing predictive model can be constructed.

We now propose an algorithm that selects the configuration resulting in the best set of predictive models (one for each node), using an out-of-sample protocol. The algorithm is called OCT and is described in Algorithm 3. It takes as input a dataset over variables V and a set of configurations of causal discovery algorithms A. The algorithm also takes as input the number of folds K for the cross-validation. For each configuration a and each fold k, we estimate a causal graph by running the corresponding configuration causalAlg_a on the training dataset D_k^train. Subsequently, OCT evaluates the predictive performance of causalAlg_a by identifying the Markov Blanket MB(X) of each variable X, building a predictive model for X based on MB(X), and estimating the prediction error of each predictive model on the test set D_k^test. The overall performance of causalAlg_a in fold k is the average performance of all of the predictive models (one for each variable) in that fold. Asymptotically, the true causal graph will be among the models that achieve the best performance:

Theorem 1 Assume that the following conditions hold: (a) Data are generated by a causal Bayesian Network G_true over variables V. (b) The learning algorithm can exactly learn the conditional distribution of each node V ∈ V given its MB. (c) The learning algorithm uses a proper scoring criterion. Then any DAG G for which MB_G(V) = MB_{G_true}(V) for all V ∈ V will asymptotically have the maximum score.
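
The property behind Theorem 1 (its proof follows Algorithm 3 below) is that a proper score is maximized, in expectation, by the true conditional distribution. For the log score specifically this is Gibbs' inequality: for any candidate conditional Q,

$$\mathbb{E}_P\big[\log Q(V \mid \mathrm{MB}(V))\big] \;\le\; \mathbb{E}_P\big[\log P(V \mid \mathrm{MB}(V))\big],$$

with equality if and only if Q coincides with the true conditional P(V | MB(V)).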

Algorithm 3: Out-of-sample Causal Tuning (OCT)
  Input: Dataset D over nodes V, Configurations A, Folds K
  Output: Configuration a*
   1  for a ∈ A do
   2    for k = 1 to K do
   3      G_{a,k} ← causalAlg_a(D_k^train)
   4      for X ∈ V do
   5        MB_{a,k,X} ← Markov Blanket of X in G_{a,k}
   6        M_{a,k,X} ← fitModel(X, MB_{a,k,X}, D_k^train)
   7        P_{a,k,X} ← evaluatePerf(M_{a,k,X}, D_k^test)
   8      P_{a,k} ← average of P_{a,k,X} over V
   9    P_a ← average of P_{a,k} over k
  10  a* ← argmax_{a ∈ A} P_a

Proof If a proper scoring rule is used, then the highest performance for each variable V will be obtained by the true probability distribution P(V | V \ {V}), which is equal to P(V | MB(V)).

Induced causal models that miss members of an MB(X) will achieve a lower predictive performance than possible, as they lack informational predictors. Causal models that add false positive members of an MB(X) may result in overfitting in finite samples. However, in the large sample limit, Markov Blankets that include a superset of the true set will also achieve the maximum score, and could be selected by OCT. To address this problem, we also examined a version of OCT, which we call OCTs: in this version, among configurations whose performance is similar (not statistically significantly different) to the optimal performance, we pick the one with the smallest Markov Blanket sets (on average over all variables and folds).

4.4 Strengths and Limitations

Advantages of OCT are that it does not inherently need to make parametric assumptions about the data distribution; one could potentially employ any applicable modelling method in Machine Learning or statistics, and it will asymptotically select the optimal configuration with respect to prediction (assuming the conditions in Theorem 1 hold). On the other hand, there are two major limitations of OCT. The first is that it does not directly penalize false positives. Modern learning algorithms are robust to irrelevant or redundant variables. Thus, even if Markov Blankets contain false positives, they will be ignored by the learning algorithms and their predictive performance may still be optimal. Penalizing dense graphs may ameliorate this shortcoming. A second problem is the choice of the predictive modeling algorithm. If the algorithm cannot approximate the true distribution, the procedure can collapse. Ideally, one should perform CASH on each model, with obvious impacts on the computational performance of the method.
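
A compact sketch of the OCT loop of Algorithm 3 for the continuous case, with linear-regression node models as in Section 5. Here `learn_graph` stands in for the configuration being scored, `markov_blanket` for MB extraction from the learned graph, and scikit-learn is one possible (assumed) instantiation of the folds and models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def oct_score(data, learn_graph, markov_blanket, k_folds=10):
    """Pooled out-of-sample residual score of one configuration (lower = better).
    learn_graph(train) -> graph; markov_blanket(graph, i) -> list of columns."""
    n, n_vars = data.shape
    sq_residuals = []
    folds = KFold(n_splits=k_folds, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(data):
        graph = learn_graph(data[train_idx])          # one causal graph per fold
        for i in range(n_vars):
            mb = list(markov_blanket(graph, i))
            if not mb:                                # no predictors: use the mean
                pred = np.full(len(test_idx), data[train_idx, i].mean())
            else:
                model = LinearRegression().fit(
                    data[np.ix_(train_idx, mb)], data[train_idx, i])
                pred = model.predict(data[np.ix_(test_idx, mb)])
            sq_residuals.extend((data[test_idx, i] - pred) ** 2)
    return np.sqrt(np.sum(sq_residuals))  # pooled over folds, nodes, and samples
```

Tuning then reduces to computing this score for every configuration and keeping the minimizer; Algorithm 3 is written as an argmax because it refers to a generic performance measure rather than a residual.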

5. Experimental Setup

Graphs and Data simulation. For our experiments we focused on simulated random DAGs, varying the number of nodes and their density. We tested our methods on continuous and discrete data. For each combination of data type, density and sample size, we simulate 20 datasets from random parametrizations, as follows: For discrete data, the number of categories ranges from 2 to 5 and the conditional probability tables are sampled randomly from a Dirichlet distribution with a = 0.5. For continuous data, we simulate linear Gaussian structural equation models with absolute coefficients ranging between 0.1 and 0.9.

Algorithms and Packages. We used a variety of causal discovery algorithms and hyper-parameters as "configurations": We used the Tetrad implementations of PC, PC-stable, CPC, CPC-stable and FGES (http://www.phil.cmu.edu/tetrad). We used MMHC from the recent version of Causal Explorer (Aliferis et al., 2003) and the bnlearn package (Scutari, 2010) for discrete and continuous data respectively. For constraint-based algorithms, we varied the level of significance for the conditional independence test (0.001, 0.005, 0.01, 0.05 and 0.1) and the maximum conditioning set (4 and unlimited). For FGES we varied the BIC penalty discount (1, 2, 4) for continuous data and the BDeu sample and structure priors (1, 2, 4) for discrete data. Overall, we have 48 algorithm and hyper-parameter combinations (configurations) for continuous and 54 for discrete datasets.

OCT configuration. For OCT and continuous data, we used a linear regression model as the learning algorithm. We standardized the data and computed the predictive performance as the square root of the sum of squared residuals over all nodes and samples (pooled residuals over all folds). For the OCTs version, we used a t-test to compare the sums of squared residuals. For discrete data, we used a classification tree. For OCTs, we used accuracy of predictions as the performance metric. Notice that accuracy is not a proper scoring rule.

Evaluation Metrics. We evaluate the graphs according to the following metrics. The Structural Hamming Distance (SHD) counts the number of steps needed to reach the ground truth CPDAG from the PDAG of the estimated graph. These modifications include edge removal, addition and changes in orientation (Tsamardinos et al., 2006). The Structural Intervention Distance (SID) counts the number of pairs for which the intervention distribution is falsely estimated on the learnt graph (Peters and Bühlmann, 2015). SIDu and SIDl are the upper and lower limits of SID in the Markov Equivalence class of the network. Due to space constraints, we only show SIDu in the figures. Results for SIDl are similar.
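
As an illustration of the structural metrics, below is a simplified SHD on adjacency-matrix encodings. The paper's SHD is computed between the estimated PDAG and the ground-truth CPDAG; this sketch merely counts edge additions, deletions, and orientation changes between two partially directed graphs, with A[i][j] = 1 encoding i → j and A[i][j] = A[j][i] = 1 encoding an undirected edge.

```python
import numpy as np

def shd(A, B):
    """One step per node pair whose edge status differs between the two graphs
    (missing edge, extra edge, or orientation mismatch)."""
    n = A.shape[0]
    return sum((A[i, j], A[j, i]) != (B[i, j], B[j, i])
               for i in range(n) for j in range(i + 1, n))

# Example: true graph 0 -> 1, estimated graph 1 -> 0 (reversed): SHD = 1.
A = np.zeros((2, 2), dtype=int); A[0, 1] = 1
B = np.zeros((2, 2), dtype=int); B[1, 0] = 1
print(shd(A, B))  # 1
```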

5.1 Comparative Evaluation

In Figure 1 we compare the performance of OCT, OCTs, BIC, AIC and StARS for selecting optimal configurations, over increasing density. Each tuning method was run to obtain the selected configuration for each dataset; a network is then produced using the selected configuration on the complete dataset. We also include the performance of the random selection of a configuration with uniform probability (Random). The y-axes correspond to the difference of performance achieved by a tuning method compared to the oracle selection for the given metric. An optimal selection of configuration corresponds to 0 difference; lower is better for all plots and metrics. The grey area shows the range from best to worst performance. Performance is estimated on datasets with 50 variables and 1000 samples, simulated from graphs with increasing mean parents per node (2, 4, and 6). In general, OCT and/or OCTs perform on par or better than the other methods w.r.t. most metrics and scenarios, with some exceptions discussed below.

Figure 1: Comparison of the tuning methods for continuous and discrete data over increasing number of parents per node, on datasets with 50 variables and 1000 samples. Each point corresponds to the difference of performance achieved by each method compared to the oracle selection (zero is optimal, lower is better). OCT performs on par or better than other methods in many settings.

For dense continuous networks (6 parents per node), all tuning methods perform worse than random w.r.t. SHD. OCT selects configurations that lead to denser graphs and have fewer false negative edges. For this reason, the configuration selected by OCT is close to optimal in the SIDu metric. SHD is a metric that penalizes false positives and false negatives equally. However, for probabilistic and causal predictions, false negatives often lead to larger biases than false positives. In addition to dense continuous networks, OCTs performs worse than most algorithms on discrete data. This is due to the use of accuracy as a performance metric. Accuracy is not a proper scoring rule, and per-node accuracies among best-scoring configurations are often identical. Thus, OCTs selects very sparse graphs, leading to high SHDs and SIDus. In the future, we plan to explore additional metrics, such as multi-class AUCs (Tang et al., 2011).

Figure 2 compares the methods over network size and sample size, w.r.t. the SHD and SIDu for continuous data with 50 variables and 2 parents per node on average. OCT, OCTs, BIC, and AIC perform similarly, and select configurations that are close to the optimal. Notice that SHD increases with increasing sample size. While this seems counter-intuitive, we note that this is the difference in SHDs between the selected and the optimal configuration. The actual mean SHD is lower for larger sample sizes, as expected.

Figure 2: Comparison of the methods over increasing network (left) and sample sizes (right), on continuous datasets with 2 parents per variable. All methods besides StARS perform similarly.

The average computational time on 10 graphs of 50 nodes is as follows. For continuous data, OCT: 90 (1.6), BIC: 49 (0.6), StARS: 136 (2.2), while for discrete data, OCT: 2065 (1599), BIC: 317 (241), StARS: 342 (151) (time in seconds, standard deviation in parentheses). Thus, OCT is always more expensive than BIC, and more expensive than StARS in the discrete case. However, we note that the computational cost is not prohibitive for graphs of modest size (20-100 nodes).

Our results show that OCT and OCTs have similar behavior on continuous variables and the performance of OCT is better for discrete data. In addition, they perform equally well with the BIC and AIC approaches. StARS consistently underperforms. While the method is able to tune the lambda parameter of the graphical Lasso, its success does not seem to transfer to Bayesian Network tuning of algorithms and their hyper-parameters, at least in its current implementation.

Figure 3: Comparison of the methods over increasing number of parents (left) and network size (right) for networks with hidden confounders. OCT and OCTs marginally increase performance.

Despite the computational cost of OCT, it is generalizable to multiple cases where BIC or AIC cannot be computed. For example, BIC and AIC are not suitable for models without likelihood estimators. In addition, they are not applicable in discrete causal models with hidden variables. As a proof of concept, we applied our method for tuning algorithms that learn causal graphs with hidden confounders: We simulated 50 DAGs with varying density and number of nodes, and set 30% of the nodes as latent. We used FCI and RFCI as implemented in Tetrad, varying the level of significance of the conditional independence test and the maximum conditioning set, as before. We tuned the algorithms with OCT, OCTs and StARS. The results (Fig. 3) show that both OCT and OCTs behave reasonably, and lead to increased performance, albeit marginally. StARS and random guessing perform slightly worse, but similarly. This is probably due to the fact that the number of algorithmic configurations is smaller, and all algorithms are of the same type (constraint-based).

6. Discussion and Conclusions

It is impossible to evaluate a causal discovery method on its causal predictions based on observational data alone. However, we can evaluate causal models w.r.t. their predictive power. We propose an algorithm, called Out-of-sample Causal Tuning (OCT), that employs this principle to select among several choices of algorithms and their hyper-parameter values to use on a given causal discovery problem. It performs on par or better than two previously proposed selection methods based on network stability across subsamples and on the Bayesian and the Akaike Information Criteria. However, the optimal selection of algorithm and hyper-parameter values depends on the metric of performance, thus no tuning method can simultaneously optimize for all metrics. Even though OCT optimizes the predictive power of the resulting causal models, it still manages to simultaneously optimize the causalities encoded in the network reasonably well. In the present study we used only synthetic data following parametric distributions: either the variables are all continuous, following a multivariate Gaussian, or all discrete, following a multinomial. A major current limitation of OCT is that it assumes an appropriate predictive modeling algorithm for the data is employed (linear regression and Decision Tree, within the scope of the experiments). The efficient tuning of the predictive modeling required by OCT is a natural next step in this research direction.

Acknowledgments

The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. 617393.

References

C.F. Aliferis, I. Tsamardinos, A.R. Statnikov, and L.E. Brown. Causal Explorer: A causal probabilistic network learning toolkit for biomedical discovery. In METMBS, 2003.

G.F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.

J.H. Friedman, T.J. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

W. Fu and P. Perry. Estimating the number of clusters using cross-validation. Journal of Computational and Graphical Statistics, 2017.

D.P. Hofmeyr. Degrees of freedom and model selection for k-means clustering. ArXiv, 2018.

H. Liu, K. Roeder, and L.A. Wasserman. Stability approach to regularization selection (StARS) for high dimensional graphical models. NeurIPS, 242:1432–1440, 2010.

M. Maathuis, M. Kalisch, and P. Bühlmann. Estimating high-dimensional intervention effects from observational data. Annals of Statistics, 37:3133–3164, 2009.

N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.

J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, USA, 2nd edition, 2009.

P.O. Perry. Cross-validation for unsupervised learning. PhD thesis, Stanford University, 2009.

J. Peters and P. Bühlmann. Structural intervention distance for evaluating causal graphs. Neural Computation, 27:771–799, 2015.

V.K. Raghu, A. Poon, and P.V. Benos. Evaluation of causal structure learning methods on mixed data types. Proceedings of Machine Learning Research, 92:48–65, 2018.

J.D. Ramsey and B. Andrews. A comparison of public causal search packages on linear, Gaussian data with no latent variables. CoRR, 2017.

T. Richardson and P. Spirtes. Ancestral graph Markov models. Annals of Statistics, 30(4):962–1030, 2002.

A. Roumpelaki, G. Borboudakis, S. Triantafillou, and I. Tsamardinos. Marginal causal consistency in constraint-based causal learning. In CFA@UAI, 2016.

Y. Saito and S. Yasui. Counterfactual cross-validation: Effective causal model selection from observational data. arXiv preprint arXiv:1909.05299, 2019.

G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.

M. Scutari. Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software, 35(3):1–22, 2010.

M. Stone. An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):44–47, 1977.

K. Tang, R. Wang, and T. Chen. Towards maximizing the area under the ROC curve for multi-class classification problems. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI'11, pages 483–488. AAAI Press, 2011.

C. Thornton, F. Hutter, H.H. Hoos, and K. Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In KDD'13, 2012.

I. Tsamardinos and C.F. Aliferis. Towards principled feature selection: Relevancy, filters and wrappers. In AISTATS, 2003.

I. Tsamardinos, C.F. Aliferis, and A. Statnikov. Time and sample efficient discovery of Markov blankets and direct causal relations. In KDD, pages 673–678, 2003.

I. Tsamardinos, L.E. Brown, and C.F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65:31–78, 2006.
