Journal of Machine Learning Research 12 (2011) 691-730. Submitted 2/10; Revised 11/10; Published 3/11.

Inverse Reinforcement Learning in Partially Observable Environments

Jaedeug Choi (JDCHOI@AI.KAIST.AC.KR)
Kee-Eung Kim (KEKIM@CS.KAIST.AC.KR)
Department of Computer Science
Korea Advanced Institute of Science and Technology
Daejeon 305-701, Korea

Editor: Shie Mannor

Abstract

Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert. Most of the existing IRL algorithms assume that the environment is modeled as a Markov decision process (MDP), although it is desirable to handle partially observable settings in order to handle more realistic scenarios. In this paper, we present IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP). We deal with two cases according to the representation of the given expert's behavior, namely the case in which the expert's policy is explicitly given, and the case in which the expert's trajectories are available instead. The IRL in POMDPs poses a greater challenge than in MDPs since it is not only ill-posed due to the nature of IRL, but also computationally intractable due to the hardness in solving POMDPs. To overcome these obstacles, we present algorithms that exploit some of the classical results from the POMDP literature. Experimental results on several benchmark POMDP domains show that our work is useful for partially observable settings.

Keywords: inverse reinforcement learning, partially observable Markov decision process, inverse optimization, linear programming, quadratically constrained programming

1. Introduction

Inverse reinforcement learning (IRL) was first proposed by Russell (1998) as follows:

Given (1) measurements of an agent's behavior over time, in a variety of circumstances, (2) measurements of the sensory inputs to the agent, (3) a model of the physical environment (including the agent's body). Determine the reward function that the agent is optimizing.

The significance of IRL has emerged from the connection between reinforcement learning (RL) and other research areas such as neurophysiology (Montague and Berns, 2002; Cohen and Ranganath, 2007), behavioral neuroscience (Lee et al., 2004; Niv, 2009) and economics (Erev and Roth, 1998; Borgers and Sarin, 2000; Hopkins, 2007). In these research areas, the reward function is generally assumed to be fixed and known, but it is often non-trivial to come up with an appropriate reward function for each problem. Hence, progress in IRL can have a significant impact on many research areas.

(c) 2011 Jaedeug Choi and Kee-Eung Kim.
IRL is a natural way to examine animal and human behaviors. If the decision maker is assumed to follow the principle of rationality (Newell, 1982), its behavior could be understood by the reward function that the decision maker internally optimizes. In addition, we can exploit the computed reward function to generate an agent that imitates the decision maker's behavior. This will be a useful approach to build an intelligent agent. Another advantage of IRL is that the solution of IRL problems, that is, the reward function, is one of the most transferable representations of the agent's behavior. Although it is not easy to transfer the control policy of the agent to other problems that have a similar structure with the original problem, the reward function could be applied since it compactly represents the agent's objectives and preferences.

In the last decade, a number of studies on IRL have been reported. However, most of the previous IRL algorithms (Ng and Russell, 2000; Abbeel and Ng, 2004; Ramachandran and Amir, 2007; Neu and Szepesvari, 2007; Syed and Schapire, 2008; Ziebart et al., 2008) assume that the agent acts in an environment that can be modeled as a Markov decision process (MDP). Although the MDP assumption provides a good starting point for developing IRL algorithms, the implication is that the agent has access to the true global state of the environment. The assumption of an omniscient agent is often too strong in practice. Even though the agent is assumed to be an expert in the given environment, the agent may be (and often is) making optimal behaviors with a limited sensory capability. Hence, to relax the strong assumption and widen the applicability of IRL to more realistic scenarios, the IRL algorithms should be extended to partially observable environments, which can be modeled as partially observable Markov decision processes.

A partially observable Markov decision process (POMDP) (Sondik, 1971; Monahan, 1982; Kaelbling et al., 1998) is a general mathematical framework for single-agent planning under uncertainty about the effect of actions and the true state of the environment. Recently, many approximate techniques have been developed to compute an optimal control policy for large POMDPs. Thus, POMDPs have increasingly received a significant amount of attention in diverse research areas such as robot navigation (Spaan and Vlassis, 2004; Smith, 2007), dialogue management (Williams and Young, 2007), assisted daily living (Hoey et al., 2007), cognitive radio (Zhao et al., 2007) and network intrusion detection (Lane and Brodley, 2003). However, in order to address real-world problems using POMDPs, first, a model of the environment and the reward function should be obtained. The parameters for the model of an environment, such as transition probabilities and observation probabilities, can be computed relatively easily by counting the events if the true state can be accessed, but determining the reward function is non-trivial. In practice, the reward function is repeatedly hand-tuned by domain experts until a satisfactory policy is acquired. This usually entails a labor intensive process. For example, when developing a spoken dialogue management system, POMDP is a popular framework for computing the dialogue strategy, since we can compute an optimal POMDP policy that is robust to speech recognition error and maintains multiple hypotheses of the user's intention (Williams and Young, 2007). In this domain, transition probabilities and observation probabilities can be calculated from the dialogue corpus collected from a wizard-of-oz study. However, there is no straightforward way to compute the reward function, which should represent the balance among the reward of a successful dialogue, the penalty of an unsuccessful dialogue, and the cost of information gathering. It is manually adjusted until a satisfying dialogue policy is obtained. Therefore, a systematic method is desired to determine the reward function.

In this paper, we describe IRL algorithms for partially observable environments extending our previous results in Choi and Kim (2009). Specifically, we assume that the environment is modeled as a POMDP and try to compute the reward function given that the agent follows an optimal policy. The algorithm is mainly motivated by the classical IRL algorithm by Ng and Russell (2000) and we adapt the algorithm to be robust for large problems by using the methods suggested by Abbeel and Ng (2004). We believe that some of the more recently proposed IRL algorithms (Ramachandran and Amir, 2007; Neu and Szepesvari, 2007; Syed and Schapire, 2008; Ziebart et al., 2008) also could be extended to handle partially observable environments. The aim of this paper is to present a general framework for dealing with partially observable environments, the computational challenges involved in doing so, and some approximation techniques for coping with the challenges. Also, we believe that our work will prove useful for many problems that could be modeled as POMDPs.

The remainder of the paper is structured as follows: Section 2 reviews some definitions and notations of MDP and POMDP. Section 3 presents an overview of the IRL algorithms by Ng and Russell (2000) and Abbeel and Ng (2004). Section 4 gives a formal definition of IRL for partially observable environments, and discusses the fundamental difficulties of IRL and the barriers of extending IRL to partially observable environments. In Section 5, we focus on the problem of IRL with the explicitly given expert's policy. We present the optimality conditions of the reward function and the optimization problems with the computational challenges and some approximation techniques. Section 6 deals with more practical cases where the trajectories of the expert's actions and observations are given. We present algorithms that iteratively find the reward function, comparing the expert's policy and other policies found by the algorithm. Section 7 shows the experimental results of our algorithms in several POMDP domains. Section 8 briefly reviews related work on IRL. Finally, Section 9 discusses some directions for future work.

2. Preliminaries

Before we present the IRL algorithms, we briefly review some definitions and notations of MDP and POMDP to formally describe the completely observable environment and the partially observable environment.

2.1 Markov Decision Process

A Markov decision process (MDP) provides a mathematical framework for modeling a sequential decision making problem under uncertainty about the effect of an agent's action in an environment where the current state depends only on the previous state and action, namely, the Markov property. An MDP is defined as a tuple ⟨S, A, T, R, γ⟩:

- S is the finite set of states.
- A is the finite set of actions.
- T : S x A -> P(S) is the state transition function, where T(s, a, s') denotes the probability P(s'|s, a) of reaching state s' from state s by taking action a.
- R : S x A -> R is the reward function, where R(s, a) denotes the immediate reward of executing action a in state s, whose absolute value is bounded by R_max.
- γ ∈ [0, 1) is the discount factor.

A policy in an MDP is defined as a mapping π : S -> A, where π(s) = a denotes that action a is always executed in state s following the policy π. The value function of policy π at state s is the expected discounted return of starting in state s and executing the policy. The value function can be computed as:
$$V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} T(s, \pi(s), s') V^{\pi}(s'). \quad (1)$$

Given an MDP, the agent's objective is to find an optimal policy π* that maximizes the value for all the states, which should satisfy the Bellman optimality equation:
$$V^{*}(s) = \max_{a} \Big[ R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') V^{*}(s') \Big].$$

It is often useful to express the above equation in terms of the Q-function: π is an optimal policy if and only if
$$\pi(s) \in \operatorname*{argmax}_{a \in A} Q^{\pi}(s,a),$$
where
$$Q^{\pi}(s,a) = R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') V^{\pi}(s'), \quad (2)$$
which is the expected discounted return of executing action a in state s and then following the policy π.

2.2 Partially Observable Markov Decision Process

A partially observable Markov decision process (POMDP) is a general framework for modeling the sequential interaction between an agent and a partially observable environment where the agent cannot completely perceive the underlying state but must infer the state based on the given noisy observation. A POMDP is defined as a tuple ⟨S, A, Z, T, O, R, b_0, γ⟩:

- S, A, T, R and γ are defined in the same manner as in MDPs.
- Z is the finite set of observations.
- O : S x A -> P(Z) is the observation function, where O(s, a, z) denotes the probability P(z|s, a) of perceiving observation z when taking action a and arriving in state s.
- b_0 is the initial state distribution, where b_0(s) denotes the probability of starting in state s.

Since the true state is hidden, the agent has to act based on the history of executed actions and perceived observations. Denoting the set of all possible histories at the t-th time step as H_t = (A x Z)^t, a policy in a POMDP is defined as a mapping from histories to actions, π : H_t -> A. However, since the number of possible histories grows exponentially with the number of time steps, many POMDP algorithms use the concept of belief. Formally, the belief b is the probability distribution over the current states, where b(s) denotes the probability that the state is s at the current time step, and Δ denotes a (|S|-1)-dimensional belief simplex. The belief update for the next time step can be computed from the belief at the current time step: Given the action a at the current time step and the observation z at the next time step, the updated belief b^a_z for the next time step is obtained by
$$b^{a}_{z}(s') = P(s'|b,a,z) = \frac{O(s',a,z) \sum_{s} T(s,a,s') b(s)}{P(z|b,a)}, \quad (3)$$
where the normalizing factor is $P(z|b,a) = \sum_{s'} O(s',a,z) \sum_{s} T(s,a,s') b(s)$. Hence, the belief serves as a sufficient statistic for fully summarizing histories, and the policy can be equivalently defined as a mapping π : Δ -> A, where π(b) = a specifies action a to be selected at the current belief b by the policy π. Using beliefs, we can view POMDPs as belief-state MDPs, and the value function of an optimal policy satisfies the Bellman equation:
$$V^{*}(b) = \max_{a} \Big[ \sum_{s} b(s) R(s,a) + \gamma \sum_{s',z} T(s,a,s') O(s',a,z) V^{*}(b^{a}_{z}) \Big]. \quad (4)$$

Alternatively, a policy in a POMDP can be represented as a finite state controller (FSC). An FSC policy is defined by a directed graph ⟨N, E⟩, where each node n ∈ N is associated with an action a ∈ A and has an outgoing edge e_z ∈ E per observation z ∈ Z. The policy can be regarded as π = ⟨ψ, η⟩ where ψ is the action strategy associating each node n with an action ψ(n) ∈ A and η is the observation strategy associating each node n and observation z with a successor node η(n, z) ∈ N.

Given an FSC policy π = ⟨ψ, η⟩, the value function V^π is the expected discounted return of executing π and is defined over the joint space of FSC nodes and POMDP states. It can be computed by solving a system of linear equations:
$$V^{\pi}(n,s) = R(s,a) + \gamma \sum_{n',s'} T^{a,os}(\langle n,s \rangle, \langle n',s' \rangle) V^{\pi}(n',s'), \quad (5)$$
where
$$T^{a,os}(\langle n,s \rangle, \langle n',s' \rangle) = T(s,a,s') \sum_{z \in Z \,:\, os(z)=n'} O(s',a,z), \quad (6)$$
with a = ψ(n) and os(z) = η(n, z). The value at node n for belief b is calculated by
$$V^{\pi}(n,b) = \sum_{s} b(s) V^{\pi}(n,s), \quad (7)$$
and the starting node for the initial belief b_0 is chosen by n_0 = argmax_n V^π(n, b_0). We can also define the Q-function for an FSC policy π:
$$Q^{\pi}(\langle n,s \rangle, \langle a,os \rangle) = R(s,a) + \gamma \sum_{n',s'} T^{a,os}(\langle n,s \rangle, \langle n',s' \rangle) V^{\pi}(n',s'),$$
which is the expected discounted return of choosing action a at node n and moving to node os(z) upon observation z, and then following policy π. Also, the Q-function for node n at belief b is computed by
$$Q^{\pi}(\langle n,b \rangle, \langle a,os \rangle) = \sum_{s} b(s) Q^{\pi}(\langle n,s \rangle, \langle a,os \rangle).$$

With an FSC policy π, we can sort the reachable beliefs into nodes, such that B_n denotes the set of beliefs that are reachable from the initial belief b_0 and the starting node n_0 when the current node is n. Note that |B_n| >= 1 for every node n.
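As a concrete illustration of Equations (5)-(7), the following is a minimal NumPy sketch (our own illustration, not code from the paper) of FSC policy evaluation: it assembles the joint transition matrix of Equation (6), solves the linear system of Equation (5) for V^π(n, s), and projects the result onto a belief as in Equation (7). The array layouts and variable names are assumptions made for this sketch.

```python
import numpy as np

def evaluate_fsc(T, O, R, gamma, psi, eta):
    """Solve Equation (5) for V^pi(n, s) of an FSC policy <psi, eta>.

    T:   |S| x |A| x |S| transition probabilities T(s, a, s')
    O:   |S| x |A| x |Z| observation probabilities O(s', a, z)
    R:   |S| x |A| reward function R(s, a)
    psi: action strategy, psi[n] = action at node n
    eta: |N| x |Z| observation strategy, eta[n, z] = successor node
    """
    S, Z, N = R.shape[0], O.shape[2], len(psi)
    # Joint transition matrix over (node, state) pairs, Equation (6).
    T_joint = np.zeros((N * S, N * S))
    for n in range(N):
        a = psi[n]
        for z in range(Z):
            n_next = eta[n, z]
            # T(s, a, s') * O(s', a, z), accumulated over observations leading to n'.
            T_joint[n * S:(n + 1) * S, n_next * S:(n_next + 1) * S] += \
                T[:, a, :] * O[:, a, z][np.newaxis, :]
    # Immediate reward of each (node, state) pair under the action strategy.
    R_joint = np.array([R[s, psi[n]] for n in range(N) for s in range(S)])
    # V = R + gamma * T_joint V  <=>  (I - gamma * T_joint) V = R
    V = np.linalg.solve(np.eye(N * S) - gamma * T_joint, R_joint)
    return V.reshape(N, S)

def value_at_belief(V, n, b):
    """Equation (7): V^pi(n, b) = sum_s b(s) V^pi(n, s)."""
    return float(b @ V[n])
```

The starting node can then be selected as n_0 = argmax_n V^π(n, b_0) by evaluating value_at_belief over all nodes of the controller.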
3. IRL in Completely Observable Markovian Environments

The MDP framework provides a good starting point for developing IRL algorithms in completely observable Markovian environments and most of the previous IRL algorithms address the problems in the MDP framework. In this section, we overview the IRL algorithms proposed by Ng and Russell (2000) and Abbeel and Ng (2004) as background to our work.

The IRL problem in completely observable Markovian environments is denoted with IRL for MDP\R, which is formally stated as follows: Given an MDP\R ⟨S, A, T, γ⟩ and an expert's policy π_E, find the reward function R that makes π_E an optimal policy for the given MDP. The problem can be categorized into two cases: The first case is when an expert's policy is explicitly given and the second case is when an expert's policy is implicitly given by its trajectories.

3.1 IRL for MDP\R from Policies

Let us assume that an expert's policy π_E is explicitly given. Ng and Russell (2000) present a necessary and sufficient condition for the reward function R of an MDP to guarantee the optimality of π_E:
$$Q^{\pi_E}(s, \pi_E(s)) \geq Q^{\pi_E}(s, a), \quad \forall s \in S, \forall a \in A, \quad (8)$$
which states that deviating from the expert's policy should not yield a higher value. From the condition, they suggest the following:

Theorem 1 [Ng and Russell, 2000] Let an MDP\R ⟨S, A, T, γ⟩ be given. Then the policy π is optimal if and only if the reward function R satisfies
$$R_{\pi} - R_{a} + \gamma (T_{\pi} - T_{a})(I - \gamma T_{\pi})^{-1} R_{\pi} \succeq 0, \quad \forall a \in A, \quad (9)$$
where the matrix notations and the matrix operator are defined as follows:

- T_π is a |S| x |S| matrix with (s, s') element being T(s, π(s), s').
- T_a is a |S| x |S| matrix with (s, s') element being T(s, a, s').
- R_π is a |S|-vector with s-th element being R(s, π(s)).
- R_a is a |S|-vector with s-th element being R(s, a).
- V^π is a |S|-vector with s-th element being V^π(s).
- X ⪰ Y iff X(i) >= Y(i) for all i, if the length of X is the same as that of Y.

Proof. Equation (1) can be rewritten as V^π = R_π + γ T_π V^π. Thus,
$$V^{\pi} = (I - \gamma T_{\pi})^{-1} R_{\pi}. \quad (10)$$
By the definition of an optimal policy and Equation (2), π is optimal if and only if
$$\pi(s) \in \operatorname*{argmax}_{a \in A} Q^{\pi}(s,a) = \operatorname*{argmax}_{a \in A} \Big( R(s,a) + \gamma \sum_{s'} T(s,a,s') V^{\pi}(s') \Big), \quad \forall s \in S,$$
$$\Leftrightarrow \; R(s,\pi(s)) + \gamma \sum_{s'} T(s,\pi(s),s') V^{\pi}(s') \geq R(s,a) + \gamma \sum_{s'} T(s,a,s') V^{\pi}(s'), \quad \forall s \in S, \forall a \in A.$$
By rephrasing with the matrix notations and substituting with Equation (10),
$$R_{\pi} + \gamma T_{\pi} V^{\pi} \succeq R_{a} + \gamma T_{a} V^{\pi}, \quad \forall a \in A$$
$$\Leftrightarrow \; R_{\pi} + \gamma T_{\pi} (I - \gamma T_{\pi})^{-1} R_{\pi} \succeq R_{a} + \gamma T_{a} (I - \gamma T_{\pi})^{-1} R_{\pi}, \quad \forall a \in A$$
$$\Leftrightarrow \; R_{\pi} - R_{a} + \gamma (T_{\pi} - T_{a})(I - \gamma T_{\pi})^{-1} R_{\pi} \succeq 0, \quad \forall a \in A.$$

Equation (9) bounds the feasible space of the reward functions that guarantee the optimality of the expert's policy, and there exist infinitely many reward functions that satisfy Equation (9). As a degenerate case, R = 0 is always a solution. Thus, given the expert's policy π_E, which is assumed to be optimal, the reward function is found by solving the optimization problem in Table 1, where λ is an adjustable weight for the penalty of having too many non-zero entries in the reward function. The objective is to maximize the sum of the margins(1) between the expert's policy and all other policies that deviate a single step from the expert's policy, in the hope that the expert's policy is optimal while favoring sparseness in the reward function.

1. We found it more successful to use the sum-of-margins approach than the minimum-of-margins approach in the original paper, since the latter may fail when there are multiple optimal policies.

Table 1: Optimization problem of IRL for MDP\R from the expert's policy.
$$\begin{aligned} \underset{R}{\text{maximize}} \quad & \sum_{s} \sum_{a \in A \setminus \pi_E(s)} \Big[ Q^{\pi_E}(s, \pi_E(s)) - Q^{\pi_E}(s,a) \Big] - \lambda \|R\|_1 \\ \text{subject to} \quad & R_{\pi_E} - R_{a} + \gamma (T_{\pi_E} - T_{a})(I - \gamma T_{\pi_E})^{-1} R_{\pi_E} \succeq 0, \quad \forall a \in A \\ & |R(s,a)| \leq R_{\max}, \quad \forall s \in S, \forall a \in A \end{aligned}$$
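To make Theorem 1 concrete, here is a small sketch (our own illustration, not the authors' code) that builds the left-hand side of Equation (9) with NumPy for every action, so that checking the feasibility of a candidate reward reduces to checking that all entries are non-negative. The array layout and function name are assumptions of the sketch.

```python
import numpy as np

def theorem1_lhs(T, R, pi, gamma):
    """Stack R_pi - R_a + gamma (T_pi - T_a)(I - gamma T_pi)^{-1} R_pi over all actions.

    T:  |S| x |A| x |S| transition probabilities T(s, a, s')
    R:  |S| x |A| candidate reward function
    pi: expert policy as an integer array, pi[s] = action chosen in state s
    Returns an |A| x |S| array; pi is optimal for R iff every entry is >= 0
    (Equation (9)).
    """
    S, A = R.shape
    states = np.arange(S)
    T_pi = T[states, pi, :]            # |S| x |S|, row s is T(s, pi(s), .)
    R_pi = R[states, pi]               # |S| vector, R(s, pi(s))
    V_pi = np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)   # Equation (10)
    lhs = np.empty((A, S))
    for a in range(A):
        lhs[a] = R_pi - R[:, a] + gamma * (T_pi - T[:, a, :]) @ V_pi
    return lhs

# The feasibility check used in the LP of Table 1 is then simply:
# np.all(theorem1_lhs(T, R, pi, gamma) >= 0)
```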
3.2 IRL for MDP\R from Sampled Trajectories

In some cases, we have to assume that the expert's policy is not explicitly given but instead the trajectories of the expert's policy in the state and action spaces are available.(2) The m-th trajectory of the expert's policy is defined as the H-step state and action sequences {s^m_0, s^m_1, ..., s^m_{H-1}} and {a^m_0, a^m_1, ..., a^m_{H-1}}.

2. Although only the trajectories of states and actions are available, the transition function T is assumed to be known in MDP\R.

In order to address problems with large state spaces, Ng and Russell (2000) use a linear approximation for the reward function, and we also assume that the reward function is linearly parameterized as
$$R(s,a) = \alpha_1 \phi_1(s,a) + \alpha_2 \phi_2(s,a) + \cdots + \alpha_d \phi_d(s,a) = \alpha^{T} \phi(s,a), \quad (11)$$
where the basis functions φ : S x A -> [0,1]^d are known and the weight vector is α = [α_1, α_2, ..., α_d]^T ∈ R^d. We also assume without loss of generality that α ∈ [-1, 1]^d.

Then, from the given M trajectories, the value of π_E for the starting state s_0 is estimated by the average empirical return for an estimated reward function R̂ = α̂^T φ:
$$\hat{V}^{\pi_E}(s_0) = \frac{1}{M} \sum_{m=1}^{M} \sum_{t=0}^{H-1} \gamma^t \hat{R}(s^m_t, a^m_t) = \frac{1}{M} \hat{\alpha}^{T} \sum_{m=1}^{M} \sum_{t=0}^{H-1} \gamma^t \phi(s^m_t, a^m_t).$$

The algorithm is presented in Algorithm 1. It starts with the set of policies Π initialized by a base case random policy π_1. Ideally, the true reward function R should yield V^{π_E}(s_0) >= V^{π}(s_0) for all π ∈ Π since the expert's policy π_E is assumed to be an optimal policy with respect to R. The values of other policies with a candidate reward function R̂ are either estimated by sampling trajectories or are exactly computed by solving the Bellman equation, Equation (1). The algorithm iteratively tries to find a better reward function R̂, given the set of policies Π = {π_1, ..., π_k} found by the algorithm up to iteration k, by solving the optimization problem in line 3, where p(x) is a function that favors x > 0.(3) The algorithm then computes a new policy π_{k+1} that maximizes the value function under the new reward function, and adds π_{k+1} to Π. The algorithm continues until it has found a satisfactory reward function.

3. Ng and Russell (2000) chose p(x) = x if x >= 0, and p(x) = 2x if x < 0 in order to favor x > 0 but penalize x < 0 more. The coefficient of 2 was heuristically chosen.

Algorithm 1: IRL for MDP\R from the sampled trajectories using LP.
  Input: MDP\R ⟨S, A, T, γ⟩, basis functions φ, M trajectories
  1: Choose a random initial policy π_1 and set Π = {π_1}.
  2: for k = 1 to MaxIter do
  3:   Find α̂ by solving the linear program:
         maximize_{α̂}  Σ_{π ∈ Π} p( V̂^{π_E}(s_0) - V̂^{π}(s_0) )
         subject to  |α̂_i| <= 1,  i = 1, 2, ..., d
  4:   Compute an optimal policy π_{k+1} for the MDP with R̂ = α̂^T φ.
  5:   if V̂^{π_E}(s_0) >= V^{π_{k+1}}(s_0) then
  6:     return R̂
  7:   else
  8:     Π = Π ∪ {π_{k+1}}
  9:   end if
  10: end for
  11: return R̂
  Output: the reward function R̂

The above algorithm was extended for apprenticeship learning in the MDP framework by Abbeel and Ng (2004). The goal of apprenticeship learning is to learn a policy from an expert's demonstrations without a reward function, so it does not compute the exact reward function that the expert is optimizing but rather the policy whose performance is close to that of the expert's policy on the unknown reward function. This is worth reviewing, as we adapt this algorithm to address the IRL problems in partially observable environments.

We assume that there are some known basis functions φ and the reward function is linearly parameterized with the weight vector α as in Equation (11). Also, assume ‖α‖_1 <= 1 to bound R_max by 1. The value of a policy π can be written using the feature expectation μ(π) for the reward function R = α^T φ as follows:
$$V^{\pi}(s_0) = E\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \,\Big|\, s_0\Big] = E\Big[\sum_{t=0}^{\infty} \gamma^t \alpha^{T} \phi(s_t, \pi(s_t)) \,\Big|\, s_0\Big] = \alpha^{T} E\Big[\sum_{t=0}^{\infty} \gamma^t \phi(s_t, \pi(s_t)) \,\Big|\, s_0\Big] = \alpha^{T} \mu(\pi),$$
where μ(π) = E[ Σ_{t=0}^{∞} γ^t φ(s_t, π(s_t)) | s_0 ]. Since the expert's policy is not explicitly given, the feature expectation of the expert's policy cannot be exactly computed. Thus, we empirically estimate the expert's feature expectation μ_E = μ(π_E) from the given expert's M trajectories of the visited states {s^m_0, s^m_1, ..., s^m_{H-1}} and the executed actions {a^m_0, a^m_1, ..., a^m_{H-1}} by
$$\hat{\mu}_E = \frac{1}{M} \sum_{m=1}^{M} \sum_{t=0}^{H-1} \gamma^t \phi(s^m_t, a^m_t).$$

Abbeel and Ng (2004) propose apprenticeship learning algorithms for finding a policy whose value is similar to that of the expert's policy based on the idea that the difference of the values between the obtained policy π and the expert's policy π_E is bounded by the difference between their feature expectations. Formally, this is written as follows:
$$|V^{\pi_E}(s_0) - V^{\pi}(s_0)| = |\alpha^{T} \mu(\pi_E) - \alpha^{T} \mu(\pi)| \leq \|\alpha\|_2 \|\mu_E - \mu(\pi)\|_2 \leq \|\mu_E - \mu(\pi)\|_2 \quad (12)$$
since ‖α‖_1 is assumed to be bounded by 1.

The algorithm is presented in Algorithm 2. The optimization problem in line 5 can be considered as the IRL step that tries to find the reward function that the expert is optimizing. It is similar to the optimization problem in Algorithm 1, except that the optimization problem cannot be modeled as a linear programming (LP) problem but rather as a quadratically constrained programming (QCP) problem because of the L2-norm constraint on α. Algorithm 3 is an approximation algorithm using the projection method instead of QCP, where μ_i denotes μ(π_i) for all i. Both algorithms terminate when t <= ε. It is proved that both algorithms take a finite number of iterations to terminate (Abbeel and Ng, 2004).

Algorithm 2: Apprenticeship learning using QCP.
  Input: MDP\R ⟨S, A, T, γ⟩, basis functions φ, M trajectories
  1: Choose a random initial weight α and set Π = ∅.
  2: repeat
  3:   Compute an optimal policy π for the MDP with R = α^T φ.
  4:   Π = Π ∪ {π}
  5:   Solve the following optimization problem:
         maximize_{t, α}  t
         subject to  α^T μ_E >= α^T μ(π) + t,  ∀π ∈ Π;  ‖α‖_2 <= 1
  6: until t <= ε
  Output: the reward function R

Algorithm 3: Apprenticeship learning using the projection method.
  Input: MDP\R ⟨S, A, T, γ⟩, basis functions φ, M trajectories
  1: Choose a random initial policy π_0.
  2: Set μ̄_0 = μ_0 and i = 1.
  3: repeat
  4:   Set α = μ_E - μ̄_{i-1}.
  5:   Compute an optimal policy π_i for the MDP with R = α^T φ.
  6:   Compute an orthogonal projection of μ_E onto the line through μ̄_{i-1} and μ_i:
         μ̄_i = μ̄_{i-1} + [ (μ_i - μ̄_{i-1})^T (μ_E - μ̄_{i-1}) / (μ_i - μ̄_{i-1})^T (μ_i - μ̄_{i-1}) ] (μ_i - μ̄_{i-1})
  7:   Set t = ‖μ_E - μ̄_i‖_2, and i = i + 1.
  8: until t <= ε
  Output: the reward function R
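The projection step in line 6 of Algorithm 3 is just vector arithmetic on feature expectations. The following is a minimal NumPy sketch (our own illustration, not code from the paper) of that update and of the resulting termination quantity t = ‖μ_E - μ̄_i‖_2.

```python
import numpy as np

def project_step(mu_E, mu_bar_prev, mu_i):
    """One projection-method update (line 6 of Algorithm 3).

    mu_E:        empirical feature expectation of the expert
    mu_bar_prev: previously projected point, mu_bar_{i-1}
    mu_i:        feature expectation of the newest policy pi_i
    Returns (mu_bar_i, t) where t = ||mu_E - mu_bar_i||_2.
    """
    d = mu_i - mu_bar_prev
    # Orthogonal projection of mu_E onto the line through mu_bar_prev and mu_i.
    mu_bar = mu_bar_prev + (d @ (mu_E - mu_bar_prev)) / (d @ d) * d
    t = np.linalg.norm(mu_E - mu_bar)
    return mu_bar, t

# The next reward weight is alpha = mu_E - mu_bar (line 4), and the loop
# stops once t falls below the tolerance epsilon.
```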
4. IRL in Partially Observable Environments

We denote the problem of IRL in partially observable environments as IRL for POMDP\R and the objective is to determine the reward function that the expert is optimizing. Formally, IRL for POMDP\R is defined as follows: Given a POMDP\R ⟨S, A, Z, T, O, b_0, γ⟩ and an expert's policy π_E, find the reward function R that makes π_E an optimal policy for the given POMDP. Hence, the reward function found by IRL for POMDP\R should guarantee the optimality of the expert's policy for the given POMDP. IRL for POMDP\R mainly suffers from two sources: First, IRL is fundamentally ill-posed, and second, computational intractability arises in IRL for POMDP\R in contrast with IRL for MDP\R. We describe these challenges below.

An IRL problem is an ill-posed problem, which is a mathematical problem that is not well-posed. The three conditions of a well-posed problem are existence, uniqueness, and stability of the solution (Hadamard, 1902). IRL violates the condition of uniqueness. An IRL problem may have an infinite number of solutions since there may be an infinite number of reward functions that guarantee the optimality of the given expert's policy. A degenerate one is the solution of every IRL problem since R = 0 yields every policy optimal. Also, given an optimal policy for a reward function, we can find some other reward function that yields the same optimal policy without any modification to the environment by the technique of reward shaping (Ng et al., 1999).

As suggested by Ng and Russell (2000), we can guarantee the optimality of the expert's policy by comparing the value of the expert's policy with that of all possible policies. However, there are an infinite number of policies in a finite POMDP, since a policy in a POMDP is defined as a mapping from a continuous belief space to a finite action space or represented by an FSC policy that might have an infinite number of nodes. In contrast, there are a finite number of policies in a finite MDP, since a policy in an MDP is defined as a mapping from a finite state space to a finite action space. In addition, in order to compare two policies in a POMDP, the values of those policies should be compared for all beliefs, because the value function is defined on a belief space. This intractability of IRL for POMDP\R originates from the same cause as the difficulty of solving a POMDP. The optimal policy of a POMDP is the solution of a belief-state MDP using the concept of belief. It is then difficult to solve an MDP with a continuous state space, since a policy and its value function are respectively defined as a mapping from the continuous state space to the finite action space and to the real numbers.

In the following sections, we address the problem of IRL for POMDP\R, considering two cases as in the approaches to IRL for MDP\R. The first case is when the expert's policy is explicitly represented in the form of an FSC. The second case is when the expert's policy is implicitly given by the trajectories of the expert's executed actions and the corresponding observations. Although the second case has wider applicability than the first case, the first case can be applied to some practical problems. For example, when building dialogue management systems, we may already have a dialogue policy engineered by human experts, but we still do not know the reward function that produces the expert's policy. We propose several methods for the problems of IRL for POMDP\R in these two cases. For the first case, we formulate the problem with constraints for the reward functions that guarantee the optimality of the expert's policy. To address the intractability of IRL for POMDP\R, we derive conditions involving a small number of policies, exploiting the results of the classical POMDP research. For the second case, we propose iterative algorithms of IRL for POMDP\R. The motivation for this approach is from Ng and Russell (2000). We also extend the algorithms proposed by Abbeel and Ng (2004) to partially observable environments.

5. IRL for POMDP\R from FSC Policies

In this section, we present IRL algorithms for POMDP\R when the expert's policy is explicitly given. We assume that the expert's policy is represented in the form of an FSC, since the FSC is one of the most natural ways to specify a policy in POMDPs. We propose three conditions for the reward function to guarantee the optimality of the expert's policy based on comparing Q-functions and using the generalized Howard's policy improvement theorem (Howard, 1960) and the witness theorem (Kaelbling et al., 1998). We then complete the optimization problems to determine a desired reward function.
5.1 Q-function Based Approach

We could derive a simple and naive condition for the optimality of the expert's policy by comparing the value of the expert's policy with those of all other policies. Given an expert's policy π_E defined by a directed graph ⟨N, E⟩,
$$V^{\pi_E}(n,b) \geq V^{\pi'}(n',b), \quad \forall b \in \Delta_n, \forall n' \in N', \quad (13)$$
for all nodes n ∈ N and all other policies π' defined by a directed graph ⟨N', E'⟩, where Δ_n denotes the set of all the beliefs where node n is optimal. Since V^{π_E} and V^{π'} are linear in terms of the reward function R by Equations (5) and (7), the above inequality yields the set of linear constraints that defines the feasible region of the reward functions that guarantees the expert's policy to be optimal. However, enumerating all the constraints is clearly infeasible because we have to take into account all other policies π', including those with an infinite number of nodes, as well as all the infinitely many beliefs in Δ_n. In other words, Equation (13) yields infinitely many linear constraints.

Hence, we propose a simple heuristic for choosing a finite subset of constraints that hopefully yields a tight specification of the feasible region for the true reward function. First, among the infinitely many policies, we only consider policies that are slightly modified from the expert's policy since they are similar to the expert's policy yet must be suboptimal. We select as the similar policies those that deviate one step from the expert's action and observation strategies, analogous to Equation (8). For each node n ∈ N, there are |A||N|^{|Z|} ways to deviate from the expert's action and observation strategies, hence we consider a total of |N||A||N|^{|Z|} policies that deviate one step from the expert's policy. Second, instead of considering all possible beliefs in Δ_n, we only consider the finitely sampled beliefs reachable by the expert's policy. The motivation for using the sampled beliefs comes from the fact that only the set of beliefs reachable under the optimal policy is important for solving POMDPs, and it is also widely used in most of the recent approximate POMDP solvers (Spaan and Vlassis, 2005; Smith and Simmons, 2005; Pineau et al., 2006; Ji et al., 2007; Kurniawati et al., 2008).

The above heuristic yields the following finite set of linear constraints: Given an expert's policy π_E = ⟨ψ, η⟩,
$$Q^{\pi_E}(\langle n,b \rangle, \langle \psi(n), \eta(n,\cdot) \rangle) \geq Q^{\pi_E}(\langle n,b \rangle, \langle a, os \rangle), \quad \forall b \in B_n, \forall a \in A, \forall os \in N^{Z}, \quad (14)$$
for every node n in π_E, where B_n ⊆ Δ_n denotes the set of sampled beliefs that are visited at node n when following the expert's policy π_E from the initial belief b_0. The above condition states that any policy that deviates one step from the expert's action and observation strategies should not have a higher value than the expert's policy does. Note that the condition is a necessary though not a sufficient one, since we do not use the set of all other policies but use the set of |N||A||N|^{|Z|} policies that have the same (or possibly fewer) number of nodes as the expert's policy, nor do we use the set of all beliefs in Δ_n but use the set of sampled beliefs.

We use a simple example illustrating the approach. Consider a POMDP with two actions and two observations, where the expert's policy π_E is the FSC represented by solid lines in Figure 1. The nodes are labeled with actions (a_0 and a_1) and the edges are labeled with observations (z_0 and z_1). In order to find the region of the reward functions that yields π_E as optimal, we build one-step deviating policies as mentioned above. The policies π'_0, π'_1, ..., π'_7 in the figure are the one-step deviating policies for node n_0 of π_E. Note that π'_i visits node n'_i instead of the original node n_0 and then exactly follows π_E. We then enumerate the constraints in Equation (14), comparing the value of π_E to that of each one-step deviating policy. Specifically, the value at node n_0 of π_E is constrained to be not less than the value at node n'_i of π'_i, since deviating from the expert's policy should be suboptimal. To build the complete set of constraints in Equation (14), we additionally generate one-step deviating policies for node n_1 of π_E in a similar manner. We thus have |N||A||N|^{|Z|} = 2 x 2 x 2^2 = 16 policies that deviate one step from π_E.

Figure 1: Set of the policies that deviate one step from node n_0 of the expert's policy.
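The enumeration behind Equation (14) is mechanical, so we sketch it below (our own illustration under assumed data structures, not the authors' code): for every node of the expert's FSC it lists all |A||N|^{|Z|} alternative pairs of an action and an observation strategy, each of which yields one constraint per sampled belief in B_n.

```python
from itertools import product

def one_step_deviations(num_nodes, num_actions, num_obs):
    """Enumerate the |N| * |A| * |N|^|Z| one-step deviations of an FSC.

    Yields tuples (n, a, os) meaning: at node n, replace the expert's
    action with a and its observation strategy with os, where os gives
    the successor node for each observation; elsewhere the expert's
    policy is followed unchanged.
    """
    for n in range(num_nodes):
        for a in range(num_actions):
            for os in product(range(num_nodes), repeat=num_obs):
                yield n, a, os

# For the two-action, two-observation example with a two-node expert FSC,
# this yields 2 * 2 * 2**2 = 16 deviations, matching the count in the text.
deviations = list(one_step_deviations(num_nodes=2, num_actions=2, num_obs=2))
assert len(deviations) == 16
```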
5.2 Dynamic Programming (DP) Update Based Approach

A more systematic approach to defining the set of policies to be compared with the expert's policy is to use the set of FSC policies that arise during the DP update of the expert's policy. Given the expert's policy π_E, the DP update generates |A||N|^{|Z|} new nodes for all possible action and observation strategies, and these nodes can potentially be a new starting node. The expert's policy should be optimal if the value is not improved for any belief by the dynamic programming update. This idea comes from the generalized Howard's policy improvement theorem (Howard, 1960):

Theorem 2 [Hansen, 1998] If an FSC policy is not optimal, the DP update transforms it into an FSC policy with a value function that is as good or better for every belief state and better for some belief state.

The complete proof of the generalized policy improvement theorem can be found in Hansen (1998), but we give the full proof of the theorem for the convenience of the readers. First, we should prove the following lemma.

Lemma 1 Given an FSC policy π = ⟨ψ, η⟩ and a node n_new, which is not included in π, the value function of node n_new ∈ N_new with the action strategy of selecting action a and observation strategy os is computed by
$$V^{new}(n_{new}, s) = R(s,a) + \gamma \sum_{n',s'} T^{a,os}(\langle n_{new}, s \rangle, \langle n', s' \rangle) V^{\pi}(n', s'), \quad (15)$$
where V^π is calculated from Equation (5) and T^{a,os} is defined in Equation (6). For some node n in π, if V^{new}(n_new, s) >= V^{π}(n, s) for all s ∈ S, the value of the original policy π will not be greater than that of the policy transformed by discarding node n and redirecting all the incoming edges of node n to node n_new.

Proof. We build a new policy π_k that follows the original policy π, but executes the action and observation strategies of n_new for the first k times that node n is visited. The lemma is proved by induction on the number of times k.

For the base step k = 1, the new policy π_1 executes the action and observation strategies of n_new only for the first time that node n is visited, and follows π for the rest of the time steps. Then, for any belief state b,
$$V^{\pi_1}(n,b) = \sum_{s} b(s) V^{\pi_1}(n,s) = \sum_{s} b(s) V^{new}(n_{new}, s) \geq \sum_{s} b(s) V^{\pi}(n,s) = V^{\pi}(n,b),$$
since V^{π_1}(n, s) = V^{new}(n_new, s) for all s ∈ S by the construction.

For the inductive step, we abuse notation to denote R_{π_k}(s_t, a_t) as the reward at the t-th time step by following the policy π_k and starting from belief b and node n. Then, for any belief state b,
$$\begin{aligned} V^{\pi_k}(n,b) &= E\Big[\sum_{t=0}^{\infty} \gamma^t R_{\pi_k}(s_t,a_t) \,\Big|\, b\Big] = E\Big[\sum_{t=0}^{T_k-1} \gamma^t R_{\pi_k}(s_t,a_t) + \sum_{t=T_k}^{\infty} \gamma^t R_{\pi_k}(s_t,a_t) \,\Big|\, b\Big] \\ &= E\Big[\sum_{t=0}^{T_k-1} \gamma^t R_{\pi_k}(s_t,a_t) \,\Big|\, b\Big] + E\Big[\sum_{t=T_k}^{\infty} \gamma^t R_{\pi_k}(s_t,a_t) \,\Big|\, b\Big] \\ &= E\Big[\sum_{t=0}^{T_k-1} \gamma^t R_{\pi_{k-1}}(s_t,a_t) \,\Big|\, b\Big] + \gamma^{T_k} E\big[V^{\pi_k}(n, b_{T_k}) \,\big|\, b\big] \\ &\geq E\Big[\sum_{t=0}^{T_k-1} \gamma^t R_{\pi_{k-1}}(s_t,a_t) \,\Big|\, b\Big] + \gamma^{T_k} E\big[V^{\pi}(n, b_{T_k}) \,\big|\, b\big] = V^{\pi_{k-1}}(n,b), \end{aligned}$$
where T_k represents the k-th time that node n is visited. The first equality holds by the definition of the value function. The fourth equality holds by the construction of π_{k-1} and π_k and the definition of the value function. The fifth inequality holds by V^{π_k}(n, b_{T_k}) = V^{new}(n_new, b_{T_k}) >= V^{π}(n, b_{T_k}), since π_k executes the action and observation strategies of n_new at b_{T_k} and executes those of n for the rest of the time. Hence, by induction, it follows that the value of the transformed policy cannot be decreased by replacing n with n_new.

Using the above lemma, we can prove Theorem 2.

Proof (of Theorem 2). The policy iteration algorithm (Hansen, 1998) transforms the policy by replacing the nodes with new nodes generated by the DP update using the following rules: (1) If there is an old node whose action and observation strategies are the same as those of a new node, the old node is unchanged. (2) If the value at an old node is less than the value at a new node, for any state, the old node is discarded and all the incoming edges of the old node are redirected to the new node. (3) The rest of the new nodes are added to the original policy. Since the value is not decreased by leaving the policy unchanged or adding a node to the policy, the first and the third transformation rules cannot decrease the value. Also, by the above lemma, the second transformation rule cannot decrease the value. Thus, the value of the transformed policy using the DP update does not decrease. Also, if every node generated by the DP update is a duplicate of a node in the original policy, the optimality equation, Equation (4), is satisfied and the original policy is optimal. Thus, if the policy is not optimal, the DP update must generate some non-duplicate nodes that change the policy and improve the values for some belief state.

Figure 2: Set of the newly generated nodes by the DP update.

We should proceed with caution however in the sense that the DP update does not generate all the necessary nodes to guarantee the optimality of the expert's policy for every belief: The nodes in the expert's policy are only those reachable from the starting node n_0, which yields the maximum value at the initial belief b_0. Nodes that yield the maximum value at some other beliefs (i.e., useful) but are not reachable from n_0 are not present in the expert's policy. To guarantee the optimality of the expert's policy for every belief, we would need to generate those non-existent but useful nodes. However, since there is no way to recover them, we only use nodes in the expert's policy and consider only the beliefs reachable by the expert's policy.

Let N_new be the set of nodes newly generated when transforming the expert's policy by the DP update, so that |N_new| = |A||N|^{|Z|}. The value function of node n_new ∈ N_new is computed by Equation (15). The value function of policy π_E should satisfy
$$V^{\pi_E}(n,b) \geq V^{new}(n_{new}, b), \quad \forall b \in B_n, \forall n_{new} \in N_{new}, \quad (16)$$
for every node n ∈ N if the expert's policy π_E is optimal. Note that V^{new} as well as V^{π_E} are linear in terms of the reward function R.

To illustrate the approach, we reuse the example in Section 5.1. Figure 2 shows π_E in solid lines and the set N_new of nodes generated by the DP update in dashed lines. We have |A||N|^{|Z|} = 2 x 2^2 = 8 nodes generated by the DP update, thus N_new = {n'_0, n'_1, ..., n'_7} is the complete set of nodes with all possible action and observation strategies. We then enumerate the constraints in Equation (16), making the value at each node of π_E no less than the values at the nodes in N_new. Since the number of the newly generated nodes by the DP update is smaller than that of the policies generated by the Q-function based approach in Section 5.1, the computational complexity is significantly reduced.

5.3 Witness Theorem Based Approach

A more computationally efficient way to generate the set N_new of new nodes is to use the witness theorem (Kaelbling et al., 1998). We will exploit the witness theorem to find a set of useful nodes that yield the feasible region for the true reward function, as the witness algorithm incrementally generates new policy trees that improve the current policy trees. Here, we say that a node is useful if it has greater value than any other nodes at some beliefs. Formally speaking, given an FSC policy π, we define a set B(n, U) of beliefs where the value function of node n dominates those of all other nodes in the set U:
$$B(n, U) = \{ b \in \Delta \mid V^{new}(n,b) > V^{new}(n',b), \; \forall n' \in U \setminus \{n\} \},$$
where V^{new}(n, b) = Σ_s b(s) V^{new}(n, s) and V^{new}(n, s) is computed by Equation (15). Node n is useful if B(n, U) ≠ ∅, and U is a set of useful nodes if B(n, U) ≠ ∅ for all n ∈ U. We re-state the witness theorem in terms of FSC policies as follows:

Theorem 3 [Kaelbling et al., 1998] An FSC policy π is given as a directed graph ⟨N, E⟩. Let Ũ_a be a nonempty set of useful nodes with the action strategy of choosing action a, and U_a be the complete set of useful nodes with the action strategy of choosing action a. Then, Ũ_a ≠ U_a if and only if there is some node ñ ∈ Ũ_a, observation z*, and node n' ∈ N for which there is a belief b such that V^{new}(n_new, b) > V^{new}(n, b) for all n ∈ Ũ_a, where n_new is a node that agrees with ñ in its action and all its successor nodes except for observation z*, for which η(n_new, z*) = n'.

Proof. The "if" direction of the statement is satisfied because b is a witness point for the existence of a useful node missing from Ũ_a.

The "only if" direction can be rephrased as: If Ũ_a ≠ U_a, then there is a node ñ ∈ Ũ_a, a belief state b, and a new node n_new that has a larger value than any other node n ∈ Ũ_a at b. Choose some node n ∈ U_a \ Ũ_a. Since n is useful, there must be a belief b such that V^{new}(n, b) > V^{new}(n', b) for all nodes n' ∈ Ũ_a. Let ñ = argmax_{n'' ∈ Ũ_a} V^{new}(n'', b). Then, by the construction,
$$V^{new}(n, b) > V^{new}(\tilde{n}, b). \quad (17)$$
Note that action a is always executed at n and ñ, since we consider only the nodes with the action strategy of choosing action a in the theorem. Assume that for every observation z,
$$\sum_{s} b(s) \sum_{s'} T(s,a,s') O(s',a,z) V^{\pi}(\eta(n,z), s') \leq \sum_{s} b(s) \sum_{s'} T(s,a,s') O(s',a,z) V^{\pi}(\eta(\tilde{n},z), s').$$
Then
$$\begin{aligned} V^{new}(n,b) &= \sum_{s} b(s) \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s') \sum_{z} O(s',a,z) V^{\pi}(\eta(n,z), s') \Big] \\ &\leq \sum_{s} b(s) \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s') \sum_{z} O(s',a,z) V^{\pi}(\eta(\tilde{n},z), s') \Big] = V^{new}(\tilde{n}, b), \end{aligned}$$
which contradicts (17). Thus, there must be some observation z* such that
$$\sum_{s} b(s) \sum_{s'} T(s,a,s') O(s',a,z^*) V^{\pi}(\eta(n,z^*), s') > \sum_{s} b(s) \sum_{s'} T(s,a,s') O(s',a,z^*) V^{\pi}(\eta(\tilde{n},z^*), s').$$
Now, if ñ and n differ in only one successor node, then the proof is complete with n, which can serve as the n_new in the theorem. If n and ñ differ in more than one successor node, we will identify another node that can act as n_new. Define n_new to be identical to ñ except for observation z*, for which η(n_new, z*) = η(n, z*). From this, it follows that
$$\begin{aligned} V^{new}(n_{new}, b) &= \sum_{s} b(s) \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s') \sum_{z} O(s',a,z) V^{\pi}(\eta(n_{new},z), s') \Big] \\ &> \sum_{s} b(s) \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s') \sum_{z} O(s',a,z) V^{\pi}(\eta(\tilde{n},z), s') \Big] = V^{new}(\tilde{n}, b) \geq V^{new}(n', b) \end{aligned}$$
for all n' ∈ Ũ_a. Therefore, the nodes ñ and n_new, the observation z*, n' = η(n, z*), and the belief state b satisfy the conditions of the theorem.

Figure 3: Set of the newly generated nodes by the witness theorem.

The witness theorem tells us that if a policy π is optimal, then the value of n_new generated by changing the successor node of each single observation should not increase for any possible belief. This leads us to a smaller set of inequality constraints compared to Equation (16), by defining N_new in a different way. Let N_a = {n ∈ N | ψ(n) = a} and A_N = {a ∈ A | N_a = ∅}. For each action a ∉ A_N, we generate new nodes by the witness theorem: For each node ñ ∈ N_a, z ∈ Z, and n' ∈ N, we make n_new such that ψ(n_new) = ψ(ñ) = a and η(n_new, z') = η(ñ, z') for all z' ∈ Z except for z, for which η(n_new, z) = n'. The maximum number of newly generated nodes by the witness theorem is Σ_a |N_a||N||Z| <= |N|^2 |Z|. Then, for each action a ∈ A_N, we use the DP update to generate |A_N||N|^{|Z|} additional nodes. The number of newly generated nodes |N_new| is no more than |N|^2 |Z| + |A_N||N|^{|Z|}. Note that this number is often much less than |A||N|^{|Z|}, the number of nodes newly generated by the DP update, since the number of actions |A_N| that are not executed at all by the expert's policy is typically much smaller than |A|.

We again reuse the example in Section 5.1 to illustrate the approach. We build the set N_new of new nodes using the witness theorem. The left panel of Figure 3 shows the construction of new node n'_0 from node n_0 such that ψ(n'_0) = ψ(n_0) = a_0 and η(n'_0, z_1) = η(n_0, z_1). The original observation strategy of n_0 for z_0 transits to n_1 (shown in dotted line), and it is changed to n_0 (shown in dashed line). The right panel in the figure presents the complete set N_new of generated nodes using the witness theorem (shown in dashed lines). Nodes n'_0 and n'_1 are generated from node n_0 whereas nodes n'_2 and n'_3 are from node n_1. Note that A_N = ∅ since π_E executes all actions in the model. We thus have a total of 4 generated nodes, which is smaller than those generated by either the Q-function based or the DP update based approach.
5.4 Optimization Problem

In the previous sections, we suggested three constraints for the reward function that stem from the optimality of the expert's policy, but infinitely many reward functions can satisfy the constraints in Equations (14) and (16). We thus present constrained optimization problems with objective functions that encode our preference on the learned reward function. As in Ng and Russell (2000), we prefer a reward function that maximizes the sum of the margins between the expert's policy and other policies. At the same time, we want the reward function to be as sparse as possible, which can be accomplished by adjusting the penalty weight on the L1-norm of the reward function. If we use the Q-function based optimality constraint, that is, Equation (14), the value of the expert's policy is compared with those of all other policies that deviate from the expert's action and observation strategies, as given in Table 2. When using the DP update or the witness theorem based optimality constraint, that is, Equation (16), the policies other than the expert's policy are captured in the newly generated nodes n_new, hence the optimization problem becomes the one given in Table 3.

Table 2: Optimization problem using the Q-function based optimality constraint.
$$\begin{aligned} \underset{R}{\text{maximize}} \quad & \sum_{n \in N} \sum_{b \in B_n} \sum_{\substack{a \in A \setminus \psi(n) \\ os \in N^{Z} \setminus \eta(n,\cdot)}} \Big[ V^{\pi}(\langle n,b \rangle) - Q^{\pi}(\langle n,b \rangle, \langle a, os \rangle) \Big] - \lambda \|R\|_1 \\ \text{subject to} \quad & Q^{\pi}(\langle n,b \rangle, \langle \psi(n), \eta(n,\cdot) \rangle) \geq Q^{\pi}(\langle n,b \rangle, \langle a, os \rangle), \quad \forall b \in B_n, \forall a \in A, \forall os \in N^{Z}, \forall n \in N \\ & |R(s,a)| \leq R_{\max}, \quad \forall s, \forall a \end{aligned}$$

Table 3: Optimization problem using the DP update or the witness theorem based optimality constraint.
$$\begin{aligned} \underset{R}{\text{maximize}} \quad & \sum_{n \in N} \sum_{b \in B_n} \sum_{n_{new} \in N_{new}} \Big[ V^{\pi}(n,b) - V^{new}(n_{new}, b) \Big] - \lambda \|R\|_1 \\ \text{subject to} \quad & V^{\pi}(n,b) \geq V^{new}(n_{new}, b), \quad \forall b \in B_n, \forall n_{new} \in N_{new}, \forall n \in N \\ & |R(s,a)| \leq R_{\max}, \quad \forall s, \forall a \end{aligned}$$

Since all the inequalities and the objective functions in the optimization problems are linear in terms of the reward function, the desired reward function can be found efficiently by solving the linear programming problems. When using the Q-function or the DP update based approach, the number of policies compared with the expert's is exponential in the number of observations, and hence the number of constraints in the optimization problems increases exponentially. This may become intractable even for a small expert's policy. We can address this limitation using the witness theorem based approach, since it is sufficient to consider as few as |N|^2 |Z| nodes if the expert's policy executes all actions, which is common in many POMDP benchmark problems.
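Because every quantity in Table 3 is linear in R, the problem can be handed to any off-the-shelf LP solver. The following is a schematic sketch using scipy.optimize.linprog (our own illustration, not the authors' implementation, which used ILOG CPLEX); it omits the L1 sparsity penalty for brevity and assumes the caller has already expressed the margins V^π(n,b) - V^new(n_new,b) as rows of a matrix acting on the vectorized reward.

```python
import numpy as np
from scipy.optimize import linprog

def solve_reward_lp(D, r_max):
    """Solve a simplified version of the LP in Table 3.

    D: matrix with one row per (n, b, n_new) triple such that
       D @ r = V^pi(n, b) - V^new(n_new, b) for the vectorized reward r
       (building D from Equations (5), (7) and (15) is left to the caller).
    Maximizes the sum of margins sum(D @ r) subject to D @ r >= 0 and
    |r| <= r_max, and returns the learned reward vector.
    """
    num_rewards = D.shape[1]
    c = -D.sum(axis=0)                     # maximize 1^T D r == minimize -1^T D r
    res = linprog(c,
                  A_ub=-D, b_ub=np.zeros(D.shape[0]),   # enforce D r >= 0
                  bounds=[(-r_max, r_max)] * num_rewards)
    if not res.success:
        raise RuntimeError("LP did not converge: " + res.message)
    return res.x
```

The λ‖R‖_1 term from Table 3 can be restored with auxiliary variables u >= |r|, turning the objective into -1^T D r + λ 1^T u while keeping the problem linear.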
6. IRL for POMDP\R from Sampled Trajectories

In some cases, the expert's policy may not be explicitly given, but the records of the expert's trajectories may be available instead.(4) Here, we assume that the set of H-step belief trajectories is given. The m-th trajectory is denoted by {b^m_0, b^m_1, ..., b^m_{H-1}}, where b^m_0 = b_0 for all m ∈ {1, 2, ..., M}. If the trajectories of the perceived observations {z^m_0, z^m_1, ..., z^m_{H-1}} and the executed actions {a^m_0, a^m_1, ..., a^m_{H-1}} following the expert's policy are available instead, we can reconstruct the belief trajectories by using the belief update in Equation (3).

4. As in the IRL for MDP\R from sampled trajectories, we assume that the transition and observation functions are known in POMDP\R.

In order to obtain an IRL algorithm for POMDP\R from the sampled belief trajectories, we linearly parameterize the reward function using the known basis functions φ : S x A -> [0,1]^d and the weight vector α ∈ [-1,1]^d as in Equation (11): R(s,a) = α^T φ(s,a). This assumption is useful for problems with large state spaces, because with some prior knowledge about the problems, we can represent the reward function compactly using the basis functions. For example, in robot navigation problems, the basis functions can be chosen to capture the features of the state space, such as which locations are considered dangerous. In the worst case when no such prior knowledge is available, the basis functions may be designed for each pair of state and action so that the number of basis functions is |S| x |A|. The objective of IRL is then to determine the (unknown) parameter α of the reward function R = α^T φ.

In this section, we propose three trajectory-based IRL algorithms for POMDP\R. The algorithms share the same framework that iteratively repeats estimating the parameter of the reward function using an IRL algorithm and computing an optimal policy for the estimated reward function using a POMDP solver. The first algorithm finds the reward function that maximizes the margin between the values of the expert's policy and other policies for the sampled beliefs using LP. This is a simple extension of Ng and Russell (2000). The second algorithm computes the reward function that maximizes the margin between the feature expectations of the expert's policy and other policies using QCP. The last algorithm approximates the second using the projection method. The second and third algorithms are extended from the methods originally suggested for MDP environments by Abbeel and Ng (2004).

6.1 Max-Margin between Values (MMV) Method

We first evaluate the values of the expert's policy and other policies for the weight vector α of a reward function in order to compare their values. The reward for belief b is calculated by
$$R(b,a) = \sum_{s \in S} b(s) R(s,a) = \sum_{s \in S} b(s) \alpha^{T} \phi(s,a) = \alpha^{T} \phi(b,a),$$
where φ(b,a) = Σ_{s ∈ S} b(s) φ(s,a). We also compute V̂^{π_E}(b^m_0), the empirical return of the expert's m-th trajectory, by
$$\hat{V}^{\pi_E}(b^m_0) = \sum_{t=0}^{H-1} \gamma^t R(b^m_t, a^m_t) = \sum_{t=0}^{H-1} \gamma^t \alpha^{T} \phi(b^m_t, a^m_t).$$
Noting that b^m_0 = b_0 for all m, the expert's average empirical return at b_0 is given by
$$\hat{V}^{\pi_E}(b_0) = \frac{1}{M} \sum_{m=1}^{M} \hat{V}^{\pi_E}(b^m_0) = \alpha^{T} \frac{1}{M} \sum_{m=1}^{M} \sum_{t=0}^{H-1} \gamma^t \phi(b^m_t, a^m_t), \quad (18)$$
which is linear in terms of α. In a similar manner, we can compute the average empirical return of the expert's trajectories at other beliefs b_j by
$$\hat{V}^{\pi_E}(b_j) = \frac{1}{M_j} \sum_{m=1}^{M} \hat{V}^{\pi_E}(b_j) = \alpha^{T} \frac{1}{M_j} \sum_{m=1}^{M} \sum_{t=H^m_j}^{H-1} \gamma^{t - H^m_j} \phi(b^m_t, a^m_t), \quad (19)$$
where H^m_j is the first time that b_j is found in the m-th trajectory and M_j is the number of trajectories that contain b_j.

Given the above definitions, the rest of the derivation is fairly straightforward, and leads to an algorithm similar to that of Ng and Russell (2000). The algorithm is shown in Algorithm 4. It iteratively tries to find a reward function parameterized by α that maximizes the sum of the margins between the value V̂^{π_E} of the expert's policy and the value V^π of each FSC policy π ∈ Π found so far by the algorithm at all the unique beliefs b ∈ B_{π_E} in the trajectories. We could consider the initial belief b_0 alone, similar to Ng and Russell (2000) considering the initial state s_0 alone. However, we found it more effective in our experiments to include additional beliefs, since they often provide better guidance in the search of the reward function by tightening the feasible region. In order to consider the additional beliefs, we should be able to compute the value V^π of the intermediate policy π at belief b ∈ B_{π_E}, but it is not well defined: b may be unreachable under π and it is not known at which node of π we will visit b. In our work, we use an upper bound approximation given as
$$V^{\pi}(b) \approx \max_{n} V^{\pi}(n,b), \quad (20)$$
where V^π(n,b) is computed by Equation (7).

The IRL step in line 4 finds the reward function that guarantees the optimality of the expert's policy. In the optimization problem, we constrain the value of the expert's policy to be greater than that of other policies in order to ensure that the expert's policy is optimal, and maximize the sum of the margins between the expert's policy and other policies using a monotonically increasing function p.(5) In addition, we prefer a sparse reward function, and the sparsity of the learned reward function can be achieved by tuning the penalty weight λ. Note that we can solve the IRL step in Algorithm 4 using LP since all the variables such as V̂^{π_E} and V^π are linear functions in terms of α from Equations (18), (19), and (20). When π_{k+1} matches π_E, the differences in the value functions for all beliefs will vanish. Hence, the algorithm terminates when all the differences in the values are below the threshold ε, or the iteration number has reached the maximum number of steps MaxIter, to terminate the algorithm in a finite number of iterations.

5. We simply choose p(x) = x if x > 0 and p(x) = 2x if x <= 0 as in Ng and Russell (2000). This gives more penalty to violating the optimality of the expert's policy.

Algorithm 4: IRL for POMDP\R from the sampled trajectories using the MMV method.
  Input: POMDP\R ⟨S, A, Z, T, O, b_0, γ⟩, basis functions φ, M trajectories
  1: Choose a set B_{π_E} of all the unique beliefs in the trajectories.
  2: Choose a random initial policy π_1 and set Π = {π_1}.
  3: for k = 1 to MaxIter do
  4:   Find α̂ by solving the linear program:
         maximize_{α̂}  Σ_{π ∈ Π} Σ_{b ∈ B_{π_E}} p( V̂^{π_E}(b) - V^{π}(b) ) - λ ‖α̂^T φ‖_1
         subject to  |α̂_i| <= 1,  i = 1, 2, ..., d
  5:   Compute an optimal policy π_{k+1} for the POMDP with R̂ = α̂_k^T φ.
  6:   if |V̂^{π_E}(b) - V^{π_{k+1}}(b)| <= ε, ∀b ∈ B_{π_E} then
  7:     return R̂ = α̂_k^T φ
  8:   else
  9:     Π = Π ∪ {π_{k+1}}
  10:  end if
  11: end for
  12: K = argmin_{k : π_k ∈ Π} max_{b ∈ B_{π_E}} |V̂^{π_E}(b) - V^{π_k}(b)|
  13: return R̂ = α̂_K^T φ
  Output: the reward function R̂

6.2 Max-Margin between Feature Expectations (MMFE) Method

We can re-write the value of an FSC policy π in POMDPs using the feature expectation μ(π), proposed by Abbeel and Ng (2004), as follows:
$$V^{\pi}(b_0) = E\Big[\sum_{t=0}^{\infty} \gamma^t R(b_t, a_t) \,\Big|\, \pi, b_0\Big] = E\Big[\sum_{t=0}^{\infty} \gamma^t \alpha^{T} \phi(b_t, a_t) \,\Big|\, \pi, b_0\Big] = \alpha^{T} E\Big[\sum_{t=0}^{\infty} \gamma^t \phi(b_t, a_t) \,\Big|\, \pi, b_0\Big] = \alpha^{T} \mu(\pi),$$
where μ(π) = E[ Σ_{t=0}^{∞} γ^t φ(b_t, a_t) | π, b_0 ], and it is assumed that ‖α‖_1 <= 1 to bound R_max by 1. In order to compute the feature expectation μ(π) exactly, we define the occupancy distribution occ_π(s, n) of the policy π, which represents the relative frequency of visiting state s at node n when following the policy π = ⟨ψ, η⟩ and starting from belief b_0 and node n_0. It can be calculated by solving the following system of linear equations:
$$occ_{\pi}(s', n') = b_0(s')\, \delta_{n', n_0} + \gamma \sum_{s, z, n} occ_{\pi}(s, n)\, T(s, \psi(n), s')\, O(s', \psi(n), z)\, \delta_{n', \eta(n,z)}, \quad \forall s' \in S, \forall n' \in N,$$
where δ_{x,y} denotes the Kronecker delta function, defined as δ_{x,y} = 1 if x = y and δ_{x,y} = 0 otherwise. With the occupancy distribution, the value of the policy π can be computed by
$$V^{\pi}(b_0) = \sum_{s,n} occ_{\pi}(s,n) R(s, \psi(n)) = \sum_{s,n} occ_{\pi}(s,n)\, \alpha^{T} \phi(s, \psi(n)) = \alpha^{T} \mu(\pi),$$
where μ(π) = Σ_{s,n} occ_π(s,n) φ(s, ψ(n)).

However, the feature expectation of the expert's policy π_E cannot be exactly computed, because we only have the set of trajectories on the belief space, which are recovered from the given trajectories of the actions and the observations, instead of the explicit FSC form of the expert's policy. Hence, we estimate the expert's feature expectation μ(π_E) = μ_E empirically by
$$\hat{\mu}_E = \frac{1}{M} \sum_{m=1}^{M} \sum_{t=0}^{H-1} \gamma^t \phi(b^m_t, a^m_t).$$
From these definitions, we can derive the following inequalities, which are similar to Equation (12):
$$|V^{\pi_E}(b_0) - V^{\pi}(b_0)| = |\alpha^{T} \mu_E - \alpha^{T} \mu(\pi)| \leq \|\alpha\|_2 \|\mu_E - \mu(\pi)\|_2 \leq \|\mu_E - \mu(\pi)\|_2. \quad (21)$$
The last inequality holds since we assume ‖α‖_1 <= 1. The above inequalities state that the difference between the expert's policy π_E and any policy π is bounded by the difference between their feature expectations, which is the same result as in Abbeel and Ng (2004).

Based on Equation (21), we can easily extend Algorithm 2 to address the IRL problem for POMDP\R from the sampled trajectories. The algorithm is presented in Algorithm 5. While we can solve Algorithm 4 using LP, this algorithm requires a QCP solver, since the optimization problem in line 9 has a 2-norm constraint on α. Note that it is proved that the algorithm will terminate in a finite number of iterations in Abbeel and Ng (2004).

Algorithm 5: IRL for POMDP\R from the sampled trajectories using the MMFE method.
  Input: POMDP\R ⟨S, A, Z, T, O, b_0, γ⟩, basis functions φ, M trajectories
  1: Choose a random initial weight α_0.
  2: Π = ∅, W = ∅, and t = ∞.
  3: for k = 1 to MaxIter do
  4:   Compute an optimal policy π_{k-1} for the POMDP with R = α_{k-1}^T φ.
  5:   Π = Π ∪ {π_{k-1}} and W = W ∪ {α_{k-1}}.
  6:   if t <= ε then
  7:     break
  8:   end if
  9:   Solve the following optimization problem:
         maximize_{t, α_k}  t
         subject to  α_k^T μ_E >= α_k^T μ(π) + t,  ∀π ∈ Π;  ‖α_k‖_2 <= 1
  10: end for
  11: K = argmin_{k : π_k ∈ Π} ‖μ_E - μ(π_k)‖_2
  12: return R = α_K^T φ
  Output: the reward function R

Abbeel and Ng (2004) construct a policy by mixing the policies found by the algorithm in order to find the policy that is as good as the given expert's policy. They choose the weight of the policies by computing the convex combination of feature expectations that minimizes the distance to the expert's feature expectation. However, this method cannot be adapted to our IRL algorithm, because there is no way to recover the reward function that produces the computed mixed policy. Thus, we return the reward function that yields the closest feature expectation to that of the expert's policy among the intermediate reward functions found by the algorithm. By Equation (21), the value of the policy that generates the closest feature expectation is assured to be similar to the value of the expert's policy, and we hope that the reward function that yields the closest feature expectation will be similar to the reward function that the expert is optimizing.

6.3 Projection (PRJ) Method

In the previous section, we described the IRL algorithm for POMDP\R from the sampled trajectories using QCP. We can now address the problem using a simpler method, as Abbeel and Ng (2004) proposed. The IRL step in Algorithm 5 can be considered as finding the unit vector μ_k orthogonal to the maximum margin hyperplane that classifies feature expectations into two sets: One set consists of the expert's feature expectation and the other set consists of the feature expectations of the policies found by the algorithm. The unit vector μ_k can then be approximately computed by projecting the expert's feature expectation onto the line between the feature expectation of the most recent policy and the previously projected point. The algorithm is shown in Algorithm 6. In the algorithm, μ_i denotes μ(π_i) for all i, and μ̄ denotes the point where the expert's feature expectation is projected. Similar to Algorithm 5, the algorithm returns the reward function that yields the closest feature expectation to that of the expert's policy among the intermediate reward functions found by the algorithm.

Algorithm 6: IRL for POMDP\R from the sampled trajectories using the PRJ method.
  Input: POMDP\R ⟨S, A, Z, T, O, b_0, γ⟩, basis functions φ, M trajectories
  1: Choose a random initial weight α_0.
  2: Compute an optimal policy π_0 for the POMDP with R = α_0^T φ.
  3: Π = {π_0}, W = {α_0}, μ̄_0 = μ_0 and t = ∞.
  4: for k = 1 to MaxIter do
  5:   α_k = μ_E - μ̄_{k-1}.
  6:   Compute an optimal policy π_k for the POMDP with R = α_k^T φ.
  7:   Π = Π ∪ {π_k} and W = W ∪ {α_k}.
  8:   if t <= ε then
  9:     break
  10:  end if
  11:  Compute an orthogonal projection of μ_E onto the line through μ̄_{k-1} and μ_k:
         μ̄_k = μ̄_{k-1} + [ (μ_k - μ̄_{k-1})^T (μ_E - μ̄_{k-1}) / (μ_k - μ̄_{k-1})^T (μ_k - μ̄_{k-1}) ] (μ_k - μ̄_{k-1})
  12:  t = ‖μ_E - μ̄_k‖_2
  13: end for
  14: K = argmin_{k : π_k ∈ Π} ‖μ_E - μ_k‖_2
  15: return R = α_K^T φ
  Output: the reward function R
ministic.Theagentalwaysstartsfromthenorth-westcornerofthegridandthegoalisatthesouth-eastcorner.Aftertheagentreachesthegoalstate,theagentrestartsfromthestartstatebyexecutinganyactioninthegoalstate.Thecurrentpositioncannotbeobserveddirectlybutthepresenceoftheadjacentwallscanbeperceivedwithoutnoise.Hence,therearenineobservations,eightofthemcorrespondingtoeightpossiblecongurationsofthenearbywallswhenontheborder(N,S,W,E,NW,NE,SW,andSE),andonecorrespondingtonowallobservationwhennotontheborder(Null).TheHeaven/Hellproblem(GeffnerandBonet,1998)isanavigationproblemoverthestatesdepictedinthethirdpanelofFigure4.Thegoalstateiseitherposition4or6.Oneoftheseisheavenandtheotherishell.Whentheagentreachesheaven,itreceivesareward(+1).Whenitreacheshell,itreceivesapenalty(-1).Itstartsatposition0,anddoesnotknowthepositionofheaven.However,itcangettheinformationaboutthepositionofheavenaftervisitingthepriestatposition9.Theagentalwaysperceivesitscurrentpositionwithoutanynoise.Afterreachingheavenorhell,itismovedattheinitialposition.TheRockSampleproblem(SmithandSimmons,2004)modelsaroverthatmovesaroundanareaandsamplesrocks.Thelocationsoftheroverandtherocksareknown(therocksaremarkedwithstarsinthefourthpanelofFigure4),butthevalueoftherocksareunknown.Ifitsamplesagoodrock,itreceivesareward(+10),butifitsamplesabadrock,itreceivesapenalty(-10).When714 IRLINPARTIALLYOBSERVABLEENVIRONMENTS ProblemjSjjAjjZjjjj[n2NBnj Tiger2320.75551dMaze4220.753455GridWorld25490.90213Heaven/Hell204110.991819RockSample[4,3]129820.951622 Table4:Characteristicsoftheproblemdomainsusedintheexperiments.g:Thediscountfactor.jNj:Thenumberofnodesintheoptimalpolicy.j[n2NBnj:Thetotalnumberofbeliefsreachablebytheoptimalpolicy.    \n   \n   \n \r ! "#$%&Figure4:Mapsforthe1dMaze,55GridWorldHeaven/Hell,andRockSample[4,3]problems.therovertriestosampleatthelocationwithoutanyrocks,itreceivesalargepenalty(-100).Therovercanobservethevalueoftherockswithanoisylongrangesensor.Inaddition,itgetsareward(+1)ifitreachestherightsideofthemap.Whenitreachesothersidesofthemap,itgetsalargepenalty(-100).Theroverisimmediatelymovedtothestartpositionwhenittraversesoutsideofthemap.TheRockSampleproblemisinstantiatedasRockSample[n;k],whichdescribesthatthesizeofthemapisnnandthenumberoftherocksonthemapisk,andourexperimentwasperformedonRockSample[4,3].ToevaluatetheperformanceoftheIRLalgorithms,wecouldnaivelycomparethetruerewardfunctionsintheoriginalproblemstotherewardfunctionsfoundbythealgorithms.However,itisnotonlydifcultbutalsomeaninglesstosimplycomparethenumericalvaluesoftherewardfunctions,sincetherewardfunctionrepresentstherelativeimportanceofexecutinganactioninastate.Completelydifferentbehaviorsmaybederivedfromtworewardfunctionsthathaveasmalldifference,andanidenticaloptimalpolicymaybeinducedbytworewardfunctionsthathavealargedifference.Forexample,threerewardfunctionsintheTigerproblemarepresentedinTable5,whereRisthetruerewardfunctionandR1andR2aretworewardfunctionschosenforexplainingthephenomenon.WhenthedistancesaremeasuredbyL2norm,Dist(R;R)=kRRk2=r ås2S;a2A(R(s;a)R(s;a))2;therewardfunctionR2ismoresimilartoRthantherewardfunctionR1.However,asshowninFigure5,theoptimalpoliciesforRandR1areexactlythesamewhiletheoptimalpolicyforR2715 CHOIANDKIM ListenSuccessFailureDist(R;R)Vp(b0;R) R-110-10001.93R1-15.68-1006.101.93R23.0510-1005.731.02 
Table5:ThreerewardfunctionsintheTigerproblem.Risthetruerewardfunction.Listen:Thenegativecostoflistening.Success:Therewardofopeningthecorrectdoor.Failure:Thenegativepenaltyofchoosingthedoorwiththetiger.Dist(R;R):Thedistancefromthetruerewardfunctions.Vp(b0R):Thevalueoftheoptimalpolicyforeachrewardfunctionmeasuredonthetruerewardfunction. Figure5:OptimalpoliciesfortherewardfunctionsinTable5.Thenodesarelabeledwithactions(Listen,OL:Open-left,OR:Open-right).Theedgesarelabeledwithobservations(TL:Tiger-left,TR:Tiger-right).Left:TheoptimalpolicyforRandR1Right:TheoptimalpolicyforR2isdifferentfromthatforR.Ifwestillwanttodirectlyevaluatethecomputedrewardfunctionusingadistancemeasure,wecouldapplythepolicy-invariantrewardtransformationonthetruerewardfunctionandcomputetheminimumdistance,butitisnon-trivialtodososincethereisaninnitenumberoftransformationstochoosefromincludingthepositivelineartransformationandthepotential-basedshaping(Ngetal.,1999).Therefore,wecomparethevaluefunctionsoftheoptimalpoliciesinducedfromthetrueandlearnedrewardfunctionsinsteadofdirectlymeasuringthedistancebetweentherewardfunctions.Theperformanceofthealgorithmsareevaluatedbythedifferencesinthevaluesoftheexpert'spolicyandtheoptimalpolicyforthelearnedrewardfunction.Intheevaluations,thevalueofeachpolicyismeasuredonthetruerewardfunctionRandthelearnedrewardfunctionRL,andwedenethevalueVp(b0R)ofapolicypattheinitialbeliefb0measuredonarewardfunctionRasVp(b0R)=ås2Sb0(s)Vp(n0;sR);wheren0isthestartingnodeofapolicypandVp(n0;sR)iscomputedbyEquation(5)usingtherewardfunctionR716 IRLINPARTIALLYOBSERVABLEENVIRONMENTS ProblemD(R)D(RL)jnewjTime Q-IRLD-IRLW-IRLQ-IRLD-IRLW-IRL Tiger0037575390.070.040.031dMaze005418120.020.020.0255GridWorld0040962048104478.063.001.54Heaven/Hell004:6310152:5710143260n.a.n.a.6.20RockSample[4,3]13.42032768204863477.9919.203.77 
Problem           D(R)   D(R_L)   |N_new| (Q-IRL / D-IRL / W-IRL)        Time in s (Q-IRL / D-IRL / W-IRL)
Tiger               0       0     375 / 75 / 39                          0.07 / 0.04 / 0.03
1d Maze             0       0     54 / 18 / 12                           0.02 / 0.02 / 0.02
5×5 Grid World      0       0     4096 / 2048 / 104                      478.06 / 3.00 / 1.54
Heaven/Hell         0       0     4.63×10^15 / 2.57×10^14 / 3260         n.a. / n.a. / 6.20
RockSample[4,3]   13.42     0     32768 / 2048 / 634                     77.99 / 19.20 / 3.77

Table 6: Results of IRL for POMDP\R from FSC policies (n.a. = not applicable). Q-IRL, D-IRL, and W-IRL respectively denote the Q-function based approach, the DP update based approach, and the witness theorem based approach. D(R) = |V^{π_E}(b_0; R) − V^{π_L}(b_0; R)| and D(R_L) = |V^{π_E}(b_0; R_L) − V^{π_L}(b_0; R_L)|. |N_new| denotes the number of newly generated policies. The average computation times are reported in seconds.

Our algorithm requires a POMDP solver for computing the expert's policy and the intermediate optimal policies of the learned rewards. Since we assume the policy is in the form of an FSC, we use PBPI (Ji et al., 2007), which finds an optimal FSC policy approximately on the reachable beliefs. Optimization problems formulated in LP and QCP are solved using ILOG CPLEX. The experiments are organized into two cases according to the representation of the expert's policy. In the first case, the expert's policy is explicitly given in the form of an FSC, and in the second case, the trajectories of the expert's executed actions and the corresponding observations are given instead.

7.1 Experiments on IRL from FSC Policies

The first set of experiments concerns the case in which the expert's policy is explicitly given using the FSC representation. We experimented with all three approaches in Section 5: the Q-function based approach, the DP update based approach, and the witness theorem based approach. As in the case of IRL for MDP\R, we were able to control the sparseness of the reward function by tuning the penalty weight λ. With a suitable value for λ, all three approaches yielded the same reward function (see footnote 6). A summary of the experiments is given in Table 6. Since the Heaven/Hell problem has a larger number of observations than the other problems, and the Q-function and the DP update based approaches generate exponentially many new policies with respect to the number of observations, the optimization problems of the Q-function and the DP update based approaches were not able to handle the Heaven/Hell problem. Hence, the Heaven/Hell problem could only be solved by the witness theorem based approach. Also, the witness theorem based approach was able to solve the other problems more efficiently than the Q-function based approach and the DP update based approach.
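To make the role of the penalty weight λ explicit: the optimization problems of Section 5 follow the Ng-and-Russell-style heuristic recalled in Section 9.2, so their objective has, schematically, the form below. This display is only a summary of that structure under stated assumptions; the exact set of comparison policies, the beliefs over which the first term is accumulated, and the feasible region are the constraint sets derived in Section 5.

\[
\max_{R}\;\; \sum_{\pi' \neq \pi_E} \Bigl( V^{\pi_E}(b_0; R) - V^{\pi'}(b_0; R) \Bigr) \;-\; \lambda \,\lVert R \rVert_1
\qquad \text{subject to the optimality constraints on } \pi_E ,
\]

so a larger λ drives more entries of R to zero (a sparser reward), while an overly large λ collapses the solution to the degenerate case R = 0 mentioned in footnote 6.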
6. With any value of λ, the reward functions computed by all the proposed optimization problems should guarantee the optimality of the expert's policy, except for the degenerate case R = 0 due to an overly large value of λ. However, we observed that the optimality of our solutions is often subject to numerical errors in the optimization, which is an interesting issue for future studies.

Figure 6: Comparison of the true and learned reward functions and the expert's policy in the 1d Maze problem (per-action reward bars for the states Left, Middle, Goal, and Right). Black bars: The true reward function. White bars: The learned reward function.

In Table 6, D(R) = |V^{π_E}(b_0; R) − V^{π_L}(b_0; R)| is the difference between the values of the expert's policy π_E and the optimal policy π_L for the learned reward, both measured on the true reward function R; D(R_L) = |V^{π_E}(b_0; R_L) − V^{π_L}(b_0; R_L)| is the difference between the values measured on the learned reward function R_L. The differences measured on the true reward function in the Tiger, 1d Maze, 5×5 Grid World, and Heaven/Hell problems are zero, meaning that the learned reward function generated a policy whose performance is the same as that of the expert's policy. However, our algorithms failed to find a reward that generates a policy that is optimal on the true reward in RockSample[4,3]. Nevertheless, we can say that the learned reward function R_L satisfies the optimality of the expert's policy π_E, since the policy π_L is an optimal policy on the learned reward function R_L and |V^{π_E}(b_0; R_L) − V^{π_L}(b_0; R_L)| = 0. Thus, the reason for our algorithms' failure in RockSample[4,3] might be that the objective functions in the optimization problems are not well formulated to choose an appropriate reward function that yields a policy similar to the expert's, among the infinitely many reward functions in the space specified by the constraints of the optimization problems. We further discuss the details of the results from each problem below. The learned reward functions are compared to the true reward functions for the Tiger, 1d Maze, 5×5 Grid World, and Heaven/Hell problems, but the reward function in the RockSample[4,3] problem is omitted since it has too many elements to present.

In the Tiger problem, the true and learned reward functions are respectively represented as R and R1 in Table 5. The true reward function is not sparse: every action is associated with a non-zero reward. Since our methods favor sparse reward functions, there is some degree of difference between the true and the learned reward functions, most notably for the listen action, where our methods assign a zero reward instead of -1 as in the true reward. However, we can apply the policy-invariant reward transformation (Ng et al., 1999) to the learned reward function so that the listen action yields a -1 reward. R1 is the transformed learned reward function. It is close to the true reward function and produces an optimal policy whose value is equal to the value of the expert's policy when measured on the true reward function.

For the 1d Maze problem, the learned reward function is compared to the true reward function in the left panel of Figure 6 and the expert's policy is presented in the right panel of Figure 6. The expert's policy has three nodes: node n2 (the starting node) chooses to move right, and changes to node n1 upon observing Nothing or to node n0 upon observing Goal; node n1 chooses to move right and always changes to node n0; node n0 chooses to move left, and changes to node n2 upon observing Nothing or to itself upon observing Goal. Following the expert's policy, moving left is always executed after perceiving the goal state. This causes the algorithms to assign a positive reward to moving left in the goal state, as in the true reward, but a zero reward to moving right in the goal state, unlike the true reward. Consequently, the algorithms find a reward function that explains the behavior of the expert's policy, and the optimal policy of the POMDP with respect to the learned reward function is the same as the expert's policy.

Figure 7: Comparison of the true and the learned reward functions and the expert's policy in the 5×5 Grid World problem (per-action reward bars for the states s13, s17, s18, and s24). Black bars: The true reward. White bars: The learned reward.

In the 5×5 Grid World problem, the expert's policy is simple, as depicted in the right panel of Figure 7: the agent alternates moving south and east from the start, visiting the states in the diagonal positions (i.e., {s0, s5, s6, s11, s12, s17, s18, s23, s24} and {s0, s1, s6, s7, s12, s13, s18, s19, s24}). The learned reward function is presented with the true reward function in the left panel of Figure 7. Our methods assign a small positive reward to moving south in states 13 and 18 and to moving east in states 17 and 18. Also, the reward for moving south and east in state 24 is assigned +1 for reaching the goal. The learned reward function closely reflects the behavior of the given expert's policy. Again, even though the learned reward function is different from the true one, it yields the same optimal policy.

Figure 8: Learned reward function in the Heaven/Hell problem. Black arrow: +1 reward for moving in the direction of the arrow in each state. Blank grid: zero reward for all actions in each state.

Finally, in the Heaven/Hell problem, the true reward function is +1 for states 4 and 16 being heaven, and -1 for states 6 and 14 being hell. The learned reward is presented in Figure 8, where the agent gets a +1 reward when moving in the direction of the arrow in each state. The learned reward function exactly describes the behavior of the expert, which first visits the priest in states 9 and 19, starting from states 0 and 10, to acquire the position of heaven, and then moves to heaven in states 4 and 16. As shown in Table 6, the learned reward function in the Heaven/Hell problem also yields a policy whose value is equal to that of the expert's policy.

7.2 Experiments on IRL from Sampled Trajectories

The second set of experiments involves the case when the expert's trajectories are given. We experimented on the same set of five problems with all three approaches in Section 6: the max-margin between values (MMV), the max-margin between feature expectations (MMFE), and the projection (PRJ) methods. In this section, the reward function is assumed to be linearly parameterized with the basis functions, and we prepare four sets of basis functions to examine the effect of the choice of basis functions on the performance of the algorithms (a minimal construction of the indicator-style sets is sketched after this list):

- Compact: The set of basis functions that captures the necessary pairs of states and actions to represent the structure of the true reward function. Let Φ = {Φ_0, Φ_1, ..., Φ_N} be a partition of S×A such that all (s,a) ∈ Φ_i have the same reward value R(s,a). The compact basis functions for the partition Φ are defined such that the i-th basis function φ_i(s,a) = 1 if (s,a) ∈ Φ_i and φ_i(s,a) = 0 otherwise.

- Non-compact: The set of basis functions that includes all the compact basis functions and some extra redundant basis functions. Each basis function φ_i is associated with some set of state-action pairs as above.

- State-wise: The set of basis functions that consists of the indicator functions for each state. The i-th basis function is defined as φ_i(s) = δ_{s′}(s), where s′ is the i-th state and δ is the Kronecker delta (δ_i(j) = 1 if i = j and δ_i(j) = 0 otherwise).

- State-action-wise: The set of basis functions that consists of the indicator functions for each pair of state and action. The i-th basis function is defined as φ_i(s,a) = δ_{(s′,a′)}(s,a), where (s′,a′) is the i-th pair of state and action.

For small problems, such as the Tiger, 1d Maze, and 5×5 Grid World problems, we experimented with the state-action-wise basis functions.
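The sketch below builds the state-wise and state-action-wise indicator sets and assembles the linearly parameterized reward R(s,a) = Σ_i α_i φ_i(s,a). The function and variable names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def state_wise_basis(num_states, num_actions):
    """phi_i(s, a) = 1 iff s is the i-th state (one indicator per state)."""
    phi = np.zeros((num_states, num_states, num_actions))
    for i in range(num_states):
        phi[i, i, :] = 1.0
    return phi                                   # shape: (num_basis, |S|, |A|)

def state_action_wise_basis(num_states, num_actions):
    """phi_i(s, a) = 1 iff (s, a) is the i-th state-action pair."""
    phi = np.zeros((num_states * num_actions, num_states, num_actions))
    i = 0
    for s in range(num_states):
        for a in range(num_actions):
            phi[i, s, a] = 1.0
            i += 1
    return phi

def reward_from_weights(phi, alpha):
    """Linearly parameterized reward R(s, a) = sum_i alpha_i * phi_i(s, a)."""
    return np.tensordot(alpha, phi, axes=1)      # shape: (|S|, |A|)

# Example: the 80 state-action-wise basis functions used for Heaven/Hell (20 states, 4 actions).
phi_sa = state_action_wise_basis(20, 4)
R = reward_from_weights(phi_sa, np.zeros(len(phi_sa)))
```

A compact or non-compact set is built the same way, except that each φ_i is the indicator of one block Φ_i of state-action pairs such as those listed in Tables 7 and 8.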
For the two larger problems, three sets of basis functions are selected. For the Heaven/Hell problem, the first set consists of the compact set of basis functions; Table 7 shows the set Φ_i of pairs of states and actions for each basis function. The second set consists of the state-wise basis functions and the third set consists of the state-action-wise basis functions. For the RockSample[4,3] problem, the first set consists of the compact set of basis functions; the left side of Table 8 shows the set Φ_i of pairs of states and actions for each basis function. The second set consists of the non-compact set of basis functions, including the redundant functions that represent the rover using its sensor (φ_10), moving on the map (φ_11), sampling at some locations without rocks (φ_12–φ_15), and sampling at the rest of the locations (φ_16); the right side of Table 8 presents the sets of pairs of states and actions for the non-compact basis functions. The third set consists of the state-action-wise basis functions.

Φ_i    States                        Actions
Φ_0    s4                            any
Φ_1    s16                           any
Φ_2    s6                            any
Φ_3    s14                           any
Φ_4    S \ {s4, s6, s14, s16}        any

Table 7: Sets of state-action pairs for the compact set of basis functions in the Heaven/Hell problem. The states s4 and s16 represent heaven and the states s6 and s14 represent hell.

Compact set (left side of Table 8):
Φ_i     States                            Actions
Φ_0     x = 0                             Move west
Φ_1     x = 3                             Move east
Φ_2     y = 0                             Move south
Φ_3     y = 3                             Move north
Φ_4     ⟨x,y⟩ = L_0, r_0 = true           Sample
Φ_5     ⟨x,y⟩ = L_0, r_0 = false          Sample
Φ_6     ⟨x,y⟩ = L_1, r_1 = true           Sample
Φ_7     ⟨x,y⟩ = L_1, r_1 = false          Sample
Φ_8     ⟨x,y⟩ = L_2, r_2 = true           Sample
Φ_9     ⟨x,y⟩ = L_2, r_2 = false          Sample
Φ_10    ⟨x,y⟩ ∉ {L_i, ∀i}                 Sample
Φ_11    The remaining state-action pairs

Non-compact set (right side of Table 8):
Φ_i       States                            Actions
Φ_0–Φ_9   Same as in the compact set
Φ_10      any                               Use the sensor
Φ_11      any                               Move
Φ_12      ⟨x,y⟩ = L′_0                      Sample
Φ_13      ⟨x,y⟩ = L′_1                      Sample
Φ_14      ⟨x,y⟩ = L′_2                      Sample
Φ_15      ⟨x,y⟩ = L′_3                      Sample
Φ_16      ⟨x,y⟩ ∉ {L_i, ∀i; L′_j, ∀j}       Sample
Φ_17      The remaining state-action pairs

Table 8: Sets of state-action pairs for the compact (left) and non-compact (right) sets of basis functions in the RockSample[4,3] problem. ⟨x,y⟩ denotes the location of the rover. L_i is the location of the i-th rock. L′_i is a randomly chosen location without rocks. r_i is the Boolean variable representing whether the i-th rock is good or not.

For each experiment, we sampled 2000 belief trajectories. Each trajectory is truncated after a large finite number H of time steps. If we truncate the trajectories after H_ε = log_γ(ε(1−γ)/R_max) time steps, the error in estimating the value is no greater than ε. Table 9 shows the number of time steps for each problem.

Problem            # of steps   V^{π_E}(b_0; R)
Tiger                  20           1.93
1d Maze                20           1.02
5×5 Grid World         50           0.70
Heaven/Hell           300           8.64
RockSample[4,3]       200          21.11

Table 9: Configuration for each problem and the value of the expert's policy measured on the true reward function.
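A back-of-envelope check of this truncation bound, taking R_max to be the largest reward magnitude (an assumed convention) and using the discounted tail Σ_{t≥H} γ^t R_max = γ^H R_max/(1−γ) ≤ ε as the error criterion:

```python
import math

def truncation_horizon(gamma, r_max, eps):
    """Smallest integer H with gamma**H * r_max / (1 - gamma) <= eps,
    i.e. H >= log_gamma(eps * (1 - gamma) / r_max)."""
    return math.ceil(math.log(eps * (1 - gamma) / r_max, gamma))

# With gamma = 0.75 and R_max = 100 (the Tiger magnitudes in Table 5),
# a tolerance of eps ~ 1.3 already gives H = 20, the horizon reported for Tiger in Table 9.
print(truncation_horizon(0.75, 100.0, 1.3))   # 20
```

Under the same γ and R_max, tightening ε to 0.01 would push the horizon to roughly 37 steps.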
As in the previous section, we compare V^{π_L}(b_0; R) at each iteration, which is the value of the policy π_L from the learned reward function R_L evaluated on the true reward function R. The results are shown in Figure 9. All the algorithms found a reward function that generates a policy close to the expert's policy in the small problems, that is, the Tiger, 1d Maze, and 5×5 Grid World problems. They also converged to the optimal value in a few iterations when using the compact set of basis functions in the two larger problems, that is, the Heaven/Hell and RockSample[4,3] problems. However, more iterations were required to converge when other sets of basis functions were used. This is due to the fact that a larger number of basis functions induces a larger search space. In the Heaven/Hell problem, the MMV method converged to a sub-optimal solution using the state-wise basis functions, although the true reward function can be represented exactly using the state-wise basis functions. The MMV method had no such issue when using the state-action-wise basis functions. In the RockSample[4,3] problem, the MMV method also converged to a sub-optimal solution using the state-action-wise basis functions with 1024 basis functions, most of them being redundant since there are only 12 basis functions in the compact set. Hence, the MMV method is sensitive to the selection of basis functions, whereas the MMFE and PRJ methods robustly yield optimal solutions. Our reasoning on this phenomenon is given at the end of this subsection. Meanwhile, the value of the learned policies tends to oscillate in the beginning of the learning phase, particularly in the Tiger and RockSample[4,3] problems, since our methods are not guaranteed to improve monotonically and are hence prone to yielding poor intermediate reward functions. However, these poor intermediate reward functions will effectively restrict the region of the reward functions for the final result.

We summarize V^{π_L}(b_0; R) returned at the end of the algorithms, and the computation time for each trial along with the computation time for solving the intermediate POMDPs, in Table 10. As noted above, in most of the experiments the algorithms eventually found a policy whose performance is the same as the expert's, which means the algorithms found a reward function that successfully recovers the expert's policy. The computation time increased as the number of basis functions and the size of the problems increased. When the state-action-wise basis functions were applied to the RockSample[4,3] problem, it took about 8 hours on average for the MMV method to converge. However, the larger portion of the computation time was spent on solving the intermediate POMDPs. The average percentage of the time spent on solving the intermediate POMDPs was 78.83%.

Problem           φ     |φ|    V^{π_L}(b_0; R) (MMV / MMFE / PRJ)   Time in s (MMV / MMFE / PRJ)
Tiger             SA      6    1.79 / 1.93 / 1.93                   10.04 (72.27) / 7.04 (41.56) / 3.97 (96.33)
1d Maze           SA      8    1.02 / 1.02 / 1.02                   0.88 (75.07) / 5.18 (10.83) / 0.71 (82.13)
5×5 Grid World    SA    100    0.70 / 0.70 / 0.70                   24.10 (95.11) / 20.07 (96.88) / 21.49 (98.16)
Heaven/Hell       C       5    8.49 / 8.64 / 8.64                   18.54 (63.02) / 11.80 (79.75) / 8.99 (88.66)
                  S      20    5.70 / 8.64 / 8.64                   375.03 (96.48) / 332.97 (98.51) / 937.59 (99.75)
                  SA     80    8.47 / 8.64 / 8.64                   443.57 (98.31) / 727.87 (99.30) / 826.37 (99.68)
RockSample[4,3]   C      11    20.84 / 20.05 / 20.38                8461.65 (99.16) / 8530.18 (52.03) / 10399.61 (59.86)
                  NC     17    20.83 / 20.62 / 20.16                21438.83 (89.88) / 10968.81 (25.05) / 27808.79 (79.41)
                  SA   1024    -26.42 / 17.83 / 19.05               31228.85 (72.45) / 13486.41 (78.25) / 16351.59 (80.57)

Table 10: Results of IRL for POMDP\R from sampled trajectories. The sets of basis functions are denoted by C (compact), NC (non-compact), S (state-wise), and SA (state-action-wise). The average computation time for each trial is reported in seconds, and the numbers in parentheses next to the computation times are the percentages of the time taken by the POMDP solver.

The third set of experiments was conducted to examine the performance of the algorithms as the number of sampled belief trajectories varied. We experimented with the MMV, MMFE, and PRJ methods in the Tiger problem. Each trajectory was truncated after 20 time steps. Figure 10 presents the results, where the value of the policy is measured by V^{π_L}(b_0; R). The MMFE method required fewer trajectories to attain a policy that performs close to the expert's than the MMV and PRJ methods required. The performance of the PRJ method was the worst when given
few sampled trajectories, but it improved quickly as the number of trajectories increased. However, the MMV method needed many trajectories to find a near-optimal solution.

We conclude this subsection with our reasoning on why the MMFE and PRJ methods typically outperform the MMV method. The MMFE and PRJ methods directly use the differences in feature expectations (line 11 in Algorithm 5 and line 14 in Algorithm 6), whereas the MMV method uses the differences in values obtained from the weight vectors and feature expectations (line 12 in Algorithm 4). Using the differences in values can be problematic because it is often possible that a weight vector very different from the true one yields a very small difference in values. Hence, it is preferable to directly use the differences in feature expectations, since this still bounds the differences in values without depending on the weight vectors.
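The bound invoked here can be spelled out in one line. Assuming, as in the standard feature-expectation formulation, that the reward is R_w(s,a) = w^⊤ φ(s,a) with ‖w‖_2 ≤ 1 and that μ^π(b_0) denotes the discounted feature expectation of policy π from the initial belief, the Cauchy–Schwarz inequality gives

\[
\bigl| V^{\pi_E}(b_0; R_w) - V^{\pi_L}(b_0; R_w) \bigr|
 = \bigl| w^{\top}\!\bigl( \mu^{\pi_E}(b_0) - \mu^{\pi_L}(b_0) \bigr) \bigr|
 \le \lVert w \rVert_2 \, \bigl\lVert \mu^{\pi_E}(b_0) - \mu^{\pi_L}(b_0) \bigr\rVert_2
 \le \bigl\lVert \mu^{\pi_E}(b_0) - \mu^{\pi_L}(b_0) \bigr\rVert_2 ,
\]

so shrinking the feature-expectation gap controls the value gap uniformly over admissible weight vectors, whereas a small value gap under one particular w says little about other weight vectors. (The norm bound on w is an assumption of this sketch; the algorithms' exact constraint is the one stated in Section 6.)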
Figure 9 (nine panels: Tiger (SA), 1d Maze (SA), 5×5 Grid World (SA), Heaven/Hell (C), Heaven/Hell (S), Heaven/Hell (SA), Rock Sample (C), Rock Sample (NC), Rock Sample (SA); x-axis: iterations, y-axis: value of policy; legend: MMV, MMFE, PROJ, Opt.): The value of the policies produced by the learned reward function at each iteration by the algorithms of IRL for POMDP\R from sampled trajectories. The value is measured on the true reward function for each problem. The optimal value is denoted by Opt. in the legend.

Figure 10 (Tiger (SA)): The value of the policies produced by the learned reward function by the algorithms of IRL for POMDP\R from varying numbers of sampled trajectories. Averages over 100 trials are presented with 95% confidence intervals. The x-axis represents the number of sampled trajectories on a log10 scale.

8. Related Work

In control theory, recovering a reward function from demonstrations has received significant attention, and has been referred to as the inverse optimal control (IOC) problem. It was first proposed and studied for linear systems by Kalman (1964). IRL is closely related to IOC, but the focus is on the problem of inverse optimality within the framework of RL. As already mentioned in the introduction, Russell (1998) proposed IRL as an important problem in machine learning, suggesting that it will be useful in many research areas such as studies on animal and human behaviors, since the reward function reflects the objective and the preference of the decision maker. IRL is also useful for reinforcement learning, since similar but different domains often share the same reward function structure albeit different dynamics. In this case, transferring the reward function learned from one domain to another domain may be useful.

Besides the task of reward learning, IRL has gained interest in apprenticeship learning, where the task is to find a policy with possibly better performance than the one demonstrated by an expert. Apprenticeship learning is useful when explicitly specifying the reward function is difficult but the expert's behaviors are available instead. Apprenticeship learning is a promising approach in robotics since it provides a framework for a robot to imitate the demonstrator without a full specification of which states are good or bad, and to what degree. Since Russell (1998), a number of algorithms for IRL and apprenticeship learning have been proposed in the last decade. Most of the algorithms assume a completely observable setting, where the agent has the capability to access the true global state of the environment, often modeled as an MDP. In this section, we briefly review some of these previous works on the IRL and apprenticeship learning problems.

One of the first approaches to IRL in the MDP setting was proposed by Ng and Russell (2000), which we have covered in Section 3. They presented a sufficient and necessary condition on the reward functions which guarantees the optimality of the expert's policy, and provided some heuristics to choose a reward function, since degenerate reward functions also satisfy the optimality condition. The IRL problem was formulated as an LP with the constraints corresponding to the optimality condition and the objective function corresponding to the heuristics. The algorithm was shown to produce reasonably good solutions in experiments on some benchmark problems. We have extended this algorithm to the partially observable setting in Section 5 and Section 6.1.

Abbeel and Ng (2004) presented an apprenticeship learning algorithm based on IRL, which we have described in Section 3.2. One of the important aspects of the algorithm was to compare the feature expectations between the expert's and the learned policies rather than the estimated values. The algorithm comes with a theoretical guarantee that the learned policy is similar to the expert's policy when evaluated on the true reward function. The algorithm was shown to successfully learn different driving styles in a simulated car driving task. This work was further extended using a number of different approaches. We have extended this algorithm to the partially observable setting in Section 6.2 and Section 6.3.

The structured max-margin optimization technique (Taskar et al., 2005) was applied to apprenticeship learning by Ratliff et al. (2006). They formulated a QP problem to find the weight vector of the reward basis functions that maximizes the margin between the expert's policy and all other policies. They also provided the maximum margin planning (MMP) algorithm based on the subgradient method, which is faster than the QP method. MMP was shown to solve problems of practical size, such as route planning for outdoor mobile robots, where the QP method was not applicable.

Neu and Szepesvari (2007) proposed an algorithm for apprenticeship learning that unifies the direct and indirect methods: the direct method, using supervised learning, finds the policy that minimizes loss functions that penalize deviating from the expert's policy; the indirect method finds the policy using the reward function learned by IRL. Since the loss functions are defined on the policy space, the algorithm uses natural gradients to map the gradients in the policy space to those in the weight vector space of reward functions.

Whereas most apprenticeship learning algorithms focus on approximating the performance of the expert's policy, Syed and Schapire (2008) proposed a method called multiplicative weights for apprenticeship learning (MWAL), which tries to improve on the expert's policy. This was achieved in a game-theoretic framework using a two-person zero-sum game, where the learner selects a policy that maximizes its performance relative to the expert's and the environment adversarially selects a reward function that minimizes the performance of the learned policy. The game was solved using the multiplicative weights algorithm (Freund and Schapire, 1999) for finding approximately optimal strategies in zero-sum games.

One of the difficulties in apprenticeship learning is that most proposed algorithms involve solving MDPs in each iteration. Syed et al. (2008) addressed this issue by identifying the optimization performed in the MWAL algorithm and formulating it into an LP problem. They showed that this direct optimization approach using an off-the-shelf LP solver significantly improves the performance in terms of running time over the MWAL algorithm.

As mentioned in Section 4, IRL is an ill-posed problem since the solution of IRL is not unique. To address the non-uniqueness of the solution, the above approaches adopt some heuristics, for example, maximizing the margin between the expert's policy and other policies. We could also handle the uncertainty
in the reward function using probabilistic frameworks. Ramachandran and Amir (2007) suggested a Bayesian framework for IRL and apprenticeship learning. The external knowledge about the reward function is formulated in the prior, and the posterior is computed by updating the prior using the expert's behavior data as evidence. Ziebart et al. (2008) proposed an apprenticeship learning algorithm adopting the maximum entropy principle for choosing the learned policy, constrained to match the feature expectations of the expert's behavior.

Recently, Neu and Szepesvari (2009) provided a unified framework for interpreting a number of the incremental IRL algorithms listed above, and discussed the similarities and differences among the algorithms by defining the distance function and the update step employed in each algorithm. Each algorithm is characterized by the distance function that measures the difference between the expert's behavior data and the policy from the learned reward function, and the update step that computes new parameter values for the reward function.

The question of whether the IRL and apprenticeship learning algorithms listed above can be extended to the partially observable setting in an efficient way remains an important open problem.

9. Conclusion

The objective of IRL is to find the reward function that the domain expert is optimizing, from the given data of her or his behavior and the model of the environment. IRL will be useful in various areas connected with reinforcement learning such as animal and human behavior studies, econometrics, and intelligent agents. However, the applicability of IRL has been limited since most of the previous approaches employed the assumption of an omniscient agent using the MDP framework.

We presented an IRL framework for dealing with partially observable environments in order to relax the assumption of an omniscient agent in the previous IRL algorithms. First, we derived the constraints on the reward function that guarantee the optimality of the expert's policy and built optimization problems to solve IRL for POMDP\R when the expert's policy is explicitly given. Results from classical POMDP research, such as the generalized Howard's policy improvement theorem (Howard, 1960) and the witness theorem (Kaelbling et al., 1998), were exploited to reduce the computational complexity of the algorithms. Second, we proposed iterative algorithms of IRL for POMDP\R from the expert's trajectories. We proposed an algorithm that uses the max-margin between values via LP, and then, in order to address larger problems robustly, we adapted the algorithms for apprenticeship learning in the MDP framework to IRL for POMDP\R. Experimental results on several POMDP benchmark domains showed that, in most cases, our algorithms robustly find solutions close to the true reward function, generating policies that acquire values close to that of the expert's policy. We demonstrated that the classical IRL algorithm on MDP\R could be extended to POMDP\R, and we believe that more recent IRL techniques, as well as some of the IRL-based apprenticeship learning techniques, could be similarly extended by following our line of thought. However, there are a number of interesting issues that should be addressed in future studies.

9.1 Finding the Optimality Condition

The proposed conditions in Section 5 are not sufficient conditions on the reward function to guarantee the optimality of the expert's policy. The condition based on the comparison of Q-functions in Equation (14) should be evaluated for every possible policy that may have an infinite number of nodes. The condition using the DP update and the witness theorem in Equation (16) should be evaluated for some useful nodes that the expert's policy may not have due to their unreachability from the starting node. Also, Equations (14) and (16) should be extended to assess the value for all beliefs. Thus, it is crucial to find a sufficient condition that can be efficiently computed in order to restrict the feasible region of the reward functions tightly, so that the optimization problems can find a reward function that guarantees the optimality of the given expert's policy.

9.2 Building an Effective Heuristic

Although the constraints for the reward function are not sufficient conditions, we empirically showed that |V^{π_E}(b_0; R_L) − V^{π_L}(b_0; R_L)| = 0, which implies that the value of the expert's policy π_E is equal to that of the optimal policy π_L produced by the learned reward R_L when the value is evaluated on the learned reward. In other words, the expert's policy is another optimal policy for the learned reward, and the learned reward still satisfies the optimality condition of the expert's policy. However, in the failure cases the optimal policy for the learned reward does not achieve the same value as the expert's policy when the value is evaluated on the true reward. The reason for the algorithms' failure to find an appropriate reward function may lie in the shortcomings of the heuristic used for the objective functions. In this paper, we use the heuristic originally proposed by Ng and Russell (2000). It prefers the reward function that maximizes the sum of the differences between the value of the expert's policy and those of the other policies, while forcing the reward function to be as sparse as possible. Unfortunately, this heuristic failed in some cases in our experiments. Hence, a more effective heuristic should be devised to find a reward function that produces behavior similar to the expert's policy. This can be addressed by adapting more recent IRL approaches, such as Bayesian IRL (Ramachandran and Amir, 2007) and maximum entropy IRL (Ziebart et al., 2008), to partially observable environments. Bayesian IRL prefers the reward function that induces a high probability of executing the actions in the given behavior data, and maximum entropy IRL prefers the reward function that maximizes the entropy of the distribution over behaviors while matching the feature expectations.

9.3 Scalability

The algorithms we presented are categorized into two sets: the first is for the cases when the expert's policy is explicitly given in the FSC representation, and the second is for the cases when the trajectories of the expert's executed actions and the corresponding observations are given. For the first set of algorithms, the computational complexity is reduced based on the generalized Howard's policy improvement theorem (Howard, 1960) and the witness theorem (Kaelbling et al., 1998). The algorithms still suffer from a huge number of constraints in the optimization problem. The question is then whether it is possible to select a more compact set of constraints that defines the valid region of the reward function while guaranteeing the optimality of the expert's policy, which is again related to finding the sufficient condition. For the second set of algorithms, the scalability is more affected by the efficiency of the POMDP solver than by the number of constraints in the optimization problem. Although PBPI (Ji et al., 2007), the POMDP solver used in this paper, is known to be one of the fastest POMDP solvers that return FSC policies, it was observed in the experiments that the algorithms spent more than 95% of the time solving the intermediate POMDP problems. Computing an optimal policy for an intermediate POMDP problem takes a much longer time than solving a usual POMDP problem, since an optimal policy of the intermediate POMDP problem is often complex due to the complex reward structure. This limitation could be handled by modifying the algorithms to address the IRL problems with other POMDP solvers, such as HSVI (Smith and Simmons, 2005), Perseus (Spaan and Vlassis, 2005), PBVI (Pineau et al., 2006), and SARSOP (Kurniawati et al., 2008), which generate a policy defined as a mapping from beliefs to actions.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant 2009-0069702, and by the Defense Acquisition Program Administration and Agency for Defense Development of Korea under contract 09-01-03-04.
References

Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML), pages 1–8, Banff, Alta, Canada, 2004.

Tilman Borgers and Rajiv Sarin. Naive reinforcement learning with endogenous aspirations. International Economic Review, 41(4):921–950, 2000.

Anthony R. Cassandra, Leslie Pack Kaelbling, and Michael L. Littman. Acting optimally in partially observable stochastic domains. In Proceedings of the 12th National Conference on Artificial Intelligence (AAAI), pages 1023–1028, Seattle, WA, USA, 1994.

Jaedeug Choi and Kee-Eung Kim. Inverse reinforcement learning in partially observable environments. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), pages 1028–1033, Pasadena, CA, USA, 2009.

Michael X Cohen and Charan Ranganath. Reinforcement learning signals predict future decisions. Journal of Neuroscience, 27(2):371–378, 2007.

Ido Erev and Alvin E. Roth. Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria. American Economic Review, 88(4):848–881, 1998.

Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1–2):79–103, 1999.

Hector Geffner and Blai Bonet. Solving large POMDPs using real time dynamic programming. In Proceedings of the AAAI Fall Symposium Series, pages 61–68, 1998.

Jacques Hadamard. Sur les problemes aux derivees partielles et leur signification physique. Princeton University Bulletin, 13(1):49–52, 1902.

Eric A. Hansen. Finite-Memory Control of Partially Observable Systems. PhD thesis, University of Massachusetts Amherst, 1998.

Jesse Hoey, Axel Von Bertoldi, Pascal Poupart, and Alex Mihailidis. Assisting persons with dementia during handwashing using a partially observable Markov decision process. In Proceedings of the 5th International Conference on Vision Systems, Bielefeld University, Germany, 2007.

Ed Hopkins. Adaptive learning models of consumer behavior. Journal of Economic Behavior and Organization, 64(3–4):348–368, 2007.

Ronald A. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA, 1960.

Shihao Ji, Ronald Parr, Hui Li, Xuejun Liao, and Lawrence Carin. Point-based policy iteration. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence (AAAI), pages 1243–1249, Vancouver, BC, USA, 2007.
Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1–2):99–134, 1998.

Rudolf E. Kalman. When is a linear control system optimal? Transactions of the ASME, Journal of Basic Engineering, 86:51–60, 1964.

Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Proceedings of Robotics: Science and Systems, Zurich, Switzerland, 2008.

Terran Lane and Carla E. Brodley. An empirical study of two approaches to sequence learning for anomaly detection. Machine Learning, 51(1):73–107, 2003.

Daeyeol Lee, Michelle L. Conroy, Benjamin P. McGreevy, and Dominic J. Barraclough. Reinforcement learning and decision making in monkeys during a competitive game. Cognitive Brain Research, 22(1):45–58, 2004.

George E. Monahan. A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, 28(1):1–16, 1982.

P. Read Montague and Gregory S. Berns. Neural economics and the biological substrates of valuation. Neuron, 36(2):265–284, 2002.

Gergely Neu and Csaba Szepesvari. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI), pages 295–302, Vancouver, BC, Canada, 2007.

Gergely Neu and Csaba Szepesvari. Training parsers by inverse reinforcement learning. Machine Learning, pages 1–35, 2009.

Allen Newell. The knowledge level. Artificial Intelligence, 18(1):87–127, 1982.

Andrew Y. Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 663–670, Stanford University, Stanford, CA, USA, 2000.

Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning (ICML), pages 278–287, Bled, Slovenia, 1999.

Yael Niv. Reinforcement learning in the brain. Journal of Mathematical Psychology, 53(3):139–154, 2009.

Joelle Pineau, Geoffrey Gordon, and Sebastian Thrun. Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research, 27:335–380, 2006.

Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pages 2586–2591, Hyderabad, India, 2007.
Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, pages 729–736, Pittsburgh, Pennsylvania, USA, 2006.

Stuart Russell. Learning agents for uncertain environments (extended abstract). In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT), pages 101–103, Madison, WI, USA, 1998.

Trey Smith. Probabilistic Planning for Robotic Exploration. PhD thesis, Carnegie Mellon University, The Robotics Institute, 2007.

Trey Smith and Reid Simmons. Heuristic search value iteration for POMDPs. In Proceedings of the 20th Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 520–527, Banff, Canada, 2004.

Trey Smith and Reid Simmons. Point-based POMDP algorithms: Improved analysis and implementation. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 542–547, Edinburgh, Scotland, 2005.

Edward Jay Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University, 1971.

Matthijs T. J. Spaan and Nikos Vlassis. A point-based POMDP algorithm for robot planning. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 2399–2404, New Orleans, LA, USA, 2004.

Matthijs T. J. Spaan and Nikos Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195–220, 2005.

Umar Syed and Robert E. Schapire. A game-theoretic approach to apprenticeship learning. In Proceedings of Neural Information Processing Systems (NIPS), pages 1449–1456, Vancouver, British Columbia, Canada, 2008.

Umar Syed, Michael Bowling, and Robert Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning, pages 1032–1039, Helsinki, Finland, 2008.

Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. Learning structured prediction models: A large margin approach. In Proceedings of the 22nd International Conference on Machine Learning, pages 896–903, Bonn, Germany, 2005.

Jason D. Williams and Steve Young. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21(2):393–422, 2007.

Qing Zhao, Lang Tong, Ananthram Swami, and Yunxia Chen. Decentralized cognitive MAC for opportunistic spectrum access in ad hoc networks: A POMDP framework. IEEE Journal on Selected Areas in Communications, 25(3):589–600, 2007.

Brian D. Ziebart, Andrew Maas, James Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI), pages 1433–1438, Chicago, IL, USA, 2008.