Partially observable Markov decision processes
Matthijs Spaan
Institute for Systems and Robotics, Instituto Superior Técnico, Lisbon, Portugal
Reading group meeting, February 12, 2007 (1/22)

Overview (2/22)
- Partially observable Markov decision processes: model, belief states.
- MDP-based algorithms.
- Other sub-optimal algorithms.
- Optimal algorithms.
- Application to robotics.

A planning problem (3/22)
Task: start at a random position, pick up the mail at P, and deliver the mail at D.
Characteristics: motion noise, perceptual aliasing.

Planning under uncertainty (4/22)
- Uncertainty is abundant in real-world planning domains.
- Bayesian approach: probabilistic models.
- A common approach in robotics, e.g., robot localization.

POMDPs (5/22)
Partially observable Markov decision processes (POMDPs) (Kaelbling et al., 1998):
- A framework for agent planning under uncertainty.
- Typically assumes discrete sets of states S, actions A and observations O.
- Transition model p(s'|s,a): models the effect of actions.
- Observation model p(o|s,a): relates observations to states.
- The task is defined by a reward model r(s,a).
- The goal is to compute a plan, or policy, that maximizes long-term reward.

POMDP applications (6/22)
- Robot navigation (Simmons and Koenig, 1995; Theocharous and Mahadevan, 2002).
- Visual tracking (Darrell and Pentland, 1996).
- Dialogue management (Roy et al., 2000).
- Robot-assisted health care (Pineau et al., 2003b; Boger et al., 2005).
- Machine maintenance (Smallwood and Sondik, 1973), structural inspection (Ellis et al., 1995).
- Inventory control (Treharne and Sox, 2002), dynamic pricing strategies (Aviv and Pazgal, 2005), marketing campaigns (Rusmevichientong and Van Roy, 2001).
- Medical applications (Hauskrecht and Fraser, 2000; Hu et al., 1996).

Transition model (7/22)
For instance, robot motion is inaccurate, so transitions between states are stochastic. p(s'|s,a) is the probability of jumping from state s to state s' after taking action a.

Observation model (8/22)
Sensors are imperfect, which makes the environment partially observable:
- Sensors are noisy.
- Sensors have a limited view.
p(o|s,a) is the probability that the agent receives observation o in state s after taking action a.
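The model ingredients above (states, actions, observations; transition, observation and reward models) can be written down concretely. Below is a minimal sketch in plain Python dictionaries; the two-state model, all names, and all probabilities are made-up illustrations, not the presenter's mail-delivery domain.

```python
# Illustrative discrete POMDP: (S, A, O, transition model T, observation
# model Z, reward model R). All numbers here are invented for the sketch.

STATES = ["s1", "s2"]
ACTIONS = ["move", "stay"]
OBSERVATIONS = ["door", "corridor"]

# Transition model p(s'|s,a): actions have stochastic effects.
T = {
    ("s1", "move"): {"s1": 0.2, "s2": 0.8},
    ("s2", "move"): {"s1": 0.8, "s2": 0.2},
    ("s1", "stay"): {"s1": 1.0, "s2": 0.0},
    ("s2", "stay"): {"s1": 0.0, "s2": 1.0},
}

# Observation model p(o|s,a): noisy, limited-view sensing of the state.
Z = {
    ("s1", "move"): {"door": 0.9, "corridor": 0.1},
    ("s2", "move"): {"door": 0.1, "corridor": 0.9},
    ("s1", "stay"): {"door": 0.9, "corridor": 0.1},
    ("s2", "stay"): {"door": 0.1, "corridor": 0.9},
}

# Reward model r(s,a): defines the task.
R = {(s, a): 0.0 for s in STATES for a in ACTIONS}
R[("s2", "move")] = 1.0

def is_distribution(d, tol=1e-9):
    """Check that probabilities are non-negative and sum to one."""
    return all(p >= 0 for p in d.values()) and abs(sum(d.values()) - 1.0) < tol

# Sanity check: every row of T and Z must be a probability distribution.
assert all(is_distribution(T[k]) for k in T)
assert all(is_distribution(Z[k]) for k in Z)
```

Representing each of p(s'|s,a) and p(o|s,a) as a dictionary of rows makes the distribution check trivial and mirrors the tabular models assumed on the slides.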
Memory (9/22)
A POMDP example that requires memory (Singh et al., 1994):
[Figure: two states s1 and s2; in each state one action yields reward +r and switches states, while the other action yields -r.]

Method | Value
MDP policy | V = r/(1-γ)
Memoryless deterministic POMDP policy | V_max = r - γr/(1-γ)
Memoryless stochastic POMDP policy | V = 0
Memory-based POMDP policy | V_min = γr/(1-γ) - r

Beliefs (10/22)
The agent maintains a belief b(s) of being at state s. After action a ∈ A and observation o ∈ O the belief b(s) can be updated using Bayes' rule:

    b'(s') ∝ p(o|s',a) Σ_s p(s'|s,a) b(s)

The belief vector is a Markov signal for the planning task.

Belief update example (11/22)
[Figure, animated over four slides: the robot's true position in a corridor with doors, and a bar plot of the robot's belief over states.]
Observations: door or corridor, 10% noise.
Action: moves 3 (20%), 4 (60%), or 5 (20%) states.

Solving POMDPs (12/22)
A solution to a POMDP is a policy, i.e., a mapping a = π(b) from beliefs to actions. An optimal policy is characterized by a value function that maximizes

    V(b_0) = E[ Σ_{t=0}^∞ γ^t r(b_t, π(b_t)) ]

Computing the optimal value function is a hard problem (PSPACE-complete for finite horizon). In robotics, a policy is often computed using simple MDP-based approximations.

MDP-based algorithms (13/22)
Use the solution to the underlying MDP as a heuristic.
- Most likely state (Cassandra et al., 1996): π_MLS(b) = π*(argmax_s b(s)).
- QMDP (Littman et al., 1995): π_QMDP(b) = argmax_a Σ_s b(s) Q(s,a).
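The Bayes belief update and the QMDP heuristic above can be sketched in a few lines. The two-state model and all probabilities below are invented for illustration; they are not the presenter's corridor domain.

```python
# Sketch of the belief update b'(s') ∝ p(o|s',a) Σ_s p(s'|s,a) b(s),
# and of the QMDP heuristic argmax_a Σ_s b(s) Q(s,a).

def belief_update(b, a, o, T, Z):
    """Bayes' rule from the slides, followed by normalization."""
    new_b = {}
    for s2 in b:
        predicted = sum(T[(s, a)][s2] * b[s] for s in b)  # Σ_s p(s'|s,a) b(s)
        new_b[s2] = Z[(s2, a)][o] * predicted             # weight by p(o|s',a)
    norm = sum(new_b.values())  # this is p(o|b,a); zero means o is impossible
    return {s: p / norm for s, p in new_b.items()}

def qmdp_action(b, Q, actions):
    """QMDP: pick argmax_a Σ_s b(s) Q(s,a), with Q from the underlying MDP."""
    return max(actions, key=lambda a: sum(b[s] * Q[(s, a)] for s in b))

# Illustrative two-state model (made-up numbers).
T = {("s1", "move"): {"s1": 0.2, "s2": 0.8},
     ("s2", "move"): {"s1": 0.8, "s2": 0.2}}
Z = {("s1", "move"): {"door": 0.9, "corridor": 0.1},
     ("s2", "move"): {"door": 0.1, "corridor": 0.9}}

b = {"s1": 0.5, "s2": 0.5}
b = belief_update(b, "move", "door", T, Z)  # seeing a door favours s1
```

Starting from the uniform belief, observing "door" shifts the belief to 0.9/0.1 in favour of s1, which is exactly the Markov-signal property the slide refers to: the updated vector summarizes the whole history.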
[Figure: example domain from Parr and Russell (1995).]

Other sub-optimal techniques (14/22)
- Grid-based approximations (Drake, 1962; Lovejoy, 1991; Brafman, 1997; Zhou and Hansen, 2001; Bonet, 2002).
- Optimizing finite-state controllers (Platzman, 1981; Hansen, 1998b; Poupart and Boutilier, 2004).
- Gradient ascent (Ng and Jordan, 2000; Aberdeen and Baxter, 2002).
- Heuristic search in the belief tree (Satia and Lave, 1973; Hansen, 1998a; Smith and Simmons, 2004).
- Compressing the POMDP (Roy et al., 2005; Poupart and Boutilier, 2003).
- Point-based techniques (Pineau et al., 2003a; Spaan and Vlassis, 2005).

Optimal value functions (15/22)
The optimal value function of a (finite-horizon) POMDP is piecewise linear and convex:

    V(b) = max_α α · b

[Figure: value function over the belief simplex, the upper surface of a set of linear α-vectors.]

Exact value iteration (16/22)
Value iteration computes a sequence of value function estimates: V_1, V_2, ..., V_n.
[Figure: successive estimates V_1, V_2, V_3 over the belief simplex.]

Optimal POMDP methods (17/22)
Enumerate and prune:
- Most straightforward: Monahan (1982)'s enumeration algorithm. It generates a maximum of |A| |V_n|^|O| vectors at each iteration, hence requires pruning.
- Incremental pruning (Zhang and Liu, 1996; Cassandra et al., 1997).
Search for witness points:
- One Pass (Sondik, 1971; Smallwood and Sondik, 1973).
- Relaxed Region, Linear Support (Cheng, 1988).
- Witness (Cassandra et al., 1994).

Vector pruning (18/22)
[Figure: α-vectors over the belief simplex, with witness beliefs b1 and b2.]
Linear program for pruning, testing whether a vector α is dominated:
- variables: b(s) for all s ∈ S, and x
- maximize: x
- subject to: b · (α - α') ≥ x for all α' ∈ V, α' ≠ α, and b ∈ Δ(S)

High-dimensional sensor readings (19/22)
Omnidirectional camera images. Dimension reduction:
- Collect a database of images and record their locations.
- Apply Principal Component Analysis to the image data.
- Project each image onto the first 3 eigenvectors, resulting in a 3D feature vector for each image.

Observation model (20/22)
We cluster the feature vectors into 10 prototype observations. We compute a discrete observation model p(o|s,a) by a histogram operation.

States, actions and rewards (21/22)
- State: s = (x, m), with x the robot's location and m the mail bit. X is gridded into 500 locations.
- Actions: {up, right, down, left, pickup, deliver}.
- Positive reward: only upon successful mail delivery.
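The α-vector machinery above can be sketched briefly: evaluating the piecewise-linear convex value function V(b) = max_α α·b, and a pruning step. Note that the sketch prunes only pointwise-dominated vectors, which is weaker than the linear program on the slide (the LP also finds vectors dominated jointly by several others); all vectors below are illustrative.

```python
# α-vectors as tuples over two states; beliefs as tuples summing to one.

def value(b, alphas):
    """V(b) = max over α-vectors of the inner product α·b."""
    return max(sum(a_i * b_i for a_i, b_i in zip(alpha, b)) for alpha in alphas)

def prune_pointwise(alphas):
    """Remove vectors that are pointwise-dominated by another vector."""
    kept = []
    for alpha in alphas:
        # Skip α if some kept vector is at least as good in every state.
        if any(all(k_i >= a_i for k_i, a_i in zip(k, alpha)) for k in kept):
            continue
        # Drop kept vectors that the new α dominates everywhere.
        kept = [k for k in kept
                if not all(a_i >= k_i for a_i, k_i in zip(alpha, k))]
        kept.append(alpha)
    return kept

alphas = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (0.3, 0.2)]
pruned = prune_pointwise(alphas)   # (0.3, 0.2) is dominated by (0.5, 0.5)
```

The example also shows why the LP test is needed in practice: (0.5, 0.5) survives the pointwise check even though the first two vectors jointly dominate it at every belief, so only the LP would remove it.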
References (22/22)
D. Aberdeen and J. Baxter. Scaling internal-state policy-gradient methods for POMDPs. In International Conference on Machine Learning, 2002.
Y. Aviv and A. Pazgal. A partially observed Markov decision process for dynamic pricing. Management Science, 51(9):1400-1416, 2005.
J. Boger, P. Poupart, J. Hoey, C. Boutilier, G. Fernie, and A. Mihailidis. A decision-theoretic approach to task assistance for persons with dementia. In Proc. Int. Joint Conf. on Artificial Intelligence, 2005.
B. Bonet. An epsilon-optimal grid-based algorithm for partially observable Markov decision processes. In International Conference on Machine Learning, 2002.
R. I. Brafman. A heuristic variable grid solution method for POMDPs. In Proc. of the National Conference on Artificial Intelligence, 1997.
A. R. Cassandra, L. P. Kaelbling, and M. L. Littman. Acting optimally in partially observable stochastic domains. In Proc. of the National Conference on Artificial Intelligence, 1994.
A. R. Cassandra, L. P. Kaelbling, and J. A. Kurien. Acting under uncertainty: Discrete Bayesian models for mobile robot navigation. In Proc. of International Conference on Intelligent Robots and Systems, 1996.
A. R. Cassandra, M. L. Littman, and N. L. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proc. of Uncertainty in Artificial Intelligence, 1997.
H. T. Cheng. Algorithms for partially observable Markov decision processes. PhD thesis, University of British Columbia, 1988.
T. Darrell and A. Pentland. Active gesture recognition using partially observable Markov decision processes. In Proc. of the 13th Int. Conf. on Pattern Recognition, 1996.
A. W. Drake. Observation of a Markov process through a noisy channel. Sc.D. thesis, Massachusetts Institute of Technology, 1962.
J. H. Ellis, M. Jiang, and R. Corotis. Inspection, maintenance, and repair with partial observability. Journal of Infrastructure Systems, 1(2):92-99, 1995.
E. A. Hansen. Finite-memory control of partially observable systems. PhD thesis, University of Massachusetts, Amherst, 1998a.
E. A. Hansen. Solving POMDPs by searching in policy space. In Proc. of Uncertainty in Artificial Intelligence, 1998b.
M. Hauskrecht and H. Fraser. Planning treatment of ischemic heart disease with partially observable Markov decision processes. Artificial Intelligence in Medicine, 18:221-244, 2000.
C. Hu, W. S. Lovejoy, and S. L. Shafer. Comparison of some suboptimal control policies in medical drug therapy. Operations Research, 44(5):696-709, 1996.
L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134, 1998.
M. L. Littman, A. R. Cassandra, and L. P. Kaelbling. Learning policies for partially observable environments: Scaling up. In International Conference on Machine Learning, 1995.
W. S. Lovejoy. Computationally feasible bounds for partially observed Markov decision processes. Operations Research, 39(1):162-175, 1991.
G. E. Monahan. A survey of partially observable Markov decision processes: theory, models and algorithms. Management Science, 28(1), Jan. 1982.
A. Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proc. of Uncertainty in Artificial Intelligence, 2000.
R. Parr and S. Russell. Approximating optimal policies for partially observable stochastic domains. In Proc. Int. Joint Conf. on Artificial Intelligence, 1995.
J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Proc. Int. Joint Conf. on Artificial Intelligence, 2003a.
J. Pineau, M. Montemerlo, M. Pollack, N. Roy, and S. Thrun. Towards robotic assistants in nursing homes: Challenges and results. Robotics and Autonomous Systems, 42(3-4):271-281, 2003b.
L. K. Platzman. A feasible computational approach to infinite-horizon partially-observed Markov decision problems. Technical Report J-81-2, School of Industrial and Systems Engineering, Georgia Institute of Technology, 1981. Reprinted in working notes, AAAI 1998 Fall Symposium on Planning with POMDPs.
P. Poupart and C. Boutilier. Value-directed compression of POMDPs. In Advances in Neural Information Processing Systems 15. MIT Press, 2003.
P. Poupart and C. Boutilier. Bounded finite state controllers. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.
N. Roy, J. Pineau, and S. Thrun. Spoken dialog management for robots. In Proc. of the Association for Computational Linguistics, 2000.
N. Roy, G. Gordon, and S. Thrun. Finding approximate POMDP solutions through belief compression. Journal of Artificial Intelligence Research, 23:1-40, 2005.
P. Rusmevichientong and B. Van Roy. A tractable POMDP for a class of sequencing problems. In Proc. of Uncertainty in Artificial Intelligence, 2001.
J. K. Satia and R. E. Lave. Markovian decision processes with probabilistic observation of states. Management Science, 20(1), 1973.
R. Simmons and S. Koenig. Probabilistic robot navigation in partially observable environments. In Proc. Int. Joint Conf. on Artificial Intelligence, 1995.
S. Singh, T. Jaakkola, and M. Jordan. Learning without state-estimation in partially observable Markovian decision processes. In International Conference on Machine Learning, 1994.
R. D. Smallwood and E. J. Sondik. The optimal control of partially observable Markov decision processes over a finite horizon. Operations Research, 21:1071-1088, 1973.
T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In Proc. of Uncertainty in Artificial Intelligence, 2004.
E. J. Sondik. The optimal control of partially observable Markov processes. PhD thesis, Stanford University, 1971.
M. T. J. Spaan and N. Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195-220, 2005.
G. Theocharous and S. Mahadevan. Approximate planning with hierarchical partially observable Markov decision processes for robot navigation. In Proceedings of the IEEE International Conference on Robotics and Automation, 2002.
J. T. Treharne and C. R. Sox. Adaptive inventory control for nonstationary demand and partial information. Management Science, 48(5):607-624, 2002.
N. L. Zhang and W. Liu. Planning in stochastic domains: problem characteristics and approximations. Technical Report HKUST-CS96-31, Department of Computer Science, The Hong Kong University of Science and Technology, 1996.
R. Zhou and E. A. Hansen. An improved grid-based approximation algorithm for POMDPs. In Proc. Int. Joint Conf. on Artificial Intelligence, 2001.