Resource description:
"Generalized Model Learning for Reinforcement Learning on a Humanoid Robot" (Austin Villa, 2010).
In IEEE International Conference on Robotics and Automation (ICRA 2010), Anchorage, Alaska, May 2010.

Generalized Model Learning for Reinforcement Learning on a Humanoid Robot

Todd Hester, Michael Quinlan, and Peter Stone
Department of Computer Science, The University of Texas at Austin, Austin, TX 78712
{todd, mquinlan, pstone}@cs.utexas.edu

Abstract: Reinforcement learning (RL) algorithms have long been promising methods for enabling an autonomous robot to improve its behavior on sequential decision-making tasks. The obvious enticement is that the robot should be able to improve its own behavior without the need for detailed step-by-step programming. However, for RL to reach its full potential, the algorithms must be sample efficient: they must learn competent behavior from very few real-world trials. From this perspective, model-based methods, which use experiential data more efficiently than model-free approaches, are appealing. But they often require exhaustive exploration to learn an accurate model of the domain. In this paper, we present an algorithm, Reinforcement Learning with Decision Trees (RL-DT), that uses decision trees to learn the model by generalizing the relative effect of actions across states. The agent explores the environment until it believes it has a reasonable policy. The combination of the learning approach with the targeted exploration policy enables fast learning of the model. We compare RL-DT against standard model-free and model-based learning methods, and demonstrate its effectiveness on an Aldebaran Nao humanoid robot scoring goals in a penalty kick scenario.

Fig. 1. One of the penalty kicks during the semi-finals of RoboCup 2009.
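To make the abstract's central idea concrete, the sketch below learns the relative effect of actions (the change in state rather than the absolute next state) with a decision tree, so that an effect observed in a few states generalizes to unvisited ones. This is only an illustration of the idea under simplifying assumptions; the toy corridor domain, the use of scikit-learn, and all names below are not from the paper and do not reproduce the authors' RL-DT implementation.

```python
# Illustrative sketch (not the paper's RL-DT code): predict the *relative*
# effect of an action with a decision tree so it generalizes across states.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical 1-D corridor: the state is the agent's cell, actions are 0=left, 1=right.
# A few observed transitions (state, action, next_state); cell 9 is a wall.
transitions = [(2, 1, 3), (5, 1, 6), (7, 0, 6), (3, 0, 2), (9, 1, 9)]

X = np.array([[s, a] for s, a, _ in transitions])
y = np.array([s_next - s for s, a, s_next in transitions])  # relative effect of the action

effect_model = DecisionTreeClassifier().fit(X, y)

# The tree can predict the effect of an action in states it never visited,
# because it splits on the action (and on special states such as the wall)
# instead of memorizing each (state, action) pair separately.
for s, a in [(4, 1), (6, 0), (0, 1)]:
    predicted_next = s + int(effect_model.predict([[s, a]])[0])
    print(f"state {s}, action {a} -> predicted next state {predicted_next}")
```

Because the tree predicts something like "move one cell right" as a function of the action rather than storing a separate outcome for every state, a handful of transitions is enough to predict outcomes in unvisited states, which is the sample-efficiency argument the abstract makes.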
I. INTRODUCTION

As the tasks that we desire robots to perform become more complex, and as robots become capable of operating autonomously for longer periods of time, we will need to move from hand-coded s…

Value-function methods can themselves be divided into model-free algorithms, such as Q-LEARNING [6], that are computationally cheap, but ignore the dynamics of the world, thus requiring lots of experience; and model-based algorithms, such as R-MAX [7], that learn an explicit domain model and then use it to find the optimal actions via simulation in the model. Model-based reinforcement learning, though computationally more intensive, gives the agent the…
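For readers less familiar with the distinction drawn here, the snippet below shows the standard tabular Q-LEARNING backup in its textbook form (it is not code from this paper). Each real transition updates only the single Q(s, a) entry it touched, which is why model-free methods are cheap per step but need many real-world samples; a model-based method such as R-MAX can instead replay simulated transitions from its learned model. The constants and action set are illustrative assumptions.

```python
# Standard tabular Q-learning update (textbook form, not from the paper).
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.95       # learning rate and discount factor (illustrative values)
Q = defaultdict(float)         # Q[(state, action)], defaults to 0.0
ACTIONS = [0, 1]               # hypothetical action set

def q_learning_update(s, a, r, s_next):
    """One backup: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

# Example: one observed transition from state 3, taking action 1, reward 0.0, landing in state 4.
q_learning_update(3, 1, 0.0, 4)
```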