Reinforcement Learning
Peter Bodík

Previous Lectures
• Supervised learning – classification, regression
• Unsupervised learning – clustering, dimensionality reduction
• Reinforcement learning
  – generalization of supervised learning
  – learn from interaction with the environment to achieve a goal
[diagram: agent–environment loop – the agent sends an action to the environment, which returns a reward and a new state]
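The loop in this diagram is easy to write down. A minimal sketch, assuming a hypothetical environment object with reset() and step(action) returning (state, reward, done); this interface is an assumption for illustration, not something the slides define (a concrete environment in this shape follows the robot-in-a-room slide below):

```python
import random

class RandomAgent:
    """Placeholder agent: picks actions uniformly at random (no learning yet)."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        return random.choice(self.actions)

def run_episode(env, agent, max_steps=100):
    """One pass around the loop in the diagram."""
    state = env.reset()                          # environment hands out the initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                # agent -> environment: action
        state, reward, done = env.step(action)   # environment -> agent: reward, new state
        total_reward += reward
        if done:
            break
    return total_reward
```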
Today
• examples
• defining a Markov Decision Process
  – solving an MDP using Dynamic Programming
• Reinforcement Learning
  – Monte Carlo methods
  – Temporal-Difference learning
• automatic resource allocation for in-memory database
• miscellaneous
  – state representation
  – function approximation, rewards

Robot in a room
actions: UP, DOWN, LEFT, RIGHT
UP: 80% move UP, 10% move LEFT, 10% move RIGHT
[diagram: 4×3 grid with START, +1 at [4,3], -1 at [4,2]]
• reward +1 at [4,3], -1 at [4,2]
• reward -0.04 for each step
• what's the strategy to achieve max reward?
• what if the actions were deterministic?
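A minimal sketch of this environment, matching the reset()/step() interface assumed in the loop sketch above. The slide only gives the slip model for UP; applying it symmetrically to the other actions, and the standard 4×3 layout with a wall at [2,2] and START at [1,1] (which the slide's figure appears to show), are assumptions:

```python
import random

MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
# Slip directions: 80/10/10 as the slide gives for UP, extended
# symmetrically to the other three actions (an assumption).
SLIPS = {"UP": ("LEFT", "RIGHT"), "DOWN": ("RIGHT", "LEFT"),
         "LEFT": ("DOWN", "UP"), "RIGHT": ("UP", "DOWN")}

class RoomEnv:
    WIDTH, HEIGHT = 4, 3
    WALL = (2, 2)                      # assumed obstacle, from the figure
    TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
    STEP_REWARD = -0.04

    def reset(self):
        self.state = (1, 1)            # START (assumed lower-left corner)
        return self.state

    def step(self, action):
        # 80% move as intended, 10% slip to each side.
        u = random.random()
        d = action if u < 0.8 else SLIPS[action][0] if u < 0.9 else SLIPS[action][1]
        nxt = (self.state[0] + MOVES[d][0], self.state[1] + MOVES[d][1])
        # Bumping into the wall or the grid edge leaves the robot in place.
        if nxt == self.WALL or not (1 <= nxt[0] <= self.WIDTH
                                    and 1 <= nxt[1] <= self.HEIGHT):
            nxt = self.state
        self.state = nxt
        done = nxt in self.TERMINALS   # episode ends on entering +1 or -1
        return nxt, (self.TERMINALS[nxt] if done else self.STEP_REWARD), done
```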
Other examples
• pole-balancing
• walking robot (applet)
• TD-Gammon [Gerry Tesauro]
• helicopter [Andrew Ng]
• no teacher who would say “good” or “bad”
  – is reward “10” good or bad?
  – rewards could be delayed
• explore the environment and learn from the experience
  – not just blind search, try to be smart about it

Robot in a room (revisited)
actions: UP, DOWN, LEFT, RIGHT; UP: 80% move UP, 10% move LEFT, 10% move RIGHT
reward +1 at [4,3], -1 at [4,2]; reward -0.04 for each step
• states
• actions
• rewards
• what is the solution?
Is this a solution?
[diagram: one arrow per state in the grid, +1 and -1 terminals]
• only if actions were deterministic
  – not in this case (actions are stochastic)
• solution/policy
  – mapping from each state to an action

Optimal policy
[diagram: the optimal action in each state of the grid, +1 and -1 terminals]

Reward for each step: -2
Reward for each step: -0.1
Reward for each step: -0.04
Reward for each step: -0.01
Reward for each step: +0.01
[diagrams: the optimal policy for each of these per-step rewards]
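These policies can be reproduced with the Dynamic Programming method the outline refers to. A minimal value-iteration sketch, reusing the assumed 4×3 layout (wall at [2,2]) and symmetric slip model from the environment sketch above:

```python
# Value iteration on the 4x3 grid: compute V(s), then read off the greedy
# policy. Rerunning with different per-step rewards reproduces the policy
# shifts the slides show.
WIDTH, HEIGHT = 4, 3
WALL = (2, 2)                                # assumed obstacle
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
SLIPS = {"UP": ("LEFT", "RIGHT"), "DOWN": ("RIGHT", "LEFT"),
         "LEFT": ("DOWN", "UP"), "RIGHT": ("UP", "DOWN")}
STATES = [(x, y) for x in range(1, WIDTH + 1)
          for y in range(1, HEIGHT + 1) if (x, y) != WALL]

def move(s, d):
    nxt = (s[0] + MOVES[d][0], s[1] + MOVES[d][1])
    ok = nxt != WALL and 1 <= nxt[0] <= WIDTH and 1 <= nxt[1] <= HEIGHT
    return nxt if ok else s                  # bumping a wall: stay in place

def transitions(s, a):
    # P(s'|s, a): 80% intended direction, 10% each sideways slip.
    return [(0.8, move(s, a)),
            (0.1, move(s, SLIPS[a][0])),
            (0.1, move(s, SLIPS[a][1]))]

def value_iteration(step_reward, gamma=1.0, sweeps=100):
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        for s in STATES:
            if s in TERMINALS:
                V[s] = TERMINALS[s]          # exit reward, episode ends here
            else:
                V[s] = step_reward + gamma * max(
                    sum(p * V[t] for p, t in transitions(s, a)) for a in MOVES)
    # Greedy policy with respect to the converged values.
    return {s: max(MOVES, key=lambda a: sum(p * V[t] for p, t in transitions(s, a)))
            for s in STATES if s not in TERMINALS}

# With +0.01 per step (and no discount) the values grow without bound: the
# robot would rather wander forever than exit, so only the negative
# per-step rewards are solved here.
for r in (-2, -0.1, -0.04, -0.01):
    print(r, value_iteration(r)[(3, 2)])     # action chosen next to the -1 exit
```

The printed action should flip as the step penalty grows: a mild penalty makes the robot detour around the -1 state, while a heavy one makes it take the nearest exit even at the cost of the -1.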
Markov Decision Process (MDP)
[diagram: agent–environment loop, as before]
• set of states S, set of actions A, initial state S0
• transition model P(s'|s, a)
  – P([1,2] | [1,1], up) = 0.8
  – Markov assumption
• reward function r(s)
  – r([4,3]) = +1
• goal: maximize cumulative reward in the long run
• policy: mapping from S to A
  – π(s) or π(s, a)
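Written out as code, these five components map onto a small structure. A minimal sketch, with the transition model as a dictionary from (state, action) to a distribution over next states; the wall cell and the two 0.1 slip entries below extend the slide's 80/10/10 model for up and are assumptions:

```python
from typing import Dict, Tuple

State = Tuple[int, int]    # grid cells such as (1, 1)
Action = str               # "up", "down", "left", "right"

S = [(x, y) for x in range(1, 5)
     for y in range(1, 4) if (x, y) != (2, 2)]           # states S (minus assumed wall)
A = ["up", "down", "left", "right"]                      # set of actions A
S0: State = (1, 1)                                       # initial state S0
r: Dict[State, float] = {(4, 3): +1.0, (4, 2): -1.0}     # reward function r(s)

# Transition model P(s'|s, a). Keying it by (s, a) alone *is* the Markov
# assumption: the next state depends only on the current state and action.
# From [1,1], "up": 0.8 to [1,2]; the left slip bumps the wall (stay at
# [1,1]) and the right slip lands on [2,1].
P: Dict[Tuple[State, Action], Dict[State, float]] = {
    ((1, 1), "up"): {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1},
}
print(P[((1, 1), "up")][(1, 2)])    # P([1,2] | [1,1], up) = 0.8

# A policy maps states to actions: deterministic pi(s), or stochastic
# pi(s, a) giving the probability of each action in each state.
pi: Dict[State, Action] = {(1, 1): "up"}
```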