
Reinforcement Learning
Peter Bodík

Previous Lectures
• Supervised learning
  – classification, regression
• Unsupervised learning
  – clustering, dimensionality reduction
• Reinforcement learning
  – generalization of supervised learning
  – learn from interaction w/ environment to achieve a goal
[figure: agent/environment loop – the agent sends an action to the environment; the environment returns a reward and the new state]

Today
• examples
• defining a Markov Decision Process
  – solving an MDP using Dynamic Programming
• Reinforcement Learning
  – Monte Carlo methods
  – Temporal-Difference learning
• automatic resource allocation for in-memory database
• miscellaneous
  – state representation
  – function approximation, rewards

Robot in a room
• actions: UP, DOWN, LEFT, RIGHT
• UP: 80% move UP, 10% move LEFT, 10% move RIGHT
[figure: 4×3 grid with START at the lower left, +1 at [4,3], -1 at [4,2]]
• reward +1 at [4,3], -1 at [4,2]
• reward -0.04 for each step
• what's the strategy to achieve max reward?
• what if the actions were deterministic?

Other examples
• pole-balancing
• walking robot (applet)
• TD-Gammon [Gerry Tesauro]
• helicopter [Andrew Ng]
• no teacher who would say "good" or "bad"
  – is reward "10" good or bad?
  – rewards could be delayed
• explore the environment and learn from the experience
  – not just blind search, try to be smart about it

Outline
• examples
• defining a Markov Decision Process
  – solving an MDP using Dynamic Programming
• Reinforcement Learning
  – Monte Carlo methods
  – Temporal-Difference learning
• automatic resource allocation for in-memory database
• miscellaneous
  – state representation
  – function approximation, rewards

Robot in a room
• actions: UP, DOWN, LEFT, RIGHT; UP: 80% move UP, 10% move LEFT, 10% move RIGHT
• reward +1 at [4,3], -1 at [4,2]; reward -0.04 for each step
• states
• actions
• rewards
• what is the solution? (the dynamics are sketched in code below)
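To make these dynamics concrete, here is a minimal Python sketch of the room as an MDP environment. It is not from the slides: the [column,row] coordinates follow the slides' notation, but the blocked cell at (2,2) and the perpendicular-slip pattern for actions other than UP are assumptions carried over from the classic 4×3 gridworld this example resembles.

```python
import random

# Cells are (column, row) pairs, matching the slides' [4,3]-style notation.
COLS, ROWS = 4, 3
WALL = {(2, 2)}              # assumed blocked cell (not stated in this preview)
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
STEP_REWARD = -0.04
START = (1, 1)

MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
# The slides give the slip pattern only for UP (80/10/10); perpendicular
# slips for the other three actions are an assumption.
SLIPS = {"UP": ("LEFT", "RIGHT"), "DOWN": ("RIGHT", "LEFT"),
         "LEFT": ("DOWN", "UP"), "RIGHT": ("UP", "DOWN")}

def next_state(state, direction):
    """Deterministic move; bumping into the wall or the grid edge stays put."""
    dc, dr = MOVES[direction]
    c, r = state[0] + dc, state[1] + dr
    if not (1 <= c <= COLS and 1 <= r <= ROWS) or (c, r) in WALL:
        return state
    return (c, r)

def transitions(state, action):
    """Transition model P(s' | s, a), returned as a dict {s': probability}."""
    side1, side2 = SLIPS[action]
    probs = {}
    for direction, p in ((action, 0.8), (side1, 0.1), (side2, 0.1)):
        s2 = next_state(state, direction)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs

def step(state, action):
    """Sample one stochastic step; returns (new_state, reward, done)."""
    outcomes = transitions(state, action)
    s2 = random.choices(list(outcomes), weights=list(outcomes.values()))[0]
    if s2 in TERMINALS:
        return s2, TERMINALS[s2], True
    return s2, STEP_REWARD, False
```

This also makes the slide's deterministic-actions question easy to see: with deterministic actions the dict returned by `transitions` would collapse to a single entry, and the shortest path to [4,3] that avoids [4,2] would be optimal for any small step cost.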

Is this a solution?
[figure: one fixed action drawn in each cell, arrows leading toward +1]
• only if actions deterministic
  – not in this case (actions are stochastic)
• solution/policy
  – mapping from each state to an action

Optimal policy
[figure: five grids showing the optimal policy for per-step rewards of -2, -0.1, -0.04, -0.01, and +0.01; the policy changes as the per-step reward varies — compare the Monte Carlo estimates sketched below]
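Why the best policy shifts with the step reward is easy to check empirically. The sketch below (again an illustration, not slide material) does a crude Monte Carlo evaluation of one fixed policy by rolling out episodes in the environment above; `safe_policy` is a hypothetical example policy that takes the long route along the top row, away from the -1 square.

```python
import random

def rollout(policy, step_reward, start=(1, 1), max_steps=1000):
    """Cumulative reward of one episode under a fixed policy
    (uses transitions/TERMINALS from the environment sketch above)."""
    s, total = start, 0.0
    for _ in range(max_steps):
        outcomes = transitions(s, policy[s])
        s = random.choices(list(outcomes), weights=list(outcomes.values()))[0]
        if s in TERMINALS:
            return total + TERMINALS[s]
        total += step_reward
    return total   # episode truncated (matters when the step reward is positive)

def evaluate(policy, step_reward, episodes=10000):
    """Monte Carlo estimate of the policy's expected cumulative reward."""
    return sum(rollout(policy, step_reward) for _ in range(episodes)) / episodes

# Hypothetical policy: go up the left wall, then right along the top row.
safe_policy = {(1, 1): "UP", (1, 2): "UP", (1, 3): "RIGHT",
               (2, 1): "LEFT", (2, 3): "RIGHT",
               (3, 1): "LEFT", (3, 2): "UP", (3, 3): "RIGHT",
               (4, 1): "LEFT"}

for r in (-2, -0.1, -0.04, -0.01, +0.01):
    print(r, round(evaluate(safe_policy, r), 3))
```

With a step reward of -2 the detour is ruinous and a shorter, riskier route wins; at +0.01 the best behavior is to avoid both terminals for as long as possible. That is exactly the progression the optimal-policy figures illustrate.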

Markov Decision Process (MDP)
• set of states S, set of actions A, initial state S0
• transition model P(s' | s, a)
  – P([1,2] | [1,1], up) = 0.8
  – Markov assumption
[figure: agent/environment loop – reward, action, new state]
• reward function r(s)
  – r([4,3]) = +1
• goal: maximize cumulative reward in the long run
• policy: mapping from S to A
  – π(s) or π(s,a)
• r

(Only the first five pages of the document are available in this preview; download the document to view the full text.)
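The preview cuts off before the deck's dynamic-programming section, but the ingredients defined on this last slide (S, A, P(s'|s,a), r(s), π) are enough to sketch value iteration, the standard DP solver the outline points at. The code reuses the gridworld definitions above; the discount factor `gamma` does not appear in the previewed pages, so its use and value here are assumptions.

```python
def value_iteration(states, actions, step_reward=-0.04, gamma=0.99, eps=1e-6):
    """Bellman-backup sweeps until convergence:
    V(s) = r(s) + gamma * max_a sum_s' P(s'|s,a) V(s')."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in TERMINALS:
                V[s] = TERMINALS[s]      # terminal value is its reward
                continue
            best = max(sum(p * V[s2] for s2, p in transitions(s, a).items())
                       for a in actions)
            new_v = step_reward + gamma * best
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:
            return V

def greedy_policy(V, states, actions):
    """Extract pi(s) = argmax_a sum_s' P(s'|s,a) V(s')."""
    return {s: max(actions,
                   key=lambda a: sum(p * V[s2]
                                     for s2, p in transitions(s, a).items()))
            for s in states if s not in TERMINALS}

states = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) not in WALL]
V = value_iteration(states, list(MOVES))
pi = greedy_policy(V, states, list(MOVES))
```

Re-running this with `step_reward` set to each value from the optimal-policy figure (-2, -0.1, -0.04, -0.01, +0.01) should reproduce the progression shown there, from "rush to any exit" to "avoid the exits entirely".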
