欢迎来到天天文库
浏览记录
ID:33757306
大小:1.54 MB
页数:93页
时间:2019-02-28
《mit-introduction to numerical methods-hd》由会员上传分享,免费在线阅读,更多相关内容在教育资源-天天文库。
1、18.335Fall2008PerformanceExperimentswithMatrixMultiplicationStevenG.JohnsonHardware:2.66GHzIntelCore2Duo64-bitmode,doubleprecision,gcc4.1.2optimizedBLASdgemm:ATLAS3.6.0http://math-atlas.sourceforge.net/Atrivialproblem?C=ABm×pm×nn×pthe“obvious”Ccode:fori=1tomforj=1
2、top/*C=AB,whereAismxn,Bisnxp,nandCismxp,inrow-majororder*/C=ABvoidmatmul(constdouble*A,constdouble*B,ij∑ikkjdouble*C,intm,intn,intp)k=1{inti,j,k;for(i=0;i3、}}justthreeloops,howcomplicatedcanitget?flops/timeisnotconstant!(squarematrices,m=n=p)L1cache(2.66GHzprocessor?exceeded?why<1gigaflops?)L2cacheexceeded?L1cacheexceededforsinglerow?Notall“noise”israndomAllflopsarenotcreatedequalnearlypeaktheoreticalfloprate(2flops/4、cycleviaSSE2instructions)same#operationssameabstractalgorithmfactorof10inspeedThingstoremember•Wecannotunderstandperformancewithoutunderstandingmemoryefficiency(caches).–~10timesmoreimportantthanarithmeticcount•Computersaremorecomplicatedthanyouthink.•Evenatrivial5、algorithmisnontrivialtoimplementwell.–matrixmultiplication:10linesofcode→130,000+(ATLAS)MITOpenCourseWarehttp://ocw.mit.edu18.335J/6.337JIntroductiontoNumericalMethodsFall2010ForinformationaboutcitingthesematerialsorourTermsofUse,visit:http://ocw.mit.edu/terms.ide6、alcacheCPUmainmemoryZitemscachehit:CPUneedsitemincache(fast)cachemiss:CPUneedsitemnotincache—itemloadedintocacheforfutureuse,replacingsomeotheritemoptimalreplacement:oncachemiss,loadeditemreplacesitemthatwillnotbeneededforthelongesttimeinthefuture[morerealisticsch7、eme:LRUreplacement—replaceleastrecentlyuseditem—provablywithinsmallconstantfactorofoptimal,butmuchhardertoanalyze]fullyassociative—anyiteminmemorycangoanywhereinthecache[realcacheshavelimitedassociativity,whichcauses“unlucky”memory-accesspatternstogosameplaceincac8、he…effectivelyshrinkscacheinthesecases]temporallocality—sameitemisre-usedforseveralcomputationsthatareclosetooneanotherintime⇒stillin-cache⇒efficient[th
3、}}justthreeloops,howcomplicatedcanitget?flops/timeisnotconstant!(squarematrices,m=n=p)L1cache(2.66GHzprocessor?exceeded?why<1gigaflops?)L2cacheexceeded?L1cacheexceededforsinglerow?Notall“noise”israndomAllflopsarenotcreatedequalnearlypeaktheoreticalfloprate(2flops/
4、cycleviaSSE2instructions)same#operationssameabstractalgorithmfactorof10inspeedThingstoremember•Wecannotunderstandperformancewithoutunderstandingmemoryefficiency(caches).–~10timesmoreimportantthanarithmeticcount•Computersaremorecomplicatedthanyouthink.•Evenatrivial
5、algorithmisnontrivialtoimplementwell.–matrixmultiplication:10linesofcode→130,000+(ATLAS)MITOpenCourseWarehttp://ocw.mit.edu18.335J/6.337JIntroductiontoNumericalMethodsFall2010ForinformationaboutcitingthesematerialsorourTermsofUse,visit:http://ocw.mit.edu/terms.ide
6、alcacheCPUmainmemoryZitemscachehit:CPUneedsitemincache(fast)cachemiss:CPUneedsitemnotincache—itemloadedintocacheforfutureuse,replacingsomeotheritemoptimalreplacement:oncachemiss,loadeditemreplacesitemthatwillnotbeneededforthelongesttimeinthefuture[morerealisticsch
7、eme:LRUreplacement—replaceleastrecentlyuseditem—provablywithinsmallconstantfactorofoptimal,butmuchhardertoanalyze]fullyassociative—anyiteminmemorycangoanywhereinthecache[realcacheshavelimitedassociativity,whichcauses“unlucky”memory-accesspatternstogosameplaceincac
8、he…effectivelyshrinkscacheinthesecases]temporallocality—sameitemisre-usedforseveralcomputationsthatareclosetooneanotherintime⇒stillin-cache⇒efficient[th
此文档下载收益归作者所有