CUDA Matrix Multiplication Optimization
Matrix multiplication is a fundamental building block for scientific computation. When solving linear equations or finding eigenvalues, the workload is dominated by matrix multiplication. Because scientific computing is an important application domain for GPU computing, optimizing matrix multiplication on the GPU is key to achieving high performance in this domain.

Contents
1 The matrixMul problem
2 Serial Implementation on CPU
3 Naive Implementation on GPU
4 Increase Computation to Memory Ratio by Tiling
5 Memory Coalescing
6 Avoiding memory bank conflict
7 Multiply/Add Balancing
8 Loop unrolling
9 Prefetching

The matrixMul problem

Given an M x K matrix A and a K x N matrix B, multiply A by B and store the result in an M x N matrix C.

The matrixMul example on this page shows several techniques for optimizing matrix multiplication on the GPU. Most of them are generic and can be applied to other applications. These techniques are:

1. Tiling
2. Memory coalescing
3. Avoiding memory bank conflicts
4. Increasing the floating-point portion by using the outer product
5. Loop unrolling
6. Prefetching

The performance of these optimization techniques is shown in the figures below. We will start with a simple serial code running on the CPU, and then go through these optimizations step by step.

The source code of these examples is available as an attachment to this page. Unzip the package to the C/src path to compile.

Serial Implementation on CPU

    void main() {
        define A, B, C
        for i = 0 to M-1 do
            for j = 0 to N-1 do
                /* compute element C(i,j) */
                for k = 0 to K-1 do
                    C(i,j) = C(i,j) + A(i,k) * B(k,j)
                end
            end
        end
    }

To simplify the explanation, square matrices are used in this figure. The figure shows the memory footprint needed to compute a single element, C(3,11). This computation can be viewed as the inner product of one row of A and one column of B.

Naive Implementation on GPU

    /* Code running on the CPU */
    void main() {
        define A_cpu, B_cpu, C_cpu in the CPU memory
        define A_gpu, B_gpu, C_gpu in the GPU memory
        memcopy A_cpu to A_gpu
        memcopy B_cpu to B_gpu
        dim3 dimBlock(16, 16)
        dim3 dimGrid(N/dimBlock.x, M/dimBlock.y)
        matrixMul<<<dimGrid, dimBlock>>>(A_gpu, B_gpu, C_gpu, K)
        me
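The serial pseudocode and the naive GPU launch described above can be sketched in plain C. This is a minimal sketch, not the tutorial's own code: it assumes row-major storage, and the names matmul_serial, matmul_thread, matmul_grid_emulated and BLOCK are hypothetical. The per-thread body is an assumption as well, since the kernel itself does not appear in the text above; it follows the common convention of one thread per element of C. The grid is emulated with ordinary loops on the CPU, so the sketch only illustrates how block and thread indices map to C(row, col) and says nothing about GPU performance.

```c
#include <assert.h>

#define BLOCK 16  /* threads per block edge, mirroring dimBlock(16,16) */

/* Serial reference: C (M x N) = A (M x K) * B (K x N), row-major.
 * A local accumulator is used, so C does not need to be zeroed first. */
void matmul_serial(const float *A, const float *B, float *C,
                   int M, int N, int K)
{
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float sum = 0.0f;  /* accumulates C(i,j) */
            for (int k = 0; k < K; ++k)
                sum += A[i * K + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
    }
}

/* Assumed work of a single GPU thread: compute one element of C.
 * (bx, by) play the role of blockIdx, (tx, ty) of threadIdx. */
static void matmul_thread(const float *A, const float *B, float *C,
                          int N, int K, int bx, int by, int tx, int ty)
{
    int row = by * BLOCK + ty;  /* y dimension walks the M rows    */
    int col = bx * BLOCK + tx;  /* x dimension walks the N columns */
    float sum = 0.0f;
    for (int k = 0; k < K; ++k)
        sum += A[row * K + k] * B[k * N + col];
    C[row * N + col] = sum;
}

/* CPU emulation of the launch with dimGrid(N/BLOCK, M/BLOCK):
 * nested loops stand in for the blocks and threads that a GPU
 * would run in parallel. */
void matmul_grid_emulated(const float *A, const float *B, float *C,
                          int M, int N, int K)
{
    /* The integer divisions N/BLOCK and M/BLOCK only cover the whole
     * matrix when both dimensions are multiples of the block size. */
    assert(M % BLOCK == 0 && N % BLOCK == 0);

    for (int by = 0; by < M / BLOCK; ++by)
        for (int bx = 0; bx < N / BLOCK; ++bx)
            for (int ty = 0; ty < BLOCK; ++ty)
                for (int tx = 0; tx < BLOCK; ++tx)
                    matmul_thread(A, B, C, N, K, bx, by, tx, ty);
}
```

Calling both functions on the same inputs should give identical results, because each element is accumulated in the same order. In this mapping every thread reads a full row of A and a full column of B from memory to produce one output value, which is the low computation-to-memory ratio that the tiling step in the contents list is meant to improve.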