Tsinghua Cloud Computing Courseware: Distributed Clusters
Distributed Computing Seminar
Lecture 4: Clustering – an Overview and Sample MapReduce Implementation
Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet
Google, Inc.
Summer 2007
Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.

Outline
    Clustering Intuition
    Clustering Algorithms
    The Distance Measure
    Hierarchical vs. Partitional
    K-Means Clustering
    Complexity
    Canopy Clustering
    MapReducing a large data set with K-Means and Canopy Clustering

Clustering
    What is clustering?

Google News
    They didn’t pick all 3,400,217 related articles by hand…
    Or Amazon.com
    Or Netflix…

Other less glamorous things...
    Hospital Records
    Scientific Imaging
        Related genes, related stars, related sequences
    Market Research
        Segmenting markets, product positioning
    Social Network Analysis
    Data mining
    Image segmentation…

The Distance Measure
    How the similarity of two elements in a set is determined, e.g.
        Euclidean Distance
        Manhattan Distance
        Inner Product Space
        Maximum Norm
        Or any metric you define over the space…

Hierarchical Clustering vs. Partitional Clustering

Types of Algorithms
    Hierarchical Clustering
        Builds or breaks up a hierarchy of clusters.
    Partitional Clustering
        Partitions set into all clusters simultaneously.

K-Means Clustering
    Super simple Partitional Clustering
    Choose the number of clusters, k
    Choose k points to be cluster centers
    Then…

K-Means Clustering
    iterate {
        Compute distance from all points to all k-centers
        Assign each point to the nearest k-center
        Compute the average of all points assigned to all specific k-centers
        Replace the k-centers with the new averages
    }

But!
    The complexity is pretty high: k * n * O(distance metric) * num(iterations)
    Moreover, it can be necessary to send tons of data to each Mapper Node. Depending on your bandwidth and memory available, this could be impossible.

Furthermore
    There are three big ways a data set can be large:
        There are a large number of elements in the set.
        Each element can have many features.
        There can be many clusters to discover
    Conclusion – Clustering can be huge, even when you distribute it.

Canopy Clustering
    Preliminary step to help parallelize computation.
    Clusters data into overlapping Canopies using super cheap di
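
The K-Means iterate loop from the slides can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions, not the MapReduce implementation the lecture goes on to describe. The function names (`k_means`, `nearest_center`, `mean_point`), the `max_iterations` cap, the stop-when-centers-stop-moving rule, and the choice to leave an empty cluster's center where it is are all assumptions added here. The initial centers are passed in explicitly, matching the "Choose k points to be cluster centers" step, and the distance measure is pluggable so that Euclidean or Manhattan distance can be swapped in.

```python
import math


def euclidean(a, b):
    """Euclidean (straight-line) distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (city-block) distance between two equal-length points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_center(point, centers, distance):
    """Index of the center closest to point (the first one wins on ties)."""
    return min(range(len(centers)), key=lambda i: distance(point, centers[i]))


def mean_point(members):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))


def k_means(points, initial_centers, distance=euclidean, max_iterations=100):
    """Hypothetical single-machine K-Means; k is len(initial_centers)."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iterations):
        # Compute distances from all points to all k centers and assign
        # each point to the nearest one: k * n distance evaluations.
        assignments = [nearest_center(p, centers, distance) for p in points]

        # Compute the average of the points assigned to each center.
        new_centers = []
        for i, old_center in enumerate(centers):
            members = [p for p, a in zip(points, assignments) if a == i]
            # Assumption: a center with no assigned points stays in place.
            new_centers.append(mean_point(members) if members else old_center)

        # Assumption: stop once the centers no longer move.
        if new_centers == centers:
            break
        # Replace the k centers with the new averages.
        centers = new_centers

    # Report assignments that are consistent with the final centers.
    assignments = [nearest_center(p, centers, distance) for p in points]
    return centers, assignments
```

Each pass through the loop performs k * n distance evaluations, which is where the k * n * O(distance metric) * num(iterations) cost on the complexity slide comes from. The result also depends on the starting centers: a poor choice can settle on a worse grouping, so the sketch leaves that choice to the caller.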