Hadoop, Pig, and Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter
Wednesday, March 24, 2010

Introduction
‣ Hadoop Overview
‣ Why Pig?
‣ Evolution of Data Processing at Twitter
‣ Pig for Counting
‣ Pig for Correlating
‣ Pig for Research and Data Mining
‣ Conclusions and Next Steps

My Background
‣ Studied Mathematics and Physics at Harvard, Physics at Stanford
‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣ Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, ??? of data

Data is Getting Big
‣ NYSE: 1 TB/day
‣ Facebook: 20+ TB compressed/day
‣ CERN/LHC: 40 TB/day (15 PB/year!)
‣ And growth is accelerating
‣ Need multiple machines, horizontal scalability

Hadoop
‣ Distributed file system (hard to store a PB)
‣ Fault-tolerant, handles replication, node failure, etc.
‣ MapReduce-based parallel computation (even harder to process a PB)
‣ Generic key-value based computation interface allows for wide applicability
‣ Open source, top-level Apache project
‣ Scalable: Y! has a 4000-node cluster
‣ Powerful: sorted a TB of random integers in 62 seconds

MapReduce?
‣ Challenge: how many tweets per user, given tweets table?
‣ Input: key=row, value=tweet info
‣ Map: output key=user_id, value=1
‣ Shuffle: sort by user_id
‣ Reduce: for each user_id, sum
‣ Output: user_id, tweet count
‣ With 2x machines, runs close to 2x faster.
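
The map, shuffle, and reduce steps of the tweets-per-user challenge can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not Hadoop code and not the implementation used at Twitter: the sample rows, the `tweets` list, and the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `tweets_per_user` are all hypothetical. On a real cluster each phase would run in parallel across many machines.

```python
from collections import defaultdict

# Hypothetical sample input: (row_id, tweet_info) pairs standing in for the tweets table.
tweets = [
    (1, {"user_id": 12, "text": "hello"}),
    (2, {"user_id": 7, "text": "pig latin"}),
    (3, {"user_id": 12, "text": "hadoop"}),
    (4, {"user_id": 12, "text": "mapreduce"}),
]


def map_phase(rows):
    """Map: for each input row, emit key=user_id, value=1."""
    for _row_id, info in rows:
        yield info["user_id"], 1


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key and order the groups by user_id."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: for each user_id, sum the emitted values."""
    for user_id, values in grouped:
        yield user_id, sum(values)


def tweets_per_user(rows):
    """Chain the three phases and return (user_id, tweet_count) pairs."""
    return list(reduce_phase(shuffle_phase(map_phase(rows))))


print(tweets_per_user(tweets))  # [(7, 1), (12, 3)]
```

Each mapper only looks at its own rows and each reducer only at its own keys. That independence is what lets the same job spread over more machines.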