An Introduction to the Berkeley Data Analytics Stack
Reynold Xin
Parallel Programming With Apache Spark

What is Spark?
Fast and Expressive Cluster Computing System Compatible with Apache Hadoop
  Efficiency
    General execution graphs
    In-memory storage
  Usability
    Rich APIs in Java, Scala, Python
    Interactive shell
  Up to 10× faster on disk, 100× in memory
  2-10× less code

Project History
  Spark started in 2009, open sourced 2010
  In use at Intel, Yahoo!, Adobe, Alibaba Taobao, Conviva, Ooyala, Bizo and others
  Entered Apache Incubator in June

Open Source Community
  1300+ meetup members
  90+ code contributors
  20 companies contributing

This Talk
  Introduction to Spark
  Tour of Spark operations (in Python)
  Job execution
  Standalone apps

Key Idea
Write programs in terms of transformations on distributed datasets
Concept: resilient distributed datasets (RDDs)
  Collections of objects spread across a cluster
  Built through parallel transformations (map, filter, etc.)
  Automatically rebuilt on failure
  Controllable persistence (e.g. caching in RAM)

Operations
  Transformations (e.g. map, filter, groupBy)
    Lazy operations to build RDDs from other RDDs
  Actions (e.g. count, collect, save)
    Return a result or write it to storage

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

    lines = spark.textFile("hdfs://...")                     # Base RDD
    errors = lines.filter(lambda s: s.startswith("ERROR"))   # Transformed RDD
    messages = errors.map(lambda s: s.split("\t")[2])
    messages.cache()

    messages.filter(lambda s: "foo" in s).count()            # Action
    messages.filter(lambda s: "bar" in s).count()
    ...

[Figure: Driver exchanging tasks and results with three Workers; each Worker is shown with one input block (Block 1, Block 2, Block 3) and one cache (Cache 1, Cache 2, Cache 3).]

Result: full-text search of Wikipedia in 0.5 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5 sec (vs 180 sec for on-disk data)

Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data

    Ex: msgs = textFile.filter(lambda s: s.startswith("ERROR"))
                       .map(lambda s: s.split("\t")[2])

[Figure: lineage chain: HDFS File, then filter(func = _.contains(...)) producing Filtered RDD, then map(func = _.split(...)) producing Mapped RDD.]

Behavior with Less RAM
[Figure: chart not preserved in the extracted text.]

Spark in Scala and Java

    // Scala:
    val lines = sc.textFile(...)
    lines.filter(x => x.contains("ERROR")).count()

    // Java:
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      Boolean call(String s) {