Mining of Massive Datasets

Anand Rajaraman
Kosmix, Inc.

Jeffrey D. Ullman
Stanford Univ.

Copyright © 2010, 2011 Anand Rajaraman and Jeffrey D. Ullman

Preface

This book evolved from material developed over several years by Anand Rajaraman and Jeff Ullman for a one-quarter course at Stanford. The course CS345A, titled “Web Mining,” was designed as an advanced graduate course, although it has become accessible and interesting to advanced undergraduates.

What the Book Is About

At the highest level of description, this book is about data mining. However, it focuses on data mining of very large amounts of data, that is, data so large it does not fit in main memory. Because of the emphasis on size, many of our examples are about the Web or data derived from the Web. Further, the book takes an algorithmic point of view: data mining is about applying algorithms to data, rather than using data to “train” a machine-learning engine of some sort. The principal topics covered are:

1. Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data.
2. Similarity search, including the key techniques of minhashing and locality-sensitive hashing.
3. Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.
4. The technology of search engines, including Google’s PageRank, link-spam detection, and the hubs-and-authorities approach.
5. Frequent-itemset mining, including association rules, market-baskets, the A-Priori Algorithm and its improvements.
6. Algorithms for clustering very large, high-dimensional datasets.
7. Two key problems for Web applications: managing advertising and recommendation systems.

Prerequisites

CS345A, although its number indicates an advanced graduate course, has been found accessible by advanced undergraduates and beginning masters students. In the future, it is likely that the course will be given a mezzanine-level number. The prerequisites for CS345A are:

1. The first course in database systems, covering application programming in SQL and other database-related languages such as XQuery.
2. A sophomore-level course in data structures, algorithms, and discrete math.
3. A sophomore-level course in software systems, software engineering, and programming lan