ICPSR Summer Program, 2012

Data Mining Tools for Exploring Large Datasets

Robert Stine
Department of Statistics
Wharton School, University of Pennsylvania
www-stat.wharton.upenn.edu/~stine

Modern data mining combines familiar and novel statistical methods to identify reproducible patterns in wide datasets. Wide datasets are distinguished by having many columns (or variables), often more columns than rows. The objective in this domain is prediction. If you can predict new data accurately or better than alternatives, then you’ve made a contribution. Rather than building a model that relates one or two experimental results to a response, data mining involves searching for patterns. Such searches may scan thousands of features, looking for the few that are predictive of the response. The search might be entirely automated or allow expert insight. Once a vile thing to be accused of, data mining has become respectable, useful, and necessary.

These lectures introduce data mining through a combination of lectures and examples. You’ll see examples that look for patterns in voting behavior, patients at risk of a disease, prospective job candidates, and credit applications that reveal fraud. In each illustration, the goal is prediction. Rather than interpret a pattern found in one set of data, the objective is to predict new data. Interpretation is fun, but we’ll exercise considerable restraint to avoid confusing association with causation. Even if you stick to simple models, concepts from data mining can help determine whether you’ve missed an important feature of your data.

Data mining does not require exotic hardware or software. Today’s PC would have been a supercomputer in 1999. You can explore large datasets quite well with nothing more than regression and a laptop. That’s basically what we’ll do in the first week of class. Once you grasp the fundamentals, you’ll appreciate the strengths and weaknesses of exotic methods. We’ll start with regression, and then logistic regression, classification and regression trees, and a bit of neural networks and cluster analysis.

You need to do data mining to learn data mining. For this class, we’ll use a combination of R and JMP from SAS. JMP handles very large datasets and includes an extensive collection of algorithms for building and assessing regression models with data mining tools suc