Data Mining: Concepts and Techniques
Chapter 2
2021/8/14

Chapter 2: Data Preprocessing
  Why preprocess the data?
  Descriptive data summarization
  Data cleaning
  Data integration and transformation
  Data reduction
  Discretization and concept hierarchy generation
  Summary

Why Data Preprocessing?
  Data in the real world is dirty:
    incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
      e.g., occupation = “”
    noisy: containing errors or outliers
      e.g., Salary = “-10”
    inconsistent: containing discrepancies in codes or names
      e.g., Age = “42”, Birthday = “03/07/1997”
      e.g., was rating “1, 2, 3”, now rating “A, B, C”
      e.g., discrepancy between duplicate records
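
The three kinds of dirty data listed above can be illustrated with a small check. This is only a sketch under stated assumptions: the record layout, the function name find_issues, the reading of “03/07/1997” as 7 March 1997, and the use of the slide date as the reference day are all hypothetical choices made for illustration, not part of the original slides.

```python
from datetime import date

# Hypothetical records; the field names mirror the slide's examples.
records = [
    {"occupation": "", "salary": -10, "age": 42, "birthday": date(1997, 3, 7)},
    {"occupation": "engineer", "salary": 5200, "age": 24, "birthday": date(1997, 3, 7)},
]

# Assumed reference day for computing age (the date shown on the slides).
REFERENCE_DAY = date(2021, 8, 14)


def find_issues(record, today):
    """Return a list of data-quality problems found in one record."""
    issues = []

    # Incomplete: an attribute value is missing.
    if not record["occupation"]:
        issues.append("incomplete: occupation is empty")

    # Noisy: the value is an error (a salary cannot be negative).
    if record["salary"] < 0:
        issues.append("noisy: salary is negative")

    # Inconsistent: the stored age disagrees with the age derived from birthday.
    born = record["birthday"]
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    derived_age = today.year - born.year - (0 if had_birthday else 1)
    if derived_age != record["age"]:
        issues.append("inconsistent: age does not match birthday")

    return issues


for record in records:
    print(find_issues(record, REFERENCE_DAY))
```

The first record triggers all three checks, while the second passes them. Real rules would depend on the domain; the thresholds here are deliberately simplistic.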
Why Is Data Dirty?
  Incomplete data may come from:
    “Not applicable” data value when collected
    Different considerations between the time when the data was collected and when it is analyzed
    Human/hardware/software problems
  Noisy data (incorrect values) may come from:
    Faulty data collection instruments
    Human or computer error at data entry
    Errors in data transmission
  Inconsistent data may come from:
    Different data sources
    Functional dependency violation (e.g., modify some linked data)
  Duplicate records also need data cleaning
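
As a rough illustration of why duplicate records need cleaning, the sketch below separates exact duplicates (safe to drop) from records that share a key but disagree on a value (a discrepancy to resolve). The rows, the (customer_id, name, rating) layout, and the function name split_duplicates are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical rows: (customer_id, name, rating).
rows = [
    ("C001", "Ann Lee", "A"),
    ("C002", "Bo Chen", "2"),
    ("C001", "Ann Lee", "A"),  # exact duplicate of the first row
    ("C002", "Bo Chen", "B"),  # same key, but the rating uses a different code
]


def split_duplicates(rows):
    """Drop exact duplicates, then report keys whose remaining rows disagree."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:  # keep only the first copy of identical rows
            seen.add(row)
            unique.append(row)

    by_key = defaultdict(list)
    for row in unique:
        by_key[row[0]].append(row)  # group by customer_id

    # A key with more than one distinct row is an inconsistency.
    conflicts = {key: group for key, group in by_key.items() if len(group) > 1}
    return unique, conflicts


unique, conflicts = split_duplicates(rows)
print(unique)
print(conflicts)
```

Exact duplicates can be removed mechanically, but conflicting rows such as the two ratings for C002 need a rule (or a person) to decide which value is correct.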
Why Is Data Preprocessing Important?
  No quality data, no quality mining results!
    Quality decisions must be based on quality data
      e.g., duplicate or missing data may cause incorrect or even misleading statistics
    Data warehouse needs consistent integration of quality data
  Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Major Tasks in Data Preprocessing
  Data cleaning
    Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
  Data integration
    Integration of multiple databases, data cubes, or files
  Data transformation
    Normalization and aggregation
  Data reduction
    Obtains reduced representation in volume but produces the same or similar analytical results
  Data discretization
    Part of dat