Issues in Mining Imbalanced Data Sets - A Review Paper

Sofia Visa
ECECS Department, ML 0030
University of Cincinnati
Cincinnati, OH 45221, USA
svisa@ececs.uc.edu

Anca Ralescu
ECECS Department, ML 0030
University of Cincinnati
Cincinnati, OH 45221, USA
Anca.Ralescu@uc.edu

Abstract

This paper traces some of the recent progress in the field of learning from imbalanced data. It reviews approaches adopted for this problem, identifies challenges, and points out future directions in this relatively new field.

Introduction

Learning with imbalanced class distributions addresses the case when, for a two-class classification problem, the training data for one class (the majority) greatly outnumbers the training data for the other class (the minority). Recently, the machine learning community has acknowledged that current learning methods (e.g. C4.5, NN) perform poorly in applications dealing with imbalanced data sets (IDS). On the other hand, it has been observed that in many real-world domains the available data sets are imbalanced. In the literature, the IDS problem is also known as dealing with rare classes, or with skewed data.

The poor performance of the classifiers produced by the standard machine learning algorithms on IDS is mainly due to the following factors:

Example 2 (Training and testing distributions are rarely the same) Class distribution is an important issue for learning in general. The training data might be imbalanced while the testing data is not, or the other way around. However, experimental studies show that a balanced class distribution is not the best for learning (Weiss & Provost 2003), (Visa & Ralescu 2005), and the open question for further research is: What is the best class distribution for learning a given task?

Example 3 (In applications, error costs are different) In applications the error costs are different: consider cancer versus non-cancer, fraud versus valid action, or system OK versus system failure. If the error costs and the class distribution are known, the correct threshold can be computed easily. The difficulty is that error costs are hard to assess even for human experts in the field, and therefore these costs are rarely known. Further, it is important to mention that, when the errors coming from different classes have different but unknown costs, classifiers have problems even on balanced data.

In order to give a comprehensi
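Example 3 states that the threshold is easy to compute once the error costs and the class distribution are known. A minimal sketch of that computation follows, assuming the standard expected-cost decision rule with zero cost for correct decisions; it is not taken from the paper. The function names and the cost and prior values are hypothetical, chosen only for illustration. The second function also covers the situation of Example 2, where the class prior at test time differs from the one seen in training.

```python
def cost_threshold(cost_fp: float, cost_fn: float) -> float:
    """Posterior threshold for predicting the positive (minority) class.

    Assumes correct decisions cost nothing. Predicting positive has the
    lower expected cost when P(positive | x) >= cost_fp / (cost_fp + cost_fn).
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def adjust_posterior(p: float, train_prior: float, test_prior: float) -> float:
    """Re-weight P(positive | x) from the training prior to the test prior.

    Both priors are the positive-class proportions and must lie strictly
    between 0 and 1. This is the usual Bayes prior-shift correction.
    """
    for prior in (train_prior, test_prior):
        if not 0.0 < prior < 1.0:
            raise ValueError("priors must lie strictly between 0 and 1")
    pos = p * test_prior / train_prior
    neg = (1.0 - p) * (1.0 - test_prior) / (1.0 - train_prior)
    return pos / (pos + neg)


def decide(p: float, cost_fp: float, cost_fn: float) -> bool:
    """Return True when the positive class should be predicted."""
    return p >= cost_threshold(cost_fp, cost_fn)


# Hypothetical costs: missing a positive case is 9 times as costly
# as raising a false alarm.
threshold = cost_threshold(cost_fp=1.0, cost_fn=9.0)

# Hypothetical priors: the model was trained on balanced data but is
# deployed where only 10% of cases are positive.
p_deployed = adjust_posterior(0.5, train_prior=0.5, test_prior=0.1)
```

With unequal costs the threshold moves away from 0.5 toward the cheaper error, which is why a classifier tuned for plain accuracy can be a poor choice when the costs differ. As the text notes, the practical obstacle is that the costs themselves are rarely known.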