Framework for Evaluating Clustering Algorithms in Duplicate Detection∗

Oktie Hassanzadeh, University of Toronto, oktie@cs.toronto.edu
Fei Chiang†, University of Toronto, fchiang@cs.toronto.edu
Hyun Chul Lee, Thoora Inc., chul.lee@thoora.com
Renée J. Miller, University of Toronto, miller@cs.toronto.edu

ABSTRACT

The presence of duplicate records is a major data quality concern in large databases. To detect duplicates, entity resolution, also known as duplication detection or record linkage, is used as a part of the data cleaning process to identify records that potentially refer to the same real-world entity. We present the Stringer system that provides an evaluation framework for understanding what barriers remain towards the goal of truly scalable and general purpose duplication detection algorithms. In this paper, we use Stringer to evaluate the quality of the clusters (groups of potential duplicates) obtained from several unconstrained clustering algorithms used in concert with approximate join techniques. Our work is motivated by the recent significant advancements that have made approximate join algorithms highly scalable. Our extensive evaluation reveals that some clustering algorithms that have never been considered for duplicate detection perform extremely well in terms of both accuracy ...

... detection task. In this paper, we present a thorough experimental comparison of clustering approaches from all these areas.

Our work is motivated by the recent exciting advancements that have made approximate join algorithms highly scalable [3, 7, 15, 41, 43]. These innovations lend hope to the idea that duplicate detection can be made sufficiently scalable and general purpose to be introduced as a generic, data-independent operator within a DBMS. In this paper, we describe the Stringer system¹ that provides an evaluation framework for understanding what barriers remain towards the goal of truly scalable and general purpose duplication detection algorithms. Our focus in this paper is on using Stringer to understand which clustering algorithms can be used in concert with scalable approximate join algorithms to produce duplicate detection algorithms that are robust with respect to the threshold used for the approximate join, and various data characteristics including the amount and distribution of duplicates.
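To make the pipeline described above concrete, the following is a minimal sketch of an approximate join followed by an unconstrained clustering step. It is an illustration under stated assumptions, not the Stringer implementation: the function names (`qgrams`, `jaccard`, `approximate_join`, `cluster_by_components`) are hypothetical, Jaccard similarity over character bigrams stands in for whatever similarity measure the join actually uses, the join is a naive all-pairs comparison rather than a scalable algorithm, and connected components is only one simple clustering choice.

```python
from itertools import combinations


def qgrams(text, q=2):
    """Return the set of lowercase character q-grams of text (assumed tokenization)."""
    s = text.lower()
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def approximate_join(records, threshold):
    """Naive all-pairs join: index pairs whose similarity is >= threshold."""
    grams = [qgrams(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(grams[i], grams[j]) >= threshold
    ]


def cluster_by_components(n, pairs):
    """Group record indices 0..n-1 into connected components (union-find)."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical toy data: the first two records are intended duplicates.
records = ["John Smith", "Jon Smith", "Mary Jones"]
pairs = approximate_join(records, threshold=0.5)
clusters = cluster_by_components(len(records), pairs)
```

With this toy input, a threshold of 0.5 groups only the first two records, while lowering it to 0.1 chains all three records into a single cluster. That sensitivity to the join threshold is the kind of robustness question the evaluation framework is meant to examine.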