Information Integration on the Web

Adaptive Name Matching in Information Integration

Mikhail Bilenko and Raymond Mooney, University of Texas at Austin
William Cohen, Pradeep Ravikumar, and Stephen Fienberg, Carnegie Mellon University

Identifying approximately duplicate database records that refer to the same entity is essential for information integration. The authors compare and describe methods for ...

When you combine information from heterogeneous information sources, you must identify data records that refer to equivalent entities. However, records that describe the same object might differ syntactically; for example, the same person can be referred to as “William Jefferson Clinton” and “bill clinton.” Figure 1 presents more complex examples of duplicate records that are not identical.

Variations in representation across information sources can arise from differences in formats that store data, typographical and optical character recognition (OCR) errors, and abbreviations. Variations are particularly pronounced in data that’s automatically extracted from Web pages and unstructured or semistructured documents, making the matching task essential for information integration on the Web.

Researchers have investigated the problem of identifying duplicate objects under several monikers, including record linkage, merge-purge, duplicate detection, database hardening, identity uncertainty, coreference resolution, and name matching. Such diversity reflects research in several areas: statistics, databases, digital libraries, natural language processing, and data mining. The sidebar summarizes various traditional approaches to name matching.

... tant in duplicate identification. We examine a few effective and widely used metrics for measuring similarity.

Edit distance

An important class of such metrics are edit distances. Here, the distance between strings s and t is the cost of the best sequence of edit operations that converts s to t. For example, consider mapping the string s = “Willlaim” to t = “William” using these edit operations:

• Copy the next letter in s to the next position in t.
• Insert a new letter in t that does not appear in s.
• Substitute a different letter in t for the next letter in s.
• Delete the next letter in s; that is, don’t copy it to t.

Table 1 shows one possible sequence of operations ...
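The four edit operations described above (copy, insert, substitute, delete) can be sketched as a small dynamic program. This is only an illustration, not the authors' implementation: the function name `edit_script` is hypothetical, and the costs (0 for a copy, 1 for each of the other operations) are an assumption, since the text here does not state the costs it uses. The sketch returns the cost of a cheapest operation sequence together with one such sequence; other equally cheap sequences may exist.

```python
from typing import List, Tuple


def edit_script(s: str, t: str) -> Tuple[int, List[str]]:
    """Return (cost, operations) for one cheapest way to convert s into t.

    Assumed costs (not taken from the article): copy = 0,
    insert = substitute = delete = 1.
    """
    m, n = len(s), len(t)
    # dist[i][j] = cheapest cost of converting s[:i] into t[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i  # delete every letter of s[:i]
    for j in range(1, n + 1):
        dist[0][j] = j  # insert every letter of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = s[i - 1] == t[j - 1]
            dist[i][j] = min(
                dist[i - 1][j - 1] + (0 if same else 1),  # copy or substitute
                dist[i][j - 1] + 1,                       # insert
                dist[i - 1][j] + 1,                       # delete
            )

    # Walk back from the bottom-right cell to recover one optimal sequence.
    ops: List[str] = []
    i, j = m, n
    while i > 0 or j > 0:
        diag_ok = i > 0 and j > 0
        if diag_ok and s[i - 1] == t[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            ops.append(f"copy {s[i - 1]!r}")
            i, j = i - 1, j - 1
        elif diag_ok and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(f"substitute {t[j - 1]!r} for {s[i - 1]!r}")
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ops.append(f"insert {t[j - 1]!r}")
            j -= 1
        else:
            ops.append(f"delete {s[i - 1]!r}")
            i -= 1
    ops.reverse()
    return dist[m][n], ops


cost, ops = edit_script("Willlaim", "William")
print(cost)  # 2 under the assumed unit costs
for op in ops:
    print(op)
```

Under these assumed costs, converting “Willlaim” to “William” costs 2: the lengths differ by one, so at least one deletion is needed, and no single deletion alone yields “William”. Different cost assignments would change both the distance and the preferred sequence.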