资源描述:
《生物信息导论英文论文practical suffix tree construction 生物信息导论英文论文-practical suffix tree construction》由会员上传分享,免费在线阅读,更多相关内容在学术论文-天天文库。
1、PracticalSuffixTreeConstructionSandeepTataRichardA.HankinsJigneshM.PatelUniversityofMichigan24AbstractLargestringdatasetsarecommoninanumberofemergingtextandbiologicaldatabaseapplications.Commonqueriesoversuchdatasetsincludebothexactandapproximatestringmatches.Thesequeriescanbeevaluatedveryefficien
2、tlybyusingasuffixtreeindexonthestringdataset.Althoughsuffixtreescanbeconstructedquicklyinmemoryforsmallinputdatasets,constructingpersistenttreesforlargedatasetshasbeenchallenging.Inthispaper,weexploresuffixtreeconstructionalgorithmsoverawidespectrumofdatasourcesandsizes.First,weshowthatonmodernpro
3、cessors,acache-efficientalgorithmwithO(n2)complexityoutperformsthepopularO(n)Ukkonenalgorithm,evenforin-memoryconstruction.Forlargerdatasets,thediskI/Orequirementquicklybecomesthebottleneckineachalgorithm’sperformance.Toaddressthisproblem,wepresentabuffermanagementstrategyfortheO(n2)algorithm,crea
4、tinganewdisk-basedconstructionalgorithmthatscalestosizesmuchlargerthanhavebeenpreviouslydescribedintheliterature.Ourapproachfaroutperformsthebestknowndiskbasedconstructionalgorithms.1IntroductionQueryinglargestringdatasetsisbecomingincreasinglyimportantinanumberofemergingtextandlifesciencesapplica
5、tions.Lifescienceresearchersareofteninterestedinexplorativequeryingoflargebiologicalsequencedatabases,suchasgenomesandlargesetsofproteinsequences.Manyofthesebiologicaldatasetsaregrowingatexponentialrates—forexample,thesizesofthesequencedatasetsinGenBankhavebeendoublingeverysix-Permissiontocopywith
6、outfeeallorpartofthismaterialisgrantedprovidedthatthecopiesarenotmadeordistributedfordirectcommercialadvantage,theVLDBcopyrightnoticeandthetitleofthepublicationanditsdateappear,andnoticeisgiventhatcopyingisbypermissionoftheVeryLargeDataBaseEndowment.Tocopyotherwise,ortorepublish,requiresafeeand/or
7、specialpermissionfromtheEndowment.Proceedingsofthe30thVLDBConference,Toronto,Canada,2004teenmonths[31].Consequently,methodsforefficientlyqueryinglargestringdatasetsarecriticaltothe24successoftheseemergingdatabase