Latent Fault Detection in Large Scale Services

Moshe Gabel, Assaf Schuster
Department of Computer Science
Technion – Israel Institute of Technology
Haifa, Israel
{mgabel, assaf}@cs.technion.ac.il

Ran Gilad-Bachrach, Nikolaj Bjørner
Microsoft Research
Microsoft
Redmond, WA, USA
{rang, nbjorner}@microsoft.com

Abstract: Unexpected machine failures, with their resulting service outages and data loss, pose challenges to datacenter management. Existing failure detection techniques rely on domain knowledge, precious (often unavailable) training data, textual console logs, or intrusive service modifications. We hypothesize that many machine failures are not a result of abrupt changes but rather a result of a long period of degraded performance. This is confirmed in our experiments, in which over 20% of machine failures were preceded by such latent faults. We propose a proactive approach for failure prevention. We present a novel framework for statistical latent fault detection using only ordinary machine counters collected as standard practice. We demonstrate three detection methods within this framework. Derived tests are domain-independent and unsupervised, require neither background information nor tuning, and scale to very large services. We prove strong guarantees on the false positive rates of our tests.

Index Terms: fault detection; web services; statistical analysis; distributed computing; statistical learning

is crossed, an action is triggered. These actions range from notifying the system operator to automatic recovery attempts.

Rule-based failure detection suffers from several key problems. Thresholds must be made low enough that faults will not go unnoticed. At the same time they should be set high enough to avoid spurious detections. However, since the workload changes over time, no fixed threshold is adequate. Moreover, different services, or even different versions of the same service, may have different operating points. Therefore, maintaining the rules requires constant, manual adjustments, often done only after a "postmortem" examination.

Others have noticed the shortcomings of these rule-based approaches. [8], [9] proposed training a detector on historic annotated data. However, such approaches fall short due to the difficulty in obtaining this data, as well as the sensitivity