CS276B Text Information Retrieval, Mining, and Exploitation
Lecture 5
23 January 2003

Recap

Today's topics
  Feature selection for text classification
  Measuring classification performance
  Nearest neighbor categorization

Feature Selection: Why?
  Text collections have a large number of features
    10,000–1,000,000 unique words, and more
  Make using a particular classifier feasible
    Some classifiers can't deal with 100,000s of features
  Reduce training time
    Training time for some methods is quadratic or worse in the number of features (e.g., logistic regression)
  Improve generalization
    Eliminate noise features
    Avoid overfitting

Recap: Feature Reduction
  Standard ways of reducing feature space for text
    Stemming
      Laugh, laughs, laughing, laughed -> laugh
    Stop word removal
      E.g., eliminate all prepositions
    Conversion to lower case
    Tokenization
      Break on all special characters: fire-fighter -> fire, fighter

Feature Selection
  Yang and Pedersen 1997
  Comparison of different selection criteria
    DF: document frequency
    IG: information gain
    MI: mutual information
    CHI: chi square
  Common strategy
    Compute statistic for each term
    Keep n terms with highest value of this statistic

Information Gain

(Pointwise) Mutual Information

Chi-Square

                                          Term present    Term absent
  Document belongs to category                 A               B
  Document does not belong to category         C               D

  X^2 = N (AD - BC)^2 / ((A+B) (A+C) (B+D) (C+D))
  Use either maximum or average X^2
  Value for complete independence?

Document Frequency
  Number of documents a term occurs in
  Is sometimes used for eliminating both very frequent and very infrequent terms
  How is the document frequency measure different from the other 3 measures?

Yang & Pedersen: Experiments
  Two classification methods
    kNN (k nearest neighbors; more later)
    Linear Least Squares Fit
      Regression method
  Collections
    Reuters-22173
      92 categories
      16,000 unique terms
    Ohsumed: subset of Medline
      14,000 categories
      72,000 unique terms
  Ltc term weighting

Yang & Pedersen: Experiments
  Choose feature set size
  Preprocess collection, discarding non-selected features / words
  Apply term weighting -> feature vector for each document
  Train classifier on training set
  Evaluate classifier on test set

Discussion
  You can eliminate 90% of features for IG, DF, and CHI without decreasing performance.
  In fact, performance increases with fewer features for IG, DF, and CHI.
  Mutual
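
As an illustration of the common strategy from the Feature Selection slide (compute a statistic for each term, keep the n terms with the highest value), here is a minimal Python sketch that uses the chi-square statistic with the A, B, C, D cell layout from the Chi-Square slide. The function names `chi_square` and `select_terms`, the representation of each document as a set of terms, and the one-label-per-document simplification are all assumptions made for this example; they are not taken from the lecture or from Yang and Pedersen's experimental setup.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square for a 2x2 table, using the slide's cell layout.

    a: documents in the category that contain the term
    b: documents in the category that do not contain the term
    c: documents outside the category that contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        # An empty row or column: the statistic is undefined, so score it as 0
        # (a choice made for this sketch).
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_terms(docs, labels, category, n_keep):
    """Return the n_keep terms with the highest chi-square score for `category`.

    docs:   list of sets of terms (hypothetical input format)
    labels: list with one category label per document (a simplification)
    """
    in_cat = sum(1 for label in labels if label == category)
    out_cat = len(labels) - in_cat

    with_term_in = Counter()   # term -> A
    with_term_out = Counter()  # term -> C
    for terms, label in zip(docs, labels):
        target = with_term_in if label == category else with_term_out
        target.update(terms)

    scores = {}
    for term in set(with_term_in) | set(with_term_out):
        a = with_term_in[term]
        c = with_term_out[term]
        scores[term] = chi_square(a, in_cat - a, c, out_cat - c)

    # Highest score first; ties are broken alphabetically so the result is deterministic.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:n_keep]


docs = [
    {"goal", "match"},
    {"goal", "team"},
    {"vote", "team"},
    {"vote", "election"},
]
labels = ["sport", "sport", "politics", "politics"]
selected = select_terms(docs, labels, "sport", 2)
```

In this toy collection, "team" occurs equally often inside and outside the category, so AD - BC = 0 and its score is 0; "goal" and "vote" separate the two classes perfectly and rank highest. With more than two categories, the per-category scores of a term would still have to be combined, for example by the maximum or the average as the slide suggests.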