Nonparametric Statistics with Applications to Science and Engineering

Paul H. Kvam
Georgia Institute of Technology
The H. Milton Stewart School of Industrial and Systems Engineering
Atlanta, GA

Brani Vidakovic
Georgia Institute of Technology and Emory University School of Medicine
The Wallace H. Coulter Department of Biomedical Engineering
Atlanta, GA

Wiley-Interscience
A John Wiley & Sons, Inc., Publication
Copyright © 2007 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.
Library of Congress Cataloging-in-Publication Data is available.
ISBN 978-0-470-08147-1
Printed in the United States of America.

Contents

Preface

1 Introduction
  1.1 Efficiency of Nonparametric Methods
  1.2 Overconfidence Bias
  1.3 Computing with MATLAB
  1.4 Exercises
  References

2 Probability Basics
  2.1 Helpful Functions
  2.2 Events, Probabilities and Random Variables
  2.3 Numerical Characteristics of Random Variables
  2.4 Discrete Distributions
  2.5 Continuous Distributions
  2.6 Mixture Distributions
  2.7 Exponential Family of Distributions
  2.8 Stochastic Inequalities
  2.9 Convergence of Random Variables
  2.10 Exercises
  References

3 Statistics Basics
  3.1 Estimation
  3.2 Empirical Distribution Function
  3.3 Statistical Tests
  3.4 Exercises
  References

4 Bayesian Statistics
  4.1 The Bayesian Paradigm
  4.2 Ingredients for Bayesian Inference
  4.3 Bayesian Computation and Use of WinBUGS
  4.4 Exercises
  References

5 Order Statistics
  5.1 Joint Distributions of Order Statistics
  5.2 Sample Quantiles
  5.3 Tolerance Intervals
  5.4 Asymptotic Distributions of Order Statistics
  5.5 Extreme Value Theory
  5.6 Ranked Set Sampling
  5.7 Exercises
  References

6 Goodness of Fit
  6.1 Kolmogorov-Smirnov Test Statistic
  6.2 Smirnov Test to Compare Two Distributions
  6.3 Specialized Tests
  6.4 Probability Plotting
  6.5 Runs Test
  6.6 Meta Analysis
  6.7 Exercises
  References

7 Rank Tests
  7.1 Properties of Ranks
  7.2 Sign Test
  7.3 Spearman Coefficient of Rank Correlation
  7.4 Wilcoxon Signed Rank Test
  7.5 Wilcoxon (Two-Sample) Sum Rank Test
  7.6 Mann-Whitney U Test
  7.7 Test of Variances
  7.8 Exercises
  References

8 Designed Experiments
  8.1 Kruskal-Wallis Test
  8.2 Friedman Test
  8.3 Variance Test for Several Populations
  8.4 Exercises
  References

9 Categorical Data
  9.1 Chi-square and Goodness-of-Fit
  9.2 Contingency Tables
  9.3 Fisher Exact Test
  9.4 McNemar Test
  9.5 Cochran's Test
  9.6 Mantel-Haenszel Test
  9.7 CLT for Multinomial Probabilities
  9.8 Simpson's Paradox
  9.9 Exercises
  References

10 Estimating Distribution Functions
  10.1 Introduction
  10.2 Nonparametric Maximum Likelihood
  10.3 Kaplan-Meier Estimator
  10.4 Confidence Interval for F
  10.5 Plug-in Principle
  10.6 Semi-Parametric Inference
  10.7 Empirical Processes
  10.8 Empirical Likelihood
  10.9 Exercises
  References

11 Density Estimation
  11.1 Histogram
  11.2 Kernel and Bandwidth
  11.3 Exercises
  References

12 Beyond Linear Regression
  12.1 Least Squares Regression
  12.2 Rank Regression
  12.3 Robust Regression
  12.4 Isotonic Regression
  12.5 Generalized Linear Models
  12.6 Exercises
  References

13 Curve Fitting Techniques
  13.1 Kernel Estimators
  13.2 Nearest Neighbor Methods
  13.3 Variance Estimation
  13.4 Splines
  13.5 Summary
  13.6 Exercises
  References

14 Wavelets
  14.1 Introduction to Wavelets
  14.2 How Do the Wavelets Work?
  14.3 Wavelet Shrinkage
  14.4 Exercises
  References

15 Bootstrap
  15.1 Bootstrap Sampling
  15.2 Nonparametric Bootstrap
  15.3 Bias Correction for Nonparametric Intervals
  15.4 The Jackknife
  15.5 Bayesian Bootstrap
  15.6 Permutation Tests
  15.7 More on the Bootstrap
  15.8 Exercises
  References

16 EM Algorithm
  16.1 Fisher's Example
  16.2 Mixtures
  16.3 EM and Order Statistics
  16.4 MAP via EM
  16.5 Infection Pattern Estimation
  16.6 Exercises
  References

17 Statistical Learning
  17.1 Discriminant Analysis
  17.2 Linear Classification Models
  17.3 Nearest Neighbor Classification
  17.4 Neural Networks
  17.5 Binary Classification Trees
  17.6 Exercises
  References

18 Nonparametric Bayes
  18.1 Dirichlet Processes
  18.2 Bayesian Categorical Models
  18.3 Infinitely Dimensional Problems
  18.4 Exercises
  References

A MATLAB
  A.1 Using MATLAB
  A.2 Matrix Operations
  A.3 Creating Functions in MATLAB
  A.4 Importing and Exporting Data
  A.5 Data Visualization
  A.6 Statistics

B WinBUGS
  B.1 Using WinBUGS
  B.2 Built-in Functions

MATLAB Index
Author Index
Subject Index
Preface

Danger lies not in what we don't know, but in what we think we know that just ain't so.
Mark Twain (1835-1910)

As Prefaces usually start, the author(s) explain why they wrote the book in the first place, and we will follow this tradition. Both of us taught the graduate course on nonparametric statistics at the School of Industrial and Systems Engineering at Georgia Tech (ISyE 6404) several times. The audience was always varied: PhD students in Engineering Statistics, Electrical Engineering, Management, Logistics, and Physics, to list a few. While comprising a nonhomogeneous group, all of the students had the solid mathematical, programming and statistical training needed to benefit from the course. Given such a nonstandard class, the text selection was anything but easy.

There are plenty of excellent monographs and texts dealing with nonparametric statistics, such as the encyclopedic book by Hollander and Wolfe, Nonparametric Statistical Methods, or the excellent evergreen book by Conover, Practical Nonparametric Statistics. We used as a text the 3rd edition of Conover's book, which is mainly concerned with what most of us think of as traditional nonparametric statistics: proportions, ranks, categorical data, goodness of fit, and so on, with the understanding that the text would be supplemented by the instructor's handouts. Both of us ended up supplying an increasing number of handouts every year, for units such as density and function estimation, wavelets, Bayesian approaches to nonparametric problems, the EM algorithm, splines, machine learning, and other arguably modern nonparametric topics. About a year ago, we decided to merge the handouts and fill the gaps.

There are several novelties this book provides. We decided to intertwine informal comments that might be amusing, but tried to have a good balance. One could easily get carried away and produce a preface similar to that of the celebrated Statistical Theory of Reliability and Life Testing: Probability Models by Barlow and Proschan, who acknowledge greedy spouses and obnoxious children as an impetus to their book writing. In this spirit, we featured photos and sometimes biographic details of statisticians who made fundamental contributions to the field of nonparametric statistics, such as Karl Pearson, Nathan Mantel, Brad Efron, and Baron Von Munchausen.

Computing. Another specificity is the choice of computing support. The book is integrated with MATLAB®, and for many procedures covered in this book, MATLAB's m-files or their core parts are featured. The choice of software was natural: engineers, scientists, and increasingly statisticians are communicating in the "MATLAB language." This language is, for example, taught at Georgia Tech in a core computing course that every freshman engineering student takes, and almost everybody around us "speaks MATLAB." The book's website, http://www2.isye.gatech.edu/NPbook, contains most of the m-files and programming supplements, easy to trace and download. For Bayesian calculation we used WinBUGS, free software from Cambridge's Biostatistics Research Unit. Both MATLAB and WinBUGS are briefly covered in two appendices for readers less familiar with them.

Outline of Chapters. For a typical graduate student to cover the full breadth of this textbook, two semesters would be required. For a one-semester course, the instructor should necessarily cover Chapters 1-3 and 5-9 to start. Depending on the scope of the class, the last part of the course can include different chapter selections.

Chapters 2-4 contain important background material the student needs to understand in order to effectively learn and apply the methods taught in a nonparametric analysis course. Because the ranks of observations have special importance in a nonparametric analysis, Chapter 5 presents basic results for order statistics and includes statistical methods to create tolerance intervals.

Traditional topics in estimation and testing are presented in Chapters 7-10 and should receive emphasis even for students who are most curious about advanced topics such as density estimation (Chapter 11), curve fitting (Chapter 13) and wavelets (Chapter 14). These topics include a core of rank tests that are analogous to common parametric procedures (e.g., t-tests, analysis of variance).

Basic methods of categorical data analysis are contained in Chapter 9. Although most students in the biological sciences are exposed to a wide variety of statistical methods for categorical data, engineering students and other students in the physical sciences typically receive less schooling in this quintessential branch of statistics. Topics include methods based on tabled data, chi-square tests and the introduction of general linear models. Also included in the first part of the book is the topic of "goodness of fit" (Chapter 6), which refers to testing data not in terms of some unknown parameters, but the unknown distribution that generated it. In a way, goodness of fit represents an interface between distribution-free methods and traditional parametric methods of inference, and both analytical and graphical procedures are presented. Chapter 10 presents the nonparametric alternative to maximum likelihood estimation and likelihood ratio based confidence intervals.

The term "regression" is familiar from your previous course that introduced you to statistical methods. Nonparametric regression provides an alternative method of analysis that requires fewer assumptions about the response variable. In Chapter 12 we use the regression platform to introduce other important topics that build on linear regression, including isotonic (constrained) regression, robust regression and generalized linear models. In Chapter 13, we introduce more general curve fitting methods. Regression models based on wavelets are presented in a separate chapter (Chapter 14).

In the latter part of the book, emphasis is placed on nonparametric procedures that are becoming more relevant to engineering researchers and practitioners. Beyond the conspicuous rank tests, this text includes many of the newest nonparametric tools available to experimenters for data analysis. Chapter 17 introduces fundamental topics of statistical learning as a basis for data mining and pattern recognition, and includes discriminant analysis, nearest-neighbor classifiers, neural networks and binary classification trees. Computational tools needed for nonparametric analysis include bootstrap resampling (Chapter 15) and the EM Algorithm (Chapter 16). Bootstrap methods, in particular, have become indispensable for uncertainty analysis with large data sets and elaborate stochastic models.

The textbook also unabashedly includes a review of Bayesian statistics and an overview of nonparametric Bayesian estimation. If you are familiar with Bayesian methods, you might wonder what role they play in nonparametric statistics. Admittedly, the connection is not obvious, but in fact nonparametric Bayesian methods (Chapter 18) represent an important set of tools for complicated problems in statistical modeling and learning, where many of the models are nonparametric in nature.

The book is intended both as a reference text and a text for a graduate course. We hope the reader will find this book useful. All comments, suggestions, updates, and critiques will be appreciated.

Acknowledgments. Before anyone else we would like to thank our wives, Lori Kvam and Draga Vidakovic, and our families. Reasons they tolerated our disorderly conduct during the writing of this book are beyond us, but we love them for it. We are especially grateful to Bin Shi, who supported our use of MATLAB and wrote helpful coding and text for Appendix A. We are grateful to the MathWorks Statistics team, especially to Tom Lane, who suggested numerous improvements and updates in that appendix. Several individuals have helped to improve on the primitive drafts of this book, including Saroch Boonsiripant, Lulu Kang, Hee Young Kim, Jongphil Kim, Seoung Bum Kim, Kichun Lee, and Andrew Smith. Finally, we thank Wiley's team, Melissa Yanuzzi, Jacqueline Palmieri and Steve Quigley, for their kind assistance.

PAUL H. KVAM
School of Industrial and Systems Engineering
Georgia Institute of Technology

BRANI VIDAKOVIC
School of Biomedical Engineering
Georgia Institute of Technology
1 Introduction

For every complex question, there is a simple answer... and it is wrong.
H. L. Mencken

Jacob Wolfowitz (Figure 1.1a) first coined the term nonparametric, saying "We shall refer to this situation [where a distribution is completely determined by the knowledge of its finite parameter set] as the parametric case, and denote the opposite case, where the functional forms of the distributions are unknown, as the non-parametric case" (Wolfowitz, 1942). From that point on, nonparametric statistics was defined by what it is not: traditional statistics based on known distributions with unknown parameters. Randles, Hettmansperger, and Casella (2004) extended this notion by stating "nonparametric statistics can and should be broadly defined to include all methodology that does not use a model based on a single parametric family."

Traditional statistical methods are based on parametric assumptions; that is, that the data can be assumed to be generated by some well-known family of distributions, such as normal, exponential, Poisson, and so on. Each of these distributions has one or more parameters (e.g., the normal distribution has $\mu$ and $\sigma^2$), at least one of which is presumed unknown and must be inferred. The emphasis on the normal distribution in linear model theory is often justified by the central limit theorem, which guarantees approximate normality of sample means provided the sample sizes are large enough. Other distributions also play an important role in science and engineering. Physical failure mechanisms often characterize the lifetime distribution of industrial components (e.g., Weibull or lognormal), so parametric methods are important in reliability engineering.

Fig. 1.1 (a) Jacob Wolfowitz (1910-1981) and (b) Wassily Hoeffding (1914-1991), pioneers in nonparametric statistics.

However, with complex experiments and messy sampling plans, the generated data might not be attributed to any well-known distribution. Analysts limited to basic statistical methods can be trapped into making parametric assumptions about the data that are not apparent in the experiment or the data. In the case where the experimenter is not sure about the underlying distribution of the data, statistical techniques are needed which can be applied regardless of the true distribution of the data. These techniques are called nonparametric methods, or distribution-free methods.

The terms nonparametric and distribution-free are not synonymous... Popular usage, however, has equated the terms... Roughly speaking, a nonparametric test is one which makes no hypothesis about the value of a parameter in a statistical density function, whereas a distribution-free test is one which makes no assumptions about the precise form of the sampled population.
J. V. Bradley (1968)

It can be confusing to understand what is implied by the word "nonparametric". What is termed modern nonparametrics includes statistical models that are quite refined, except the distribution for error is left unspecified. Wasserman's recent book All of Nonparametric Statistics (Wasserman, 2006) emphasizes only modern topics in nonparametric statistics, such as curve fitting, density estimation, and wavelets. Conover's Practical Nonparametric Statistics (Conover, 1999), on the other hand, is a classic nonparametrics textbook, but mostly limited to traditional binomial and rank tests, contingency tables, and tests for goodness of fit. Topics that are not really under the distribution-free umbrella, such as robust analysis, Bayesian analysis, and statistical learning, also have important connections to nonparametric statistics, and are all featured in this book. Perhaps this text could have been titled A Bit Less of Parametric Statistics with Applications in Science and Engineering, but it surely would have sold fewer copies. On the other hand, if sales were the primary objective, we would have titled this Nonparametric Statistics for Dummies or maybe Nonparametric Statistics with Pictures of Naked People.

1.1 EFFICIENCY OF NONPARAMETRIC METHODS

It would be a mistake to think that nonparametric procedures are simpler than their parametric counterparts. On the contrary, a primary criticism of using parametric methods in statistical analysis is that they oversimplify the population or process we are observing. Indeed, parametric families are useful not because they are perfectly appropriate, but rather because they are perfectly convenient.

When the parametric assumptions hold, nonparametric methods are inherently less powerful than parametric methods. This must be true because the parametric methods are assuming more information to construct inferences about the data. In these cases the nonparametric estimators are inefficient, where the efficiencies of two estimators are assessed by comparing their variances for the same sample size. In hypothesis testing, for example, this inefficiency of one method relative to another is measured in power.

However, even when the parametric assumptions hold perfectly true, we will see that nonparametric methods are only slightly less powerful than the more presumptuous statistical methods. Furthermore, if the parametric assumptions about the data fail to hold, only the nonparametric method is valid. A t-test between the means of two normal populations can be dangerously misleading if the underlying data are not actually normally distributed.

Some examples of the relative efficiency of nonparametric tests are listed in Table 1.1, where asymptotic relative efficiency (A.R.E.) is used to compare parametric procedures (2nd column) with their nonparametric counterparts (3rd column). Asymptotic relative efficiency describes the relative efficiency of two estimators of a parameter as the sample size approaches infinity. The A.R.E. is listed for the normal distribution, where parametric assumptions are justified, and the double-exponential distribution. For example, if the underlying data are normally distributed, the t-test requires 955 observations in order to have the same power as the Wilcoxon signed-rank test based on 1000 observations.

Table 1.1 Asymptotic relative efficiency (A.R.E.) of some nonparametric tests

                  Parametric       Nonparametric    A.R.E.     A.R.E.
                  Test             Test             (normal)   (double exp.)
2-Sample Test     t-test           Mann-Whitney     0.955      1.50
3-Sample Test     one-way layout   Kruskal-Wallis   0.864      1.50
Variances Test    F-test           Conover          0.760      1.08

Parametric assumptions allow us to extrapolate away from the data. For example, it is hardly uncommon for an experimenter to make inferences about a population's extreme upper percentile (say the 99th percentile) with a sample so small that none of the observations would be expected to exceed that percentile. If the assumptions are not justified, this is grossly unscientific.

Nonparametric methods are seldom used to extrapolate outside the range of observed data. In a typical nonparametric analysis, little or nothing can be said about the probability of obtaining future data beyond the largest sampled observation or less than the smallest one. For this reason, the actual measurement of a sample item means less than its rank within the sample. In fact, nonparametric methods are typically based on ranks of the data, and properties of the population are deduced using order statistics (Chapter 5). The measurement scales for typical data are

Nominal Scale: Numbers used only to categorize outcomes (e.g., we might define a random variable to equal one in the event a coin flips heads, and zero if it flips tails).

Ordinal Scale: Numbers can be used to order outcomes (e.g., the event X is greater than the event Y if X = medium and Y = small).

Interval Scale: Order between numbers as well as distances between numbers are used to compare outcomes.

Only interval scale measurements can be used by parametric methods. Nonparametric methods based on ranks can use ordinal scale measurements, and simpler nonparametric techniques can be used with nominal scale measurements.

The binomial distribution is characterized by counting the number of independent observations that are classified into a particular category. Binomial data can be formed from measurements based on a nominal scale of measurements, thus binomial models are the most frequently encountered models in nonparametric analysis. For this reason, Chapter 3 includes a special emphasis on statistical estimation and testing associated with binomial samples.

1.2 OVERCONFIDENCE BIAS

Be slow to believe what you worst want to be true.
Samuel Pepys

Confirmation Bias or Overconfidence Bias describes our tendency to search for or interpret information in a way that confirms our preconceptions. Business and finance have shown interest in this psychological phenomenon (Tversky and Kahneman, 1974) because it has proven to have a significant effect on personal and corporate financial decisions, where the decision maker will actively seek out and give extra weight to evidence that confirms a hypothesis they already favor. At the same time, the decision maker tends to ignore evidence that contradicts or disconfirms their hypothesis.

Overconfidence bias has a natural tendency to affect an experimenter's data analysis for the same reasons. While the dictates of the experiment and the data sampling should reduce the possibility of this problem, one of the clear pathways open to such bias is the infusion of parametric assumptions into the data analysis. After all, if the assumptions seem plausible, the researcher has much to gain from the extra certainty that comes from the assumptions in terms of narrower confidence intervals and more powerful statistical tests.

Nonparametric procedures serve as a buffer against this human tendency of looking for the evidence that best supports the researcher's underlying hypothesis. Given the subjective interests behind many corporate research findings, nonparametric methods can help alleviate doubt about their validity in cases when these procedures give statistical significance to the corporation's claims.

1.3 COMPUTING WITH MATLAB

Because a typical nonparametric analysis can be computationally intensive, computer support is essential to understand both theory and applications. Numerous software products can be used to complete exercises and run nonparametric analysis in this textbook, including SAS, R, S-Plus, MINITAB, StatXact and JMP (to name a few). A student familiar with one of these platforms can incorporate it with the lessons provided here, and without too much extra work.

It must be stressed, however, that demonstrations in this book rely entirely on a single software tool called MATLAB® (by MathWorks Inc.) that is used widely in engineering and the physical sciences. MATLAB (short for MATrix LABoratory) is a flexible programming tool that is widely popular in engineering practice and research. The program environment features a user-friendly front end and includes menus for easy implementation of program commands. MATLAB is available on Unix systems, Microsoft Windows and Apple Macintosh. If you are unfamiliar with MATLAB, in the first appendix we present a brief tutorial along with a short description of some MATLAB procedures that are used to solve analytical problems and demonstrate nonparametric methods in this book. For a more comprehensive guide, we recommend the handy little book MATLAB Primer (Sigmon and Davis, 2002).

We hope that many students of statistics will find this book useful, but it was written primarily with the scientist and engineer in mind. With nothing against statisticians (some of our best friends know statisticians), our approach emphasizes the application of the method over its mathematical theory. We have intentionally made the text less heavy with theory and instead emphasized applications and examples. If you come into this course thinking the history of nonparametric statistics is dry and unexciting, you are probably right, at least compared to the history of ancient Rome, the British monarchy or maybe even Wayne Newton (1). Nonetheless, we made efforts to convince you otherwise by noting the interesting historical context of the research and the personalities behind its development. For example, we will learn more about Karl Pearson (1857-1936) and R. A. Fisher (1890-1962), legendary scientists and competitive arch-rivals, who both contributed greatly to the foundation of nonparametric statistics through their separate research directions.

(1) Strangely popular Las Vegas entertainer.

Fig. 1.2 "Doubt is not a pleasant condition, but certainty is absurd" - Francois Marie Voltaire (1694-1778).

In short, this book features techniques of data analysis that rely less on the assumptions of the data's good behavior, the very assumptions that can get researchers in trouble. Science's gravitation toward distribution-free techniques is due to both a deeper awareness of experimental uncertainty and the availability of ever-increasing computational abilities to deal with the implied ambiguities in the experimental outcome. The quote from Voltaire (Figure 1.2) exemplifies the attitude toward uncertainty: as science progresses, we are able to see some truths more clearly, but at the same time, we uncover more uncertainties and more things become less "black and white".

1.4 EXERCISES

1.1. Describe a potential data analysis in engineering where parametric methods are appropriate. How would you defend this assumption?

1.2. Describe another potential data analysis in engineering where parametric methods may not be appropriate. What might prevent you from using parametric assumptions in this case?

1.3. Describe three ways in which overconfidence bias can affect the statistical analysis of experimental data. How can this problem be overcome?

REFERENCES

Bradley, J. V. (1968), Distribution Free Statistical Tests, Englewood Cliffs, NJ: Prentice Hall.
Conover, W. J. (1999), Practical Nonparametric Statistics, New York: Wiley.
Randles, R. H., Hettmansperger, T. P., and Casella, G. (2004), "Introduction to the Special Issue: Nonparametric Statistics," Statistical Science, 19, 561-562.
Sigmon, K., and Davis, T. A. (2002), MATLAB Primer, 6th Edition, MathWorks, Inc., Boca Raton, FL: CRC Press.
Tversky, A., and Kahneman, D. (1974), "Judgment Under Uncertainty: Heuristics and Biases," Science, 185, 1124-1131.
Wasserman, L. (2006), All of Nonparametric Statistics, New York: Springer Verlag.
Wolfowitz, J. (1942), "Additive Partition Functions and a Class of Statistical Hypotheses," Annals of Mathematical Statistics, 13, 247-279.
2 Probability Basics

Probability theory is nothing but common sense reduced to calculation.
Pierre Simon Laplace (1749-1827)

In these next two chapters, we review some fundamental concepts of elementary probability and statistics. If you think you can use these chapters to catch up on all the statistics you forgot since you passed "Introductory Statistics" in your college sophomore year, you are acutely mistaken. What is offered here is an abbreviated reference list of definitions and formulas that have applications to nonparametric statistical theory. Some parametric distributions, useful for models in both parametric and nonparametric procedures, are listed but the discussion is abridged.

2.1 HELPFUL FUNCTIONS

• Permutations. The number of arrangements of $n$ distinct objects is $n! = n(n-1)\cdots(2)(1)$. In MATLAB: factorial(n).

• Combinations. The number of distinct ways of choosing $k$ items from a set of $n$ is $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$. In MATLAB: nchoosek(n,k).

• Gamma Function. $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\,dx$, $t > 0$, is called the gamma function. If $t$ is a positive integer, $\Gamma(t) = (t-1)!$. In MATLAB: gamma(t).

• Incomplete Gamma is defined as $\gamma(t,z) = \int_0^z x^{t-1} e^{-x}\,dx$. The upper tail Incomplete Gamma is defined as $\Gamma(t,z) = \int_z^\infty x^{t-1} e^{-x}\,dx$. In MATLAB, gammainc(z,t) and gammainc(z,t,'upper') return the normalized versions $\gamma(t,z)/\Gamma(t)$ and $\Gamma(t,z)/\Gamma(t)$; note that the integration limit $z$ is the first argument. If $t$ is an integer, $\Gamma(t,z) = (t-1)!\, e^{-z} \sum_{i=0}^{t-1} z^i/i!$.

• Beta Function. $B(a,b) = \int_0^1 t^{a-1}(1-t)^{b-1}\,dt = \Gamma(a)\Gamma(b)/\Gamma(a+b)$. In MATLAB: beta(a,b).

• Incomplete Beta. $B(x,a,b) = \int_0^x t^{a-1}(1-t)^{b-1}\,dt$, $0 \le x \le 1$. In MATLAB: betainc(x,a,b) represents the normalized Incomplete Beta, defined as $I_x(a,b) = B(x,a,b)/B(a,b)$.

• Floor Function. $\lfloor a \rfloor$ denotes the greatest integer $\le a$. In MATLAB: floor(a).

• Geometric Series. $\sum_{j=0}^{n} p^j = \frac{1-p^{n+1}}{1-p}$, so that for $|p| < 1$, $\sum_{j=0}^{\infty} p^j = \frac{1}{1-p}$.

• Stirling's Formula. To approximate the value of a large factorial, $n! \approx \sqrt{2\pi}\, e^{-n} n^{n+1/2}$.

• Common Limit for e. For a constant $\alpha$, $\lim_{x \to 0} (1+\alpha x)^{1/x} = e^{\alpha}$. This can also be expressed as $(1+\alpha/n)^n \to e^{\alpha}$ as $n \to \infty$.
EVENTS,PROBABILITIESANDRANDOMVARIABLES110Kewton'sFormula.Forapositiveintegern.(u+b)"=2(Y)ajb"-j.j=O0TaylorSeriesExpansion.Forafunctionf(x).itsTaylorseriesexpansionaboutx=aisdefinedas(x-u)2-a)+j"'(a)~+.2!wherefcm)(a)denotesrnthderivativeoffevaluatedataand,forsome7ibetweenuandx,0ConvexFunction.Afunctionhisconvexifforany05cv51.h(ax+(1-Q)Y)I~L(z)+(I-~)h(y).forallvaluesofxandy.Ifhistwicedifferentiable.thenhisconvexifh"(x)20.Also,if-hisconvex.thenhissaidtobeconcave.0BesselFunction.Jn(x)isdefinedasthesolutiontotheequationInMATLAB:bessel(n,x).2.2EVENTS,PROBABILITIESANDRANDOMVARIABLES0ThecondataonalprobabalatyofaneventAoccurringgiventhateventBoccursisP(AIB)=P(AB)/P(B),whereABrepresentstheintersectionofeventsAandB.andP(B)>0.0EventsAandBarestochasticallyzndependentifaridonlyifP(A1B)=P(B)orequivalently,P(AB)=P(A)P(B).0LawofTotalProbabalaty.LetAl,...,AkbeapartitionofthesamplespaceR,i.e.,A1uA2u...uAI,=RandA,A,=8forz#3.ForeventB.P(B)=c,P(BIA,)P(A,).0BayesFormula.ForaneventBwhereP(B)#0,andpartition 12PROBABILITYBASICS(A1.....Ak)of0,Afunctionthatassignsrealnumberstopointsinthesamplespaceofeventsiscalledarandomvarzable.’ForarandomvariableX.Fx(z)=P(X5z)representsits(cumu-lative)dzstrzbutzonfunctzon,whichisnon-decreasingwithF(-x)=0andF(x)=1.Inthisbook,itwilloftenbedenotedsimplyasCDF.ThesurvzvorfunctzonisdefinedasS(z)=1-F(z).IftheCDF’sderivativeexists.f(z)=aF(z)/dzrepresentstheproba-bzlztydensztyfunctaon,orPDF.AdzscreterandomvarzableisonewhichcantakeonacountablesetofvaluesXE{zl.x2.s3....}sothatFx(z)=C,,,P(X=t).OverthesupportX.theprobabilityP(X=2,)iscalledtheprobabilitymassfunction.orPMF.Acontznuousrandomvarzableisonewhichtakesonanyrealvalueinaninterval,soP(XEA)=s,f(z)dz,wheref(z)isthedensityfunctionofx.FortworandomvariablesXandY.theirgozntdzstrabutzonfunctzonisFx,y(z.y)=P(X5s,Y5y).Ifthevariablesarecontinuous,onecandefinejointdensityfunctionfx,y(s.y)as&Fxy(z.y).TheconditionaldensityofX.givenY=yisf(z1y)=fx,y(x,y)/fy(y).wherefy(y)isthedensityofY.TworandomvariablesXandY,withdistributionsFXa
ndFy,areznde-pendentifthejointdistributionFx,~of(X.Y)issuchthatFXy(s%y)=Fx(z)Fy(y).ForanysequenceofrandomvariablesXI,...,X,thatareindependentwiththesame(identical)marginaldistribution,wewillde-notethisusingz.a.d.2.3NUMERICALCHARACTERISTICSOFRANDOMVARIABLESForarandomvariableXwithdistributionfunctionFx.theexpectedvalueofsomefunction@(X)isdefinedasIE(d(X))=sd(s)dFx(s).If‘WhilewritingtheirearlytextbooksinStatistics,J.DoobandWilliamFellerdebatedonwhethertousethisterm.Doobsaid,“IhadanargumentwithFeller.HeassertedthateveryonesaidrandomvariableandIassertedthateveryonesaidchancevariable.Weobviouslyhadtousethesamenameinourbooks,sowedecidedtheissuebyastochasticprocedure.Thatis.wetossedforitandhewon.” NUMERICALCHARACTERISTICSOFRANDOMVARIABLES13FXiscontinuouswithdensityf~(z)>thenE(@(X))=Q(x)fx(z)dx.IfXisdiscrete,thenE(@(X))=c,@(x)P(X=A).ThekthmomentofXisdenotedasEX‘.Thekthmomentaboutthemean,orkthcentralmomentofXisdefinedasE(X-P)~.wherep=EX.ThevaraanceofarandomvariableXisthesecondcentralmoment,VarX=E(X-p)’=EX2-(EX)’.Often,thevarianceisdenotedbyi$,orsimplyby0’whenitisclearwhichrandomvariableisinvolved.Thesquarerootofvariance,gx=dw3iscalledthestandarddevi-ationofX.With05p51.thepthquantaleofF.denotedxPisthevaluexsuchthatP(X5x)2pandP(X2J)21-p.IftheCDFFisinvertible,thenxp=F-l(p).The0.5t”quantileiscalledthemedaanofF.FortworandomvariablesXandY.thecovaraanceofXandYisde-finedasCov(X,Y)=E[(X-px)(Y-py)].wherepxandpyaretherespectiveexpectationsofXandY.FortworandomvariablesXandYwithcovariance@ov(X,Y),thecorrelataoncoeficaentisdefinedas@ov(X.Y)@orr(X,Y)=oxOYwhereOXandCTYaretherespectivestandarddeviationsofXandY.Notethat-15pL1isaconsequenceoftheCauchy-Schwartzinequal-ity(Section2.8).ThecharacterastacfunctaonofarandomvariableXisdefinedaspx(t)==Ee‘tX=1e“t”d~(z)ThemomentgeneratangfunctaonofarandomvariableXisdefinedaswhenevertheintegralexists.BydifferentiatingTtimesandlettingt--f0wehavethattl‘--mx(O)=EXT.dt’TheconditionalexpectationofarandomvariableXisgivenY=yisdefinedasE(XIY=,,/)=J’xf(z(y)d.r:. 
where $f(x\mid y)$ is a conditional density of $X$ given $Y$. If the value of $Y$ is not specified, the conditional expectation $\mathbb{E}(X\mid Y)$ is a random variable and its expectation is $\mathbb{E}X$, that is, $\mathbb{E}(\mathbb{E}(X\mid Y))=\mathbb{E}X$.

2.4 DISCRETE DISTRIBUTIONS

Ironically, parametric distributions have an important role to play in the development of nonparametric methods. Even if we are analyzing data without making assumptions about the distributions that generate the data, these parametric families appear nonetheless. In counting trials, for example, we can generate well-known discrete distributions (e.g., binomial, geometric) assuming only that the counts are independent and probabilities remain the same from trial to trial.

2.4.1 Binomial Distribution

A simple Bernoulli random variable $Y$ is dichotomous with $P(Y=1)=p$ and $P(Y=0)=1-p$ for some $0\le p\le 1$. It is denoted as $Y\sim\mathcal{B}er(p)$. Suppose an experiment consists of $n$ independent trials $(Y_1,\dots,Y_n)$ in which two outcomes are possible (e.g., success or failure), with $P(\text{success})=P(Y=1)=p$ for each trial. If $X=x$ is defined as the number of successes (out of $n$), then $X=Y_1+Y_2+\cdots+Y_n$ and there are $\binom{n}{x}$ arrangements of $x$ successes and $n-x$ failures, each having the same probability $p^x(1-p)^{n-x}$. $X$ is a binomial random variable with probability mass function
$$p_X(x)=\binom{n}{x}p^x(1-p)^{n-x},\quad x=0,1,\dots,n.$$
This is denoted by $X\sim\mathcal{B}in(n,p)$. From the moment generating function $m_X(t)=(pe^t+(1-p))^n$, we obtain $\mu=\mathbb{E}X=np$ and $\sigma^2=\operatorname{Var}X=np(1-p)$.

The cumulative distribution for a binomial random variable is not simplified beyond the sum; i.e., $F(x)=\sum_{i\le x}p_X(i)$. However, interval probabilities can be computed in MATLAB using binocdf(x,n,p), which computes the cumulative distribution function at value $x$. The probability mass function is also computed in MATLAB using binopdf(x,n,p). A "quick-and-dirty" plot of a binomial PDF can be achieved through the MATLAB function binoplot.
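The binomial formulas above can be checked numerically. The following is a minimal sketch in base MATLAB (only nchoosek is used); the values of n and p are illustrative assumptions, not taken from the text.

```matlab
% Sketch: check the Bin(n,p) PMF, mean and variance numerically.
% n and p are illustrative values chosen for this example only.
n = 10;  p = 0.3;
x = 0:n;

% PMF from the counting argument: C(n,x) * p^x * (1-p)^(n-x)
pmf = arrayfun(@(k) nchoosek(n,k) * p^k * (1-p)^(n-k), x);

total  = sum(pmf);                 % should equal 1
mu     = sum(x .* pmf);            % should equal n*p
sigma2 = sum((x - mu).^2 .* pmf);  % should equal n*p*(1-p)
cdf3   = sum(pmf(x <= 3));         % F(3), the partial sum of the PMF

fprintf('sum = %.4f, mean = %.4f, var = %.4f, F(3) = %.4f\n', ...
        total, mu, sigma2, cdf3);
```

If the Statistics Toolbox is available, cdf3 should agree with binocdf(3,n,p) and pmf with binopdf(x,n,p).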
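2.4.2 Poisson Distribution

The probability mass function for the Poisson distribution is
$$p_X(x)=\frac{\lambda^x}{x!}e^{-\lambda},\quad x=0,1,2,\dots$$
This is denoted by $X\sim\mathcal{P}(\lambda)$. From $m_X(t)=\exp\{\lambda(e^t-1)\}$, we have $\mathbb{E}X=\lambda$ and $\operatorname{Var}X=\lambda$; the mean and the variance coincide.

The sum of a finite independent set of Poisson variables is also Poisson. Specifically, if $X_i\sim\mathcal{P}(\lambda_i)$, then $Y=X_1+\cdots+X_k$ is distributed as $\mathcal{P}(\lambda_1+\cdots+\lambda_k)$. Furthermore, the Poisson distribution is a limiting form for a binomial model, i.e.,
$$\lim_{n\to\infty,\;np\to\lambda}\binom{n}{x}p^x(1-p)^{n-x}=\frac{1}{x!}\lambda^x e^{-\lambda}.$$
MATLAB commands for Poisson CDF, PDF, quantile, and a random number are: poisscdf, poisspdf, poissinv, and poissrnd.

2.4.3 Negative Binomial Distribution

Suppose we are dealing with i.i.d. trials again, this time counting the number of successes observed until a fixed number of failures ($k$) occur. If we observe $k$ consecutive failures at the start of the experiment, for example, the count is $X=0$ and $P_X(0)=p^k$, where $p$ is the probability of failure. If $X=x$, we have observed $x$ successes and $k$ failures in $x+k$ trials. There are $\binom{x+k}{x}$ different ways of arranging those $x+k$ trials, but we can only be concerned with the arrangements in which the last trial ended in a failure. So there are really only $\binom{x+k-1}{x}$ arrangements, each equal in probability. With this in mind, the probability mass function is
$$p_X(x)=\binom{k+x-1}{x}p^k(1-p)^x,\quad x=0,1,2,\dots$$
This is denoted by $X\sim\mathcal{NB}(k,p)$. From its moment generating function
$$m(t)=\left(\frac{p}{1-(1-p)e^t}\right)^k,$$
the expectation of a negative binomial random variable is $\mathbb{E}X=k(1-p)/p$ and the variance is $\operatorname{Var}X=k(1-p)/p^2$. MATLAB commands for negative binomial CDF, PDF, quantile, and a random number are: nbincdf, nbinpdf, nbininv, and nbinrnd.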
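As a rough check of the counting argument for the negative binomial, the sketch below builds the PMF with $\binom{k+x-1}{x}$ arrangements and compares its moments with $k(1-p)/p$ and $k(1-p)/p^2$. The values of k and p and the truncation point of the infinite support are assumptions made for illustration.

```matlab
% Sketch: negative binomial PMF from the counting argument,
% P(X = x) = C(k+x-1, x) * (1-p)^x * p^k, with p the failure probability.
k = 3;  p = 0.4;            % illustrative values (assumption)
x = 0:500;                  % truncation of the infinite support (assumption)

% Work on the log scale; gammaln avoids overflow in the binomial coefficient.
logpmf = gammaln(k + x) - gammaln(x + 1) - gammaln(k) ...
         + x * log(1 - p) + k * log(p);
pmf = exp(logpmf);

total  = sum(pmf);                 % close to 1
mu     = sum(x .* pmf);            % close to k*(1-p)/p
sigma2 = sum((x - mu).^2 .* pmf);  % close to k*(1-p)/p^2

fprintf('sum = %.6f, mean = %.4f, var = %.4f\n', total, mu, sigma2);
```

With the Statistics Toolbox, pmf should match nbinpdf(x,k,p), since that function uses the same form of the PMF.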
2.4.4 Geometric Distribution

The special case of negative binomial for $k=1$ is called the geometric distribution. Random variable $X$ has geometric $\mathcal{G}(p)$ distribution if its probability mass function is
$$p_X(x)=p(1-p)^x,\quad x=0,1,2,\dots$$
If $X$ has geometric $\mathcal{G}(p)$ distribution, its expected value is $\mathbb{E}X=(1-p)/p$ and variance $\operatorname{Var}X=(1-p)/p^2$. The geometric random variable can be considered as the discrete analog to the (continuous) exponential random variable because it possesses a "memoryless" property. That is, if we condition on $X\ge m$ for some non-negative integer $m$, then for $n\ge m$, $P(X\ge n\mid X\ge m)=P(X\ge n-m)$. MATLAB commands for geometric CDF, PDF, quantile, and a random number are: geocdf, geopdf, geoinv, and geornd.

2.4.5 Hypergeometric Distribution

Suppose a box contains $m$ balls, $k$ of which are white and $m-k$ of which are gold. Suppose we randomly select and remove $n$ balls from the box without replacement, so that when we finish, there are only $m-n$ balls left. If $X$ is the number of white balls chosen (without replacement) from $n$, then
$$p_X(x)=\frac{\binom{k}{x}\binom{m-k}{n-x}}{\binom{m}{n}},\quad x\in\{0,1,\dots,\min\{n,k\}\}.$$
This probability mass function can be deduced with counting rules. There are $\binom{m}{n}$ different ways of selecting the $n$ balls from a box of $m$. From these (each equally likely), there are $\binom{k}{x}$ ways of selecting $x$ white balls from the $k$ white balls in the box, and similarly $\binom{m-k}{n-x}$ ways of choosing the gold balls.

It can be shown that the mean and variance for the hypergeometric distribution are, respectively,
$$\mathbb{E}(X)=\mu=\frac{nk}{m}\quad\text{and}\quad\operatorname{Var}(X)=\sigma^2=\left(\frac{nk}{m}\right)\left(\frac{m-k}{m}\right)\left(\frac{m-n}{m-1}\right).$$
MATLAB commands for hypergeometric CDF, PDF, quantile, and a random number are: hygecdf, hygepdf, hygeinv, and hygernd.

2.4.6 Multinomial Distribution

The binomial distribution is based on dichotomizing event outcomes. If the outcomes can be classified into $k\ge 2$ categories, then out of $n$ trials, we have $X_i$ outcomes falling in the category $i$, $i=1,\dots,k$. The probability mass
function for the vector $(X_1,\dots,X_k)$ is
$$p_{X_1,\dots,X_k}(x_1,\dots,x_k)=\frac{n!}{x_1!\cdots x_k!}\,p_1^{x_1}\cdots p_k^{x_k},$$
where $p_1+\cdots+p_k=1$, so there are $k-1$ free probability parameters to characterize the multivariate distribution. This is denoted by $X=(X_1,\dots,X_k)\sim\mathcal{M}n(n,p_1,\dots,p_k)$.

The mean and variance of $X_i$ are the same as for a binomial because this is the marginal distribution of $X_i$, i.e., $\mathbb{E}(X_i)=np_i$, $\operatorname{Var}(X_i)=np_i(1-p_i)$. The covariance between $X_i$ and $X_j$ is $\operatorname{Cov}(X_i,X_j)=-np_ip_j$ because $\mathbb{E}(X_iX_j)=\mathbb{E}(\mathbb{E}(X_iX_j\mid X_j))=\mathbb{E}(X_j\mathbb{E}(X_i\mid X_j))$ and conditional on $X_j=x_j$, $X_i$ is binomial $\mathcal{B}in(n-x_j,\,p_i/(1-p_j))$. Thus, $\mathbb{E}(X_iX_j)=\mathbb{E}(X_j(n-X_j))\,p_i/(1-p_j)$, and the covariance follows from this.

2.5 CONTINUOUS DISTRIBUTIONS

Discrete distributions are often associated with nonparametric procedures, but continuous distributions will play a role in how we learn about nonparametric methods. The normal distribution, of course, can be produced in a sample mean when the sample size is large, as long as the underlying distribution of the data has finite mean and variance. Many other distributions will be referenced throughout the text book.

2.5.1 Exponential Distribution

The probability density function for an exponential random variable is
$$f_X(x)=\lambda e^{-\lambda x},\quad x>0,\ \lambda>0.$$
An exponentially distributed random variable $X$ is denoted by $X\sim\mathcal{E}(\lambda)$. Its moment generating function is $m(t)=\lambda/(\lambda-t)$ for $t<\lambda$. [...] $P(X\ge t\mid X\ge x)=P(X\ge t-x)$ for $t\ge x$.

The median value, representing a typical observation, is roughly 70% of the mean, showing how extreme values can affect the population mean. This is easily shown because of the ease with which the inverse CDF is computed:
$$p\equiv F(x;\lambda)=1-e^{-\lambda x}\iff F^{-1}(p)\equiv x_p=-\frac{1}{\lambda}\log(1-p).$$
MATLAB commands for exponential CDF, PDF, quantile, and a random number are: expcdf, exppdf, expinv, and exprnd. MATLAB uses the alternative parametrization with $1/\lambda$ in place of $\lambda$. For example, the CDF of a random variable with $X\sim\mathcal{E}(3)$ distribution evaluated at $x=2$ is calculated in MATLAB as expcdf(2,1/3).

2.5.2 Gamma Distribution

The gamma distribution is an extension of the exponential distribution. Random variable $X$ has gamma $\mathcal{G}amma(r,\lambda)$ distribution if its probability density function is given by
$$f_X(x)=\frac{\lambda^r}{\Gamma(r)}x^{r-1}e^{-\lambda x},\quad x>0,\ r>0,\ \lambda>0.$$
The moment generating function is $m(t)=(\lambda/(\lambda-t))^r$, so in the case $r=1$, gamma is precisely the exponential distribution. From $m(t)$ we have $\mathbb{E}X=r/\lambda$ and $\operatorname{Var}X=r/\lambda^2$.

If $X_1,\dots,X_n$ are generated from an exponential distribution with (rate) parameter $\lambda$, it follows from $m(t)$ that $Y=X_1+\cdots+X_n$ is distributed gamma with parameters $\lambda$ and $n$; that is, $Y\sim\mathcal{G}amma(n,\lambda)$. Often, the gamma distribution is parameterized with $1/\lambda$ in place of $\lambda$, and this alternative parametrization is used in MATLAB definitions. The CDF in MATLAB is gamcdf(x,r,1/lambda), and the PDF is gampdf(x,r,1/lambda). The function gaminv(p,r,1/lambda) computes the $p$th quantile of the gamma.

2.5.3 Normal Distribution

The probability density function for a normal random variable with mean $\mathbb{E}X=\mu$ and variance $\operatorname{Var}X=\sigma^2$ is
$$f_X(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{1}{2\sigma^2}(x-\mu)^2},\quad -\infty<x<\infty.$$
The distribution function is computed using integral approximation because no closed form exists for the anti-derivative; this is generally not a problem for practitioners because most software packages will compute interval probabilities numerically. For example, in MATLAB, normcdf(x,mu,sigma) and normpdf(x,mu,sigma) find the CDF and PDF at $x$, and norminv(p,mu,sigma) computes the inverse CDF with quantile probability $p$. A random variable $X$ with the normal distribution will be denoted $X\sim\mathcal{N}(\mu,\sigma^2)$.

The central limit theorem (formulated in a later section of this chapter) elevates the status of the normal distribution above other distributions. Despite its difficult formulation, the normal is one of the most important distributions in all science, and it has a critical role to play in nonparametric statistics. Any linear combination of normal random variables (independent or with simple covariance structures) is also normally distributed. In such sums, then, we need only keep track of the mean and variance, because these two parameters completely characterize the distribution. For example, if $X_1,\dots,X_n$ are i.i.d. $\mathcal{N}(\mu,\sigma^2)$, then the sample mean $\bar X=(X_1+\cdots+X_n)/n$ has a $\mathcal{N}(\mu,\sigma^2/n)$ distribution.

2.5.4 Chi-square Distribution

The probability density function for a chi-square random variable with the parameter $k$, called the degrees of freedom, is
$$f_X(x)=\frac{2^{-k/2}}{\Gamma(k/2)}\,x^{k/2-1}e^{-x/2},\quad x>0.$$
The chi-square distribution ($\chi^2$) is a special case of the gamma distribution with parameters $r=k/2$ and $\lambda=1/2$. Its mean and variance are $\mathbb{E}X=\mu=k$ and $\operatorname{Var}X=\sigma^2=2k$.

If $Z\sim\mathcal{N}(0,1)$, then $Z^2\sim\chi_1^2$, that is, a chi-square random variable with one degree of freedom. Furthermore, if $U\sim\chi_m^2$ and $V\sim\chi_n^2$ are independent, then $U+V\sim\chi_{m+n}^2$.

From these results, it can be shown that if $X_1,\dots,X_n\sim\mathcal{N}(\mu,\sigma^2)$ and $\bar X$ is the sample mean, then the sample variance $S^2=\sum_i(X_i-\bar X)^2/(n-1)$ is proportional to a chi-square random variable with $n-1$ degrees of freedom:
$$\frac{(n-1)S^2}{\sigma^2}\sim\chi_{n-1}^2.$$
In MATLAB, the CDF and PDF for a $\chi_k^2$ are chi2cdf(x,k) and chi2pdf(x,k). The $p$th quantile of the $\chi_k^2$ distribution is chi2inv(p,k).

2.5.5 (Student) t-Distribution

Random variable $X$ has Student's $t$ distribution with $k$ degrees of freedom, $X\sim t_k$, if its probability density function is
$$f_X(x)=\frac{\Gamma\left(\frac{k+1}{2}\right)}{\sqrt{k\pi}\,\Gamma(k/2)}\left(1+\frac{x^2}{k}\right)^{-\frac{k+1}{2}},\quad -\infty<x<\infty.$$
The t-distribution(2) is similar in shape to the standard normal distribution except for the fatter tails. If $X\sim t_k$, $\mathbb{E}X=0$ for $k>1$ and $\operatorname{Var}X=k/(k-2)$ for $k>2$. For $k=1$, the $t$ distribution coincides with the Cauchy distribution.

(2) William Sealy Gosset derived the t-distribution in 1908 under the pen name "Student" (Gosset, 1908). He was a researcher for Guinness Brewery, which forbade any of their workers to publish "company secrets".

The t-distribution has an important role to play in statistical inference. With a set of i.i.d. $X_1,\dots,X_n\sim\mathcal{N}(\mu,\sigma^2)$, we can standardize the sample mean using the simple transformation $Z=(\bar X-\mu)/\sigma_{\bar X}=\sqrt{n}(\bar X-\mu)/\sigma$. However, if the variance is unknown, by using the same transformation except substituting the sample standard deviation $S$ for $\sigma$, we arrive at a t-distribution with $n-1$ degrees of freedom:
$$T=\frac{\bar X-\mu}{S/\sqrt{n}}\sim t_{n-1}.$$
More technically, if $Z\sim\mathcal{N}(0,1)$ and $Y\sim\chi_k^2$ are independent, then $T=Z/\sqrt{Y/k}\sim t_k$. In MATLAB, the CDF at $x$ for a t-distribution with $k$ degrees of freedom is calculated as tcdf(x,k), and the PDF is computed as tpdf(x,k). The $p$th percentile is computed with tinv(p,k).

2.5.6 Beta Distribution

The density function for a beta random variable is
$$f_X(x)=\frac{1}{B(a,b)}\,x^{a-1}(1-x)^{b-1},\quad 0<x<1,\ a>0,\ b>0,$$
and $B$ is the beta function. Because $X$ is defined only in $(0,1)$, the beta distribution is useful in describing uncertainty or randomness in proportions or probabilities. A beta-distributed random variable is denoted by $X\sim\mathcal{B}e(a,b)$. The Uniform distribution on $(0,1)$, denoted as $\mathcal{U}(0,1)$, serves as a special case
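with $(a,b)=(1,1)$. The beta distribution has moments
$$\mathbb{E}X^k=\frac{\Gamma(a+k)\,\Gamma(a+b)}{\Gamma(a)\,\Gamma(a+b+k)},$$
so that $\mathbb{E}(X)=a/(a+b)$ and $\operatorname{Var}X=ab/[(a+b)^2(a+b+1)]$.

In MATLAB, the CDF for a beta random variable (at $x\in(0,1)$) is computed with betacdf(x,a,b) and the PDF is computed with betapdf(x,a,b). The $p$th percentile is computed with betainv(p,a,b). If the mean $\mu$ and variance $\sigma^2$ for a beta random variable are known, then the basic parameters $(a,b)$ can be determined as
$$a=\mu\left(\frac{\mu(1-\mu)}{\sigma^2}-1\right)\quad\text{and}\quad b=(1-\mu)\left(\frac{\mu(1-\mu)}{\sigma^2}-1\right).\qquad(2.2)$$

2.5.7 Double Exponential Distribution

Random variable $X$ has double exponential $\mathcal{DE}(\mu,\lambda)$ distribution if its density is given by
$$f_X(x)=\frac{\lambda}{2}\,e^{-\lambda|x-\mu|},\quad -\infty<x<\infty,\ \lambda>0.$$
The expectation of $X$ is $\mathbb{E}X=\mu$ and the variance is $\operatorname{Var}X=2/\lambda^2$. The moment generating function for the double exponential distribution is
$$m(t)=\frac{\lambda^2e^{\mu t}}{\lambda^2-t^2},\quad |t|<\lambda.$$
Double exponential is also called the Laplace distribution. If $X_1$ and $X_2$ are independent $\mathcal{E}(\lambda)$, then $X_1-X_2$ is distributed as $\mathcal{DE}(0,\lambda)$. Also, if $X\sim\mathcal{DE}(0,\lambda)$ then $|X|\sim\mathcal{E}(\lambda)$.

2.5.8 Cauchy Distribution

The Cauchy distribution is symmetric and bell-shaped like the normal distribution, but with much heavier tails. For this reason, it is a popular distribution to use in nonparametric procedures to represent non-normality. Because the distribution is so spread out, it has no mean and variance (none of the Cauchy moments exist). Physicists know this as the Lorentz distribution. If $X\sim\mathcal{C}a(a,b)$, then $X$ has density
$$f_X(x)=\frac{1}{\pi}\,\frac{b}{b^2+(x-a)^2},\quad -\infty<x<\infty.$$
The moment generating function for the Cauchy distribution does not exist but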
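its characteristic function is $\mathbb{E}e^{itX}=\exp\{iat-b|t|\}$. The $\mathcal{C}a(0,1)$ coincides with the t-distribution with one degree of freedom.

The Cauchy is also related to the normal distribution. If $Z_1$ and $Z_2$ are two independent $\mathcal{N}(0,1)$ random variables, then $C=Z_1/Z_2\sim\mathcal{C}a(0,1)$. Finally, if $C_i\sim\mathcal{C}a(a_i,b_i)$ for $i=1,\dots,n$ are independent, then $S_n=C_1+\cdots+C_n$ is distributed Cauchy with parameters $a_S=\sum_i a_i$ and $b_S=\sum_i b_i$.

2.5.9 Inverse Gamma Distribution

Random variable $X$ is said to have an inverse gamma $\mathcal{IG}(r,\lambda)$ distribution with parameters $r>0$ and $\lambda>0$ if its density is given by
$$f_X(x)=\frac{\lambda^r}{\Gamma(r)\,x^{r+1}}\,e^{-\lambda/x},\quad x\ge 0.$$
The mean and variance of $X$ are $\mathbb{E}X=\lambda/(r-1)$ for $r>1$ and $\operatorname{Var}X=\lambda^2/((r-1)^2(r-2))$ for $r>2$, respectively. If $X\sim\mathcal{G}amma(r,\lambda)$ then its reciprocal $X^{-1}$ is $\mathcal{IG}(r,\lambda)$ distributed.

2.5.10 Dirichlet Distribution

The Dirichlet distribution is a multivariate version of the beta distribution in the same way the Multinomial distribution is a multivariate extension of the Binomial. A random variable $X=(X_1,\dots,X_k)$ with a Dirichlet distribution ($X\sim\mathcal{D}ir(a_1,\dots,a_k)$) has probability density function
$$f(x_1,\dots,x_k)=\frac{\Gamma(A)}{\prod_{i=1}^k\Gamma(a_i)}\prod_{i=1}^k x_i^{a_i-1},$$
where $A=\sum_i a_i$, and $x=(x_1,\dots,x_k)\ge 0$ is defined on the simplex $x_1+\cdots+x_k=1$. Then
$$\mathbb{E}(X_i)=\frac{a_i}{A},\quad\operatorname{Var}(X_i)=\frac{a_i(A-a_i)}{A^2(A+1)},\quad\text{and}\quad\operatorname{Cov}(X_i,X_j)=-\frac{a_ia_j}{A^2(A+1)}.$$
The Dirichlet random variable can be generated from independent gamma random variables $Y_i\sim\mathcal{G}amma(a_i,b)$, $i=1,\dots,k$, as $X_i=Y_i/S_Y$, where $S_Y=\sum_i Y_i$. Obviously, the marginal distribution of a component $X_i$ is $\mathcal{B}e(a_i,A-a_i)$.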
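The gamma construction of the Dirichlet can be sketched as follows. To stay within base MATLAB, the example assumes integer parameters $a_i$, so that each $\mathcal{G}amma(a_i,1)$ draw can be formed as a sum of $a_i$ independent exponentials; the parameter values, seed and sample size are illustrative assumptions.

```matlab
% Sketch: simulate Dir(a1,...,ak) by normalizing independent gamma variables.
% Integer shapes are assumed so that each Gamma(a_i,1) can be built in base
% MATLAB as a sum of a_i independent exponentials, -log(U).
rng(1);                         % fixed seed, for reproducibility only
a = [2 3 5];                    % illustrative parameters (assumption)
A = sum(a);
N = 20000;                      % number of simulated vectors

Y = zeros(N, numel(a));
for i = 1:numel(a)
    Y(:, i) = sum(-log(rand(N, a(i))), 2);   % Gamma(a_i, 1) draws
end
X = bsxfun(@rdivide, Y, sum(Y, 2));          % each row lies on the simplex

meanX = mean(X);                     % compare with a/A
varX  = var(X);                      % compare with a.*(A-a)/(A^2*(A+1))
covX  = cov(X(:,1), X(:,2));         % off-diagonal: -a(1)*a(2)/(A^2*(A+1))

disp([meanX; a / A]);
disp([varX;  a .* (A - a) / (A^2 * (A + 1))]);
```

For non-integer $a_i$, a gamma generator such as gamrnd from the Statistics Toolbox could replace the sum of exponentials.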
2.5.11 F Distribution

Random variable $X$ has $F$ distribution with $m$ and $n$ degrees of freedom, denoted as $F_{m,n}$, if its density is given by
$$f_X(x)=\frac{m^{m/2}n^{n/2}\,x^{m/2-1}}{B(m/2,n/2)\,(n+mx)^{(m+n)/2}},\quad x>0.$$
The CDF of the $F$ distribution has no closed form, but it can be expressed in terms of an incomplete beta function.

The mean is given by $\mathbb{E}X=n/(n-2)$, $n>2$, and the variance by $\operatorname{Var}X=[2n^2(m+n-2)]/[m(n-2)^2(n-4)]$, $n>4$. If $X\sim\chi_m^2$ and $Y\sim\chi_n^2$ are independent, then $(X/m)/(Y/n)\sim F_{m,n}$. If $X\sim\mathcal{B}e(a,b)$, then $bX/[a(1-X)]\sim F_{2a,2b}$. Also, if $X\sim F_{m,n}$ then $mX/(n+mX)\sim\mathcal{B}e(m/2,n/2)$.

The $F$ distribution is one of the most important distributions for statistical inference; in introductory statistical courses the test of equality of variances and ANOVA are based on the $F$ distribution. For example, if $S_1^2$ and $S_2^2$ are sample variances of two independent normal samples with variances $\sigma_1^2$ and $\sigma_2^2$ and sizes $m$ and $n$ respectively, the ratio $(S_1^2/\sigma_1^2)/(S_2^2/\sigma_2^2)$ is distributed as $F_{m-1,n-1}$.

In MATLAB, the CDF at $x$ for an $F$ distribution with $m,n$ degrees of freedom is calculated as fcdf(x,m,n), and the PDF is computed as fpdf(x,m,n). The $p$th percentile is computed with finv(p,m,n).

2.5.12 Pareto Distribution

The Pareto distribution is named after the Italian economist Vilfredo Pareto. Some examples in which the Pareto distribution provides a good-fitting model include wealth distribution, sizes of human settlements, visits to encyclopedia pages, and file size distribution of internet traffic. Random variable $X$ has a Pareto $\mathcal{P}a(x_0,\alpha)$ distribution with parameters $0<x_0<\infty$ and $\alpha>0$ if its density is given by
$$f_X(x)=\frac{\alpha}{x_0}\left(\frac{x_0}{x}\right)^{\alpha+1},\quad x\ge x_0.$$
The mean and variance of $X$ are $\mathbb{E}X=\alpha x_0/(\alpha-1)$ for $\alpha>1$ and $\operatorname{Var}X=\alpha x_0^2/((\alpha-1)^2(\alpha-2))$ for $\alpha>2$. If $X_1,\dots,X_n\sim\mathcal{P}a(x_0,\alpha)$ are independent, then $Y=2\alpha\sum_i\ln(X_i/x_0)\sim\chi_{2n}^2$.

2.6 MIXTURE DISTRIBUTIONS

Mixture distributions occur when the population consists of heterogeneous subgroups, each of which is represented by a different probability distribution. If the sub-distributions cannot be identified with the observation, the observer is left with an unsorted mixture. For example, a finite mixture of $k$ distributions has probability density function
$$f_X(x)=\sum_{i=1}^k p_if_i(x),$$
where $f_i$ is a density and the weights ($p_i\ge 0$, $i=1,\dots,k$) are such that $\sum_i p_i=1$. Here, $p_i$ can be interpreted as the probability that an observation will be generated from the subpopulation with PDF $f_i$.

In addition to applications where different types of random variables are mixed together in the population, mixture distributions can also be used to characterize extra variability (dispersion) in a population. A more general continuous mixture is defined via a mixing distribution $g(\theta)$, and the corresponding mixture distribution
$$f_X(t)=\int_\theta f(t;\theta)\,g(\theta)\,d\theta.$$
Along with the mixing distribution, $f(t;\theta)$ is called the kernel distribution.

Example 2.1 Suppose an observed count is distributed $\mathcal{B}in(n,p)$, and over-dispersion is modeled by treating $p$ as a mixing parameter. In this case, the binomial distribution is the kernel of the mixture. If we allow $g_P(p)$ to follow a beta distribution with parameters $(a,b)$, then the resulting mixture distribution
$$p_X(x)=\int_0^1 p_{X\mid p}(t;p)\,g_P(p;a,b)\,dp=\binom{n}{x}\frac{B(a+x,\,n+b-x)}{B(a,b)}$$
is the beta-binomial distribution with parameters $(n,a,b)$, and $B$ is the beta function.

Example 2.2 In 1 MB dynamic random access memory (DRAM) chips, the distribution of defect frequency is approximately exponential with $\mu=0.5/\mathrm{cm}^2$. The 16 MB chip defect frequency, on the other hand, is exponential with $\mu=0.1/\mathrm{cm}^2$. If a company produces 20 times as many 1 MB chips as they produce 16 MB chips, the overall defect frequency is a mixture of exponentials:
$$f_X(x)=\frac{1}{21}\,10e^{-10x}+\frac{20}{21}\,2e^{-2x}.$$
In MATLAB, we can produce a graph (see Figure 2.1) of this mixture using the following code:

>> x = 0:0.01:1;
[...]

[Figure 2.1: legend entries "Mixture" and "Exponential E(2)".]

Estimation problems involving mixtures are notoriously difficult, especially if the mixing parameter is unknown. In Section 16.2, the EM Algorithm is used to aid in statistical estimation.

2.7 EXPONENTIAL FAMILY OF DISTRIBUTIONS

We say that $y_i$ is from the exponential family if its distribution is of the form
$$f(y\mid\theta,\phi)=\exp\left\{\frac{y\theta-b(\theta)}{\phi}+c(y,\phi)\right\},$$
for some given functions $b$ and $c$. Parameter $\theta$ is called the canonical parameter, and $\phi$ the dispersion parameter.

Example 2.3 We can write the normal density as
$$\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(y-\mu)^2}{2\sigma^2}\right\}=\exp\left\{\frac{y\mu-\mu^2/2}{\sigma^2}-\frac{1}{2}\left[\frac{y^2}{\sigma^2}+\log(2\pi\sigma^2)\right]\right\},$$
thus it belongs to the exponential family, with $\theta=\mu$, $\phi=\sigma^2$, $b(\theta)=\theta^2/2$ and $c(y,\phi)=-\frac{1}{2}[y^2/\phi+\log(2\pi\phi)]$.

2.8 STOCHASTIC INEQUALITIES

The following four simple inequalities are often used in probability proofs.

1. Markov Inequality. If $X\ge 0$ and $\mu=\mathbb{E}(X)$ is finite, then $P(X>t)\le\mu/t$.

2. Chebyshev's Inequality. If $\mu=\mathbb{E}(X)$ and $\sigma^2=\operatorname{Var}(X)$, then
$$P(|X-\mu|\ge t)\le\frac{\sigma^2}{t^2}.$$

3. Cauchy-Schwarz Inequality. For random variables $X$ and $Y$ with finite variances, $\mathbb{E}|XY|\le\sqrt{\mathbb{E}(X^2)\,\mathbb{E}(Y^2)}$.

4. Jensen's Inequality. Let $h(x)$ be a convex function. Then $h(\mathbb{E}(X))\le\mathbb{E}(h(X))$. For example, $h(x)=x^2$ is a convex function and Jensen's inequality implies $[\mathbb{E}(X)]^2\le\mathbb{E}(X^2)$.

Most comparisons between two populations rely on direct inequalities of specific parameters such as the mean or median. We are more limited if no parameters are specified. If $F_X(x)$ and $G_Y(y)$ represent two distributions (for random variables $X$ and $Y$, respectively), there are several direct inequalities used to describe how one distribution is larger or smaller than another. They are stochastic ordering, failure rate ordering, uniform stochastic ordering and likelihood ratio ordering.

Stochastic Ordering. $X$ is smaller than $Y$ in stochastic order ($X$ [...]

>> x1 = 0:0.02:0.7;
>> r1 = (1-betacdf(x1,2,4))./(1-betacdf(x1,3,6));
>> plot(x1,r1)
>> x2 = 0.08:0.02:.99;
>> r2 = (betapdf(x2,2,4))./(betapdf(x2,3,6));
>> plot(x2,r2)

2.9 CONVERGENCE OF RANDOM VARIABLES

Unlike number sequences for which the convergence has a unique definition, sequences of random variables can converge in many different ways. In statistics, convergence refers to an estimator's tendency to look like what it is estimating as the sample size increases.

For general limits, we will say that $g(n)$ is small "o" of $n$ and write $g_n=o(n)$ if and only if $g_n/n\to 0$ when $n\to\infty$. Then if $g_n=o(1)$, $g_n\to 0$. The "big O" notation concerns equiconvergence. Define $g_n=O(n)$ if there exist constants $0$ [...] $n_0$. By examining how an estimator behaves as the sample size grows to infinity (its asymptotic limit), we gain a valuable insight as to whether estimation for small or medium sized samples makes sense. Four basic measures of convergence are

Convergence in Distribution. A sequence of random variables $X_1,\dots,X_n$ converges in distribution to a random variable $X$ if $P(X_n\le x)\to P(X\le x)$ at every point $x$ where the limiting CDF is continuous. This is also called weak convergence and is written $X_n\Longrightarrow X$ or $X_n\to_d X$.

Convergence in Probability. A sequence of random variables $X_1,\dots,X_n$ converges in probability to a random variable $X$ if, for every $\epsilon>0$, we have $P(|X_n-X|>\epsilon)\to 0$ as $n\to\infty$. This is symbolized as $X_n\xrightarrow{P}X$.

Almost Sure Convergence. A sequence of random variables $X_1,\dots,X_n$ converges almost surely (a.s.) to a random variable $X$ (symbolized $X_n\xrightarrow{a.s.}X$) if $P(\lim_{n\to\infty}|X_n-X|=0)=1$.

Convergence in Mean Square. A sequence of random variables $X_1,\dots,X_n$ converges in mean square to a random variable $X$ if $\mathbb{E}|X_n-X|^2\to 0$. This is also called convergence in $\mathbb{L}_2$ and is written $X_n\xrightarrow{\mathbb{L}_2}X$.

Convergence in distribution, in probability and almost surely can be ordered; i.e.,
$$X_n\xrightarrow{a.s.}X\ \Rightarrow\ X_n\xrightarrow{P}X\ \Rightarrow\ X_n\Longrightarrow X.$$
The $\mathbb{L}_2$-convergence implies convergence in probability and in distribution but
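it is not comparable with the almost sure convergence.

If $h(x)$ is a continuous mapping, then the convergence of $X_n$ to $X$ guarantees the same kind of convergence of $h(X_n)$ to $h(X)$. For example, if $X_n\xrightarrow{a.s.}X$ and $h(x)$ is continuous, then $h(X_n)\xrightarrow{a.s.}h(X)$, which further implies that $h(X_n)\xrightarrow{P}h(X)$ and $h(X_n)\Longrightarrow h(X)$.

Laws of Large Numbers (LLN). For i.i.d. random variables $X_1,X_2,\dots$ with finite expectation $\mathbb{E}X_1=\mu$, the sample mean converges to $\mu$ in the almost-sure sense, that is, $S_n/n\xrightarrow{a.s.}\mu$, for $S_n=X_1+\cdots+X_n$. This is termed the strong law of large numbers (SLLN). Finite variance makes the proof easier, but it is not a necessary condition for the SLLN to hold. If, under more general conditions, $S_n/n=\bar X$ converges to $\mu$ in probability, we say that the weak law of large numbers (WLLN) holds. Laws of large numbers are important in statistics for investigating the consistency of estimators.

Slutsky's Theorem. Let $\{X_n\}$ and $\{Y_n\}$ be two sequences of random variables on some probability space. If $X_n-Y_n\xrightarrow{P}0$, and $Y_n\Longrightarrow X$, then $X_n\Longrightarrow X$.

Corollary to Slutsky's Theorem. In some texts, this is called Slutsky's Theorem. If $X_n\Longrightarrow X$, $Y_n\xrightarrow{P}a$, and $Z_n\xrightarrow{P}b$, then $X_nY_n+Z_n\Longrightarrow aX+b$.

Delta Method. If $\mathbb{E}X_i=\mu$ and $\operatorname{Var}X_i=\sigma^2$, and if $h$ is a differentiable function in the neighborhood of $\mu$ with $h'(\mu)\ne 0$, then $\sqrt{n}\,(h(\bar X_n)-h(\mu))\Longrightarrow W$, where $W\sim\mathcal{N}(0,[h'(\mu)]^2\sigma^2)$.

Central Limit Theorem (CLT). Let $X_1,X_2,\dots$ be i.i.d. random variables with $\mathbb{E}X_1=\mu$ and $\operatorname{Var}X_1=\sigma^2$ [...]

>> S_300 = [];
>> for i = 1:5000
     S_300 = [S_300 sum(poissrnd(0.5,[1,300]))];
   end
>> hist(S_300,30)

The histogram of 5000 realizations of $S_{300}$ is shown in Figure 2.3(b). Notice that the histogram of sums is bell-shaped and normal-like, as predicted by the CLT. It is centered near $300\times 1/2=150$.

A more general central limit theorem can be obtained by relaxing the assumption that the random variables are identically distributed. Let $X_1,X_2,\dots$ be independent random variables with $\mathbb{E}(X_i)=\mu_i$ and $\operatorname{Var}(X_i)=\sigma_i^2<\infty$. Assume that the following limit (called Lindeberg's condition) is satisfied: for $\epsilon>0$,
$$\frac{1}{D_n^2}\sum_{i=1}^n\mathbb{E}\left[(X_i-\mu_i)^2\,\mathbf{1}\{|X_i-\mu_i|\ge\epsilon D_n\}\right]\to 0\quad\text{as } n\to\infty,$$
where
$$D_n^2=\sum_{i=1}^n\sigma_i^2.$$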
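Extended CLT. Let $X_1,X_2,\dots$ be independent (not necessarily identically distributed) random variables with $\mathbb{E}X_i=\mu_i$ and $\operatorname{Var}X_i=\sigma_i^2$ [...]

[...] for every $\epsilon>0$, $P(|\hat\theta_n-\theta|>\epsilon)\to 0$ as $n\to\infty$ (i.e., $\hat\theta_n$ converges to $\theta$ in probability). In compact notation: $\hat\theta_n\xrightarrow{P}\theta$.

Unbiasedness and consistency are desirable qualities in an estimator, but there are other ways to judge an estimate's efficacy. To compare estimators, one might seek the one with smaller mean squared error (MSE), defined as
$$\operatorname{MSE}(\hat\theta_n)=\mathbb{E}(\hat\theta_n-\theta)^2=\operatorname{Var}(\hat\theta_n)+[\operatorname{Bias}(\hat\theta_n)]^2,$$
where $\operatorname{Bias}(\hat\theta_n)=\mathbb{E}(\hat\theta_n-\theta)$. If the bias and variance of the estimator have limit 0 as $n\to\infty$ (or equivalently, $\operatorname{MSE}(\hat\theta_n)\to 0$), the estimator is consistent. An estimator is defined as strongly consistent if, as $n\to\infty$, $\hat\theta_n\xrightarrow{a.s.}\theta$.

Example 3.1 Suppose $X\sim\mathcal{B}in(n,p)$. If $p$ is an unknown parameter, $\hat p=X/n$ is unbiased and strongly consistent for $p$. This is because the SLLN holds for i.i.d. $\mathcal{B}er(p)$ random variables, and $X$ coincides with $S_n$ for the Bernoulli case; see Laws of Large Numbers in Section 2.9.

3.2 EMPIRICAL DISTRIBUTION FUNCTION

Let $X_1,X_2,\dots,X_n$ be a sample from a population with continuous CDF $F$. An empirical (cumulative) distribution function (EDF) based on a random sample is defined as
$$F_n(x)=\frac{1}{n}\sum_{i=1}^n\mathbf{1}(X_i\le x),$$
where $\mathbf{1}(\rho)$ is called the indicator function of $\rho$, and is equal to 1 if the relation $\rho$ is true, and 0 if it is false. In terms of ordered observations $X_{1:n}\le X_{2:n}\le\cdots\le X_{n:n}$, the empirical distribution function can be expressed as
$$F_n(x)=\begin{cases}0 & \text{if } x<X_{1:n},\\ k/n & \text{if } X_{k:n}\le x<X_{k+1:n},\\ 1 & \text{if } x\ge X_{n:n}.\end{cases}$$
[...]

>> y1 = randn(20,1);
>> y2 = randn(200,1);
>> x = -3:0.05:3;
>> y = normcdf(x,0,1);
>> plot(x,y);
>> hold on;
>> plotedf(y1);
>> plotedf(y2);

Fig. 3.1 EDF of normal samples (sizes 20 and 200) plotted along with the true CDF.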
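The plotedf function used above belongs to the book's own toolbox and its source is not shown here. As a hedged alternative, the EDF can be evaluated directly from its definition in base MATLAB; the sketch below does this on a grid and measures the largest gap to the true N(0,1) CDF, written through erf so that no toolbox is required. The seed, sample size and grid are illustrative assumptions.

```matlab
% Sketch: empirical distribution function F_n(x) = (1/n) * #{X_i <= x},
% evaluated on a grid and compared with the true N(0,1) CDF.
rng(2);                                      % fixed seed, reproducibility only
sample = randn(200, 1);                      % N(0,1) sample of size 200
x = -3:0.05:3;                               % evaluation grid

Fn = arrayfun(@(t) mean(sample <= t), x);    % EDF on the grid
F  = 0.5 * (1 + erf(x / sqrt(2)));           % true N(0,1) CDF via erf

maxgap = max(abs(Fn - F));                   % grid approximation of sup|F_n - F|
fprintf('largest gap on the grid: %.4f\n', maxgap);
```

Calling stairs(x,Fn) followed by hold on and plot(x,F) would draw a picture comparable to Fig. 3.1.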
3.2.1 Convergence for EDF

The mean squared error (MSE) is defined for $F_n$ as $\mathbb{E}(F_n(x)-F(x))^2$. Because $F_n(x)$ is unbiased for $F(x)$, the MSE reduces to $\operatorname{Var}F_n(x)=F(x)(1-F(x))/n$, and as $n\to\infty$, $\operatorname{MSE}(F_n(x))\to 0$, so that $F_n(x)\xrightarrow{P}F(x)$.

There are a number of convergence properties for $F_n$ that are of limited use in this book and will not be discussed. However, one fundamental limit theorem in probability theory, the Glivenko-Cantelli Theorem, is worthy of mention.

Theorem 3.1 (Glivenko-Cantelli) If $F_n(x)$ is the empirical distribution function based on an i.i.d. sample $X_1,\dots,X_n$ generated from $F(x)$,
$$\sup_x|F_n(x)-F(x)|\xrightarrow{a.s.}0.$$

3.3 STATISTICAL TESTS

I shall not require of a scientific system that it shall be capable of being singled out, once and for all, in a positive sense; but I shall require that its logical form shall be such that it can be singled out, by means of empirical tests, in a negative sense: it must be possible for an empirical scientific system to be refuted by experience.
Karl Popper, Philosopher (1902-1994)

Uncertainty associated with the estimator is a key focus of statistics, especially tests of hypothesis and confidence intervals. There are a variety of methods to construct tests and confidence intervals from the data, including Bayesian (see Chapter 4) and frequentist methods, which are discussed in Section 3.3.3. Of the two general methods adopted in research today, methods based on the Likelihood Ratio are generally superior to those based on Fisher Information.

In a traditional set-up for testing data, we consider two hypotheses regarding an unknown parameter in the underlying distribution of the data. Experimenters usually plan to show new or alternative results, which are typically conjectured in the alternative hypothesis ($H_1$ or $H_a$). The null hypothesis, designated $H_0$, usually consists of the parts of the parameter space not considered in $H_1$.

When a test is conducted and a claim is made about the hypotheses, two distinct errors are possible:

Type I error. The type I error is the action of rejecting $H_0$ when $H_0$ was actually true. The probability of such an error is usually labeled by $\alpha$, and referred to as the significance level of the test.
Type II error. The type II error is the action of failing to reject $H_0$ when $H_1$ was actually true. The probability of the type II error is denoted by $\beta$. Power is defined as $1-\beta$. In simple terms, the power is the propensity of a test to reject a false null hypothesis.

3.3.1 Test Properties

A test is unbiased if the power is always as high or higher in the region of $H_1$ than anywhere in $H_0$. A test is consistent if, over all of $H_1$, $\beta\to 0$ as the sample size goes to infinity.

Suppose we have a hypothesis test of $H_0:\theta=\theta_0$ versus $H_1:\theta\ne\theta_0$. The Wald test of hypothesis is based on using a normal approximation for the test statistic. If we estimate the variance of the estimator $\hat\theta_n$ by plugging in $\hat\theta_n$ for $\theta$ in the variance term $\sigma^2_{\hat\theta_n}$ (denote this $\hat\sigma^2_{\hat\theta_n}$), we have the z-test statistic
$$z_0=\frac{\hat\theta_n-\theta_0}{\hat\sigma_{\hat\theta_n}}.$$
The critical region (or rejection region) for the test is determined by the quantiles $z_q$ of the normal distribution, where $q$ is set to match the type I error.

p-values: The p-value is a popular but controversial statistic for describing the significance of a hypothesis given the observed data. Technically, it is the probability of observing a result at least as "rejectable" (according to $H_0$) as the observed statistic that actually occurred, but from a new sample. So a p-value of 0.02 means that if $H_0$ is true, we would expect to see results more reflective of that hypothesis 98% of the time in repeated experiments. Note that if the p-value is less than the set $\alpha$ level of significance for the test, the null hypothesis should be rejected (and otherwise should not be rejected).

In the construct of classical hypothesis testing, the p-value has potential to be misleading with large samples. Consider an example in which $H_0:\mu=20.3$ versus $H_1:\mu\ne 20.3$. As far as the experimenter is concerned, the null hypothesis might be conjectured only to three significant digits. But if the sample is large enough, $\bar x=20.30001$ will eventually be rejected as being too far away from $H_0$ (granted, the sample size will have to be awfully large, but you get our point?). This problem will be revisited when we learn about goodness-of-fit tests for distributions.

Binomial Distribution. For binomial data, consider the test of hypothesis
$$H_0:p\le p_0\quad\text{vs.}\quad H_1:p>p_0.$$
If we fix the type I error to $\alpha$, we would have a critical region (or rejection
region) of $\{x:x>x_0\}$, where $x_0$ is chosen so that $\alpha=P(X>x_0\mid p=p_0)$. For instance, if $n=10$, an $\alpha=0.0547$ level test for $H_0:p\le 0.5$ vs $H_1:p>0.5$ is to reject $H_0$ if $X\ge 8$. The test's power is plotted in Figure 3.2 based on the following MATLAB code. The figure illustrates how our chance at rejecting the null hypothesis in favor of the specific alternative $H_1:p=p_1$ increases as $p_1$ increases past 0.5.

>> p1 = 0.5:0.01:0.99;
>> pow = 1-binocdf(7,10,p1);
>> plot(p1,pow)

Fig. 3.2 Graph of statistical test power for binomial test for specific alternative $H_1:p=p_1$. Values of $p_1$ are given on the horizontal axis.

Example 3.2 A semiconductor manufacturer produces an unknown proportion $p$ of defective integrated circuit (IC) chips, so that chip yield is defined as $1-p$. The manufacturer's reliability target is 0.9. With a sample of 100 randomly selected microchips, the Wald test will reject $H_0:p\le 0.10$ in favor of $H_1:p>0.10$ if
$$\frac{\hat p-0.1}{\sqrt{(0.1)(0.9)/100}}>z_{1-\alpha},$$
or for the case $\alpha=0.05$, if the number of defective chips $X>14.935$.

3.3.2 Confidence Intervals

A $1-\alpha$ level confidence interval is a statistic, in the form of a region or interval, that contains an unknown parameter $\theta$ with probability $1-\alpha$. For communicating uncertainty in layman's terms, confidence intervals are typically more suitable than tests of hypothesis, as the uncertainty is illustrated by the length of the interval constructed, along with the adjoining confidence statement.

A two-sided confidence interval has the form $(L(X),U(X))$, where $X$ is the observed outcome, and $P(L(X)\le\theta\le U(X))=1-\alpha$. These are the most commonly used intervals, but there are cases in which one-sided intervals are more appropriate. If one is concerned with how large a parameter might be, we would construct an upper bound $U(X)$ such that $P(\theta\le U(X))=1-\alpha$. If small values of the parameter are of concern to the experimenter, a lower bound $L(X)$ can be used where $P(L(X)\le\theta)=1-\alpha$.

Example 3.3 Binomial Distribution. To construct a two-sided $1-\alpha$ confidence interval for $p$, we solve the equation
$$\sum_{k=0}^{x}\binom{n}{k}p^k(1-p)^{n-k}=\alpha/2$$
for $p$ to obtain the upper $1-\alpha$ limit for $p$, and solve
$$\sum_{k=x}^{n}\binom{n}{k}p^k(1-p)^{n-k}=\alpha/2$$
to obtain the lower limit. One-sided $1-\alpha$ intervals can be constructed by solving just one of the equations using $\alpha$ in place of $\alpha/2$. Use MATLAB functions binup(n,x,α) and binlow(n,x,α). This is named the Clopper-Pearson interval (Clopper and Pearson, 1934), where Pearson refers to Egon Pearson, Karl Pearson's son.

This exact interval is typically conservative, but not conservative like a G.O.P. senator from Mississippi. In this case, conservative means the coverage probability of the confidence interval is at least as high as the nominal coverage probability $1-\alpha$, and can be much higher. In general, "conservative" is synonymous with risk averse. The nominal and actual coverage probabilities disagree frequently with discrete data, where an interval with the exact coverage probability of $1-\alpha$ may not exist. While the guaranteed confidence in a conservative interval is reassuring, it is potentially inefficient and misleading.

Example 3.4 If $n=10$, $x=3$, then $\hat p=0.3$ and a 95% (two-sided) confidence interval for $p$ is computed by finding the upper limit $p_1$ for which $F_X(3;p_1)=0.025$ and lower limit $p_2$ for which $1-F_X(2;p_2)=0.025$, where $F_X$ is the CDF for the binomial distribution with $n=10$. The resulting interval, $(0.06774,\,0.65245)$, is not symmetric in $p$.

Intervals Based on Normal Approximation. The interval in Example 3.4 is "exact", in contrast to more commonly used intervals based on a normal approximation. Recall that $\bar x\pm z_{\alpha/2}\,\sigma/\sqrt{n}$ serves as a $1-\alpha$ level confidence interval for $\mu$ with data generated from a normal distribution. Here $z_\alpha$ represents the $\alpha$ quantile of the standard normal distribution. With the normal approximation (see Central Limit Theorem in Chapter 2), $\hat p$ has an approximate normal distribution if $n$ is large, so if we estimate $\sigma_{\hat p}^2=p(1-p)/n$ with $\hat\sigma_{\hat p}^2=\hat p(1-\hat p)/n$, an approximate $1-\alpha$ interval for $p$ is
$$\hat p\pm z_{\alpha/2}\sqrt{x(n-x)/n^3}.$$
This is called the Wald interval because it is based on inverting the (Wald) z-test statistic for $H_0:p=p_0$ versus $H_1:p\ne p_0$. Agresti (1998) points out that both the exact and Wald intervals perform poorly compared to the score interval, which is based on the Wald z-test of hypothesis, but instead of using $\hat p$ in the error term, it uses the value $p_0$ for which $(\hat p-p_0)/\sqrt{p_0(1-p_0)/n}=\pm z_{\alpha/2}$. The solution, first stated by Wilson (1927), is the interval
$$\frac{\hat p+\dfrac{z_{\alpha/2}^2}{2n}\pm z_{\alpha/2}\sqrt{\dfrac{\hat p(1-\hat p)}{n}+\dfrac{z_{\alpha/2}^2}{4n^2}}}{1+z_{\alpha/2}^2/n}.$$
This actually serves as an example of shrinkage, which is a statistical phenomenon where better estimators are sometimes produced by "shrinking" or adjusting treatment means toward an overall (sample) mean. In this case, one can show that the middle of the confidence interval shrinks a little from $\hat p$ toward 1/2, although the shrinking becomes negligible as $n$ gets larger. Use MATLAB function binomial_shrink_ci(n,x,alpha) to generate a two-sided Wilson's confidence interval.

Example 3.5 In the previous example, with $n=10$ and $x=3$, the exact 2-sided 95% confidence interval $(0.06774,\,0.65245)$ has length 0.5847, so the inference is rather vague. Using the normal approximation, the interval computes to $(0.0616,\,0.5384)$ and has length 0.4768. The shrinkage interval is $(0.1078,\,0.6032)$ and has length 0.4954. Is this accurate? In general, the exact interval will have coverage probability exceeding $1-\alpha$, and the Wald interval sometimes has coverage probability below $1-\alpha$. Overall, the shrinkage interval has coverage probability closer to $1-\alpha$. In the case of the binomial, the word "exact" does not imply a confidence interval is better.

>> x = 0:10;
>> y = binopdf(x,10,0.3);
>> bar(x,y)
>> barh([1 2 3],[0.067 0.652; 0.061,0.538; 0.213 0.405],'stacked')

Fig. 3.3 (a) The binomial $\mathcal{B}in(10,0.3)$ PMF; (b) 95% confidence intervals based on exact, Wald and Wilson method.

3.3.3 Likelihood

Sir Ronald Fisher, perhaps the greatest innovator of statistical methodology, developed the concepts of likelihood and sufficiency for statistical inference. With a set of random variables $X_1,\dots,X_n$, suppose the joint distribution is a function of an unknown parameter $\theta$: $f_n(x_1,\dots,x_n;\theta)$. The likelihood function pertaining to the observed data, $L(\theta)=f_n(x_1,\dots,x_n;\theta)$, is associated with the probability of observing the data at each possible value $\theta$ of an unknown parameter. In the case the sample consists of i.i.d. measurements with density function $f(x;\theta)$, the likelihood simplifies to
$$L(\theta)=\prod_{i=1}^n f(x_i;\theta).$$
The likelihood function has the same numerical value as the PDF of a random variable, but it is regarded as a function of the parameters $\theta$, and treats the data as fixed. The PDF, on the other hand, treats the parameters as fixed and is a function of the data points.

The likelihood principle states that after $x$ is observed, all relevant experimental information is contained in the likelihood function for the observed $x$, and that $\theta_1$ supports the data more than $\theta_2$ if $L(\theta_1)\ge L(\theta_2)$. The maximum likelihood estimate (MLE) of $\theta$ is that value of $\theta$ in the parameter space that maximizes $L(\theta)$. Although the MLE is based strongly on the parametric assumptions of the underlying density function $f(x;\theta)$, there is a sensible nonparametric version of the likelihood introduced in Chapter 10.

MLEs are known to have optimal performance if the sample size is sufficient and the densities are "regular"; for one, the support of $f(x;\theta)$ should not depend on $\theta$. For example, if $\hat\theta$ is the MLE, then
$$\sqrt{n}\,(\hat\theta-\theta)\Longrightarrow\mathcal{N}(0,\,i^{-1}(\theta)),$$
where $i(\theta)=\mathbb{E}([\partial\log f/\partial\theta]^2)$ is the Fisher Information of $\theta$. The regularity conditions also demand that $i(\theta)\ge 0$ is bounded and $\int f(x;\theta)\,dx$ is thrice differentiable. For a comprehensive discussion about regularity conditions for maximum likelihood, see Lehmann and Casella (1998).

The optimality of the MLE is guaranteed by the following result:

Cramer-Rao Lower Bound. From an i.i.d. sample $X_1,\dots,X_n$ where $X_i$ has density function $f_X(x)$, let $\hat\theta_n$ be an unbiased estimator for $\theta$. Then
$$\operatorname{Var}(\hat\theta_n)\ge(i(\theta)\,n)^{-1}.$$

Delta Method for MLE. The invariance property of MLEs states that if $g$ is a one-to-one function of the parameter $\theta$, then the MLE of $g(\theta)$ is $g(\hat\theta)$. Assuming the first derivative of $g$ (denoted $g'$) exists, then
$$\sqrt{n}\,(g(\hat\theta)-g(\theta))\Longrightarrow\mathcal{N}\left(0,\,g'(\theta)^2/i(\theta)\right).$$

Example 3.6 After waiting for the $k$th success in a repeated process with constant probabilities of success and failure, we recognize the probability distribution of $X$ = no. of failures is negative binomial. To estimate the unknown success probability $p$, we can maximize
$$L(p)=p_X(x;p)\propto p^k(1-p)^x,\quad 0$$
[...]

>> dat = [2.9441,-13.3618,7.1432,16.2356,-6.9178,8.5800,...
          12.5400,-15.9373,-14.4096,5.7115];
>> [m,v] = BA_nornor2(dat,100,20,20)
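BA_nornor2 is one of the book's own functions and its source is not reproduced here. The sketch below shows the standard conjugate normal update that such a call presumably performs; it assumes the arguments are the data, the known likelihood variance (100), the prior mean (20) and the prior variance (20), as suggested by the caption of Fig. 4.2. The variable names are hypothetical.

```matlab
% Sketch: conjugate update for a normal mean with known variance.
% Assumed model: X_i | theta ~ N(theta, sigma2), prior theta ~ N(mu0, tau2).
dat = [2.9441, -13.3618, 7.1432, 16.2356, -6.9178, 8.5800, ...
       12.5400, -15.9373, -14.4096, 5.7115];
sigma2 = 100;    % likelihood variance (assumed second argument)
mu0    = 20;     % prior mean (assumed third argument)
tau2   = 20;     % prior variance (assumed fourth argument)

n    = numel(dat);
xbar = mean(dat);

% Precisions add; the posterior mean is a precision-weighted average
% of the sample mean and the prior mean.
postVar  = 1 / (n / sigma2 + 1 / tau2);
postMean = postVar * (n * xbar / sigma2 + mu0 / tau2);

fprintf('posterior mean = %.4f, posterior variance = %.4f\n', ...
        postMean, postVar);
```

With ten observations of variance 100, the sample mean carries precision 0.1 against the prior's 0.05, so the posterior mean falls between the sample mean and the prior mean, nearer to the sample mean.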
Fig. 4.2 The normal $N(\theta, 100)$ likelihood, $N(20, 20)$ prior, and posterior for data {2.9441, -13.3618, ..., 5.7115}.

4.2.1 Quantifying Expert Opinion

Bayesian statistics has become increasingly popular in engineering, and one reason for its increased application is that it allows researchers to input expert opinion as a catalyst in the analysis (through the prior distribution). Expert opinion might consist of subjective inputs from experienced engineers, or perhaps a summary judgment of past research that yielded similar results.

Example 4.2 Prior Elicitation for Reliability Tests. Suppose each of n independent reliability tests of a machine reveals either a successful or unsuccessful outcome. If $\theta$ represents the reliability of the machine, let X be the number of successful missions the machine experienced in n independent trials. X is distributed binomial with parameters n (known) and $\theta$ (unknown). We probably won't expect an expert to quantify their uncertainty about $\theta$ directly into a prior distribution $\pi(\theta)$. Perhaps the researcher can elicit information such as the expected value and standard deviation of $\theta$. If we suppose the prior distribution for $\theta$ is $Be(\alpha, \beta)$, where the hyper-parameters $\alpha$ and $\beta$ are known, then

$\pi(\theta) = \frac{1}{B(\alpha,\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}, \quad 0 \le \theta \le 1.$
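With $X|\theta \sim Bin(n, \theta)$, the joint, marginal, and posterior distributions are

$h(x,\theta) = \frac{\binom{n}{x}}{B(\alpha,\beta)}\theta^{\alpha+x-1}(1-\theta)^{n-x+\beta-1}, \quad 0 \le \theta \le 1,\ x = 0,1,\dots,n,$

$m(x) = \frac{\binom{n}{x}B(x+\alpha, n-x+\beta)}{B(\alpha,\beta)}, \quad x = 0,1,\dots,n,$

$\pi(\theta|x) = \frac{1}{B(x+\alpha, n-x+\beta)}\theta^{\alpha+x-1}(1-\theta)^{n-x+\beta-1}, \quad 0 \le \theta \le 1.$

It is easy to see that the posterior distribution is $Be(\alpha+x, n-x+\beta)$.

Suppose the experts suggest that the previous version of this machine was "reliable 93% of the time, plus or minus 2%". We might take $E(\theta) = 0.93$ and insinuate that $\sigma_\theta = 0.04$ (or $Var(\theta) = 0.0016$), using two-sigma rule as an argument. From the beta distribution, we can actually solve for $\alpha$ and $\beta$ as a function of the expected value $\mu$ and variance $\sigma^2$, as in (2.2),

$\alpha = \mu(\mu - \mu^2 - \sigma^2)/\sigma^2, \quad \text{and} \quad \beta = (1-\mu)(\mu - \mu^2 - \sigma^2)/\sigma^2.$

In this example, $(\mu, \sigma^2) = (0.93, 0.0016)$ leads to $\alpha = 36.91$ and $\beta = 2.78$. To update on the data X, we will use a Be(36.91, 2.78) distribution for a prior on $\theta$. Consider the weight given to the expert in this example. If we observe one test only and the machine happened to fail, our posterior distribution is then Be(36.91, 3.78), which has a mean equal to 0.9071. The MLE for the average reliability is obviously zero, but with such precise information elicited from the expert, the posterior is close to the prior. In some cases when you do not trust your expert, this might be unsettling and less informative priors may be a better choice.

4.2.2 Point Estimation

The posterior is the ultimate experimental summary for a Bayesian. The location measures (especially the mean) of the posterior are of great importance. The posterior mean represents the most frequently used Bayes estimator for the parameter. The posterior mode and median are less commonly used alternative Bayes estimators.

An objective way to choose an estimator from the posterior is through a penalty or loss function $L(\hat\theta, \theta)$ that describes how we penalize the discrepancy of the estimator $\hat\theta$ from the parameter $\theta$. Because the parameter is viewed as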
a random variable, we seek to minimize expected loss, or posterior risk:

$R(\hat\theta, x) = \int L(\hat\theta, \theta)\,\pi(\theta|x)\,d\theta.$

For example, the estimator based on the common squared-error loss $L(\hat\theta, \theta) = (\hat\theta - \theta)^2$ minimizes $E((\hat\theta - \theta)^2)$, where expectation is taken over the posterior distribution $\pi(\theta|X)$. It's easy to show that the estimator turns out to be the posterior expectation. Similar to squared-error loss, if we use absolute-error loss $L(\hat\theta, \theta) = |\hat\theta - \theta|$, the Bayes estimator is the posterior median.

The posterior mode maximizes the posterior density the same way the MLE maximizes the likelihood. The generalized MLE maximizes $\pi(\theta|X)$. Bayesians prefer the name MAP (maximum a posteriori) estimator or simply posterior mode. The MAP estimator is popular in Bayesian analysis in part because it is often computationally less demanding than the posterior mean or median. The reason is simple: to find the maximum, the posterior need not be fully specified because $\arg\max_\theta \pi(\theta|x) = \arg\max_\theta f(x|\theta)\pi(\theta)$, that is, one simply maximizes the product of likelihood and the prior.

In general, the posterior mean will fall between the MLE and the prior mean. This was demonstrated in Example 4.1. As another example, suppose we flipped a coin four times and tails showed up on all 4 occasions. We are interested in estimating probability of heads, $\theta$, in a Bayesian fashion. If the prior is U(0, 1), the posterior is proportional to $\theta^0(1-\theta)^4$, which is beta Be(1, 5). The posterior mean shrank the MLE toward the expected value of the prior (1/2) to get $\hat\theta_B = 1/(1+5) = 1/6$, which is a more reasonable estimator of $\theta$ than the MLE.

Example 4.3 Binomial-Beta Conjugate Pair. Suppose $X|\theta \sim Bin(n, \theta)$. If the prior distribution for $\theta$ is $Be(\alpha, \beta)$, the posterior distribution is $Be(\alpha+x, n-x+\beta)$. Under squared error loss $L(\hat\theta, \theta) = (\hat\theta - \theta)^2$, the Bayes estimator of $\theta$ is the expected value of the posterior

$\hat\theta_B = \frac{\alpha + x}{(\alpha + x) + (\beta + n - x)} = \frac{\alpha + x}{\alpha + \beta + n}.$

This is actually a weighted average of the MLE, $X/n$, and the prior mean $\alpha/(\alpha+\beta)$. Notice that, as n becomes large, the posterior mean is getting close to the MLE, because the weight $n/(n+\alpha+\beta)$ tends to 1. On the other hand, when $\alpha$ is large, the posterior mean is close to the prior mean. Large $\alpha$ indicates small prior variance (for fixed $\beta$, the variance of $Be(\alpha, \beta)$ behaves as $O(1/\alpha^2)$) and the prior is concentrated about its mean. Recall Example 4.2: after one machine trial failure the posterior distribution mean changed from 0.93 (the prior mean) to 0.9071, shrinking only slightly toward the MLE (which is zero).

Example 4.4 Jeremy's IQ. Jeremy, an enthusiastic Georgia Tech student, spoke in class and posed a statistical model for his scores on standard IQ tests. He thinks that, in general, his scores are normally distributed with unknown mean $\theta$ and the variance of 80. Prior (and expert) opinion is that the IQ of Georgia Tech students, $\theta$, is a normal random variable, with mean 110 and the variance 120. Jeremy took the test and scored 98. The traditional estimator of $\theta$ would be $\hat\theta = X = 98$. The posterior is N(102.8, 48), so the Bayes estimator of Jeremy's IQ score is $\hat\theta_B = 102.8$.

Example 4.5 Poisson-Gamma Conjugate Pair. Let $X_1,\dots,X_n$, given $\theta$, be Poisson $P(\theta)$ with probability mass function

$f(x_i|\theta) = \frac{\theta^{x_i}}{x_i!}e^{-\theta},$

and let the prior $\theta \sim G(\alpha, \beta)$ be given by $\pi(\theta) \propto \theta^{\alpha-1}e^{-\beta\theta}$. Then,

$\pi(\theta|X_1,\dots,X_n) = \pi(\theta|\sum X_i) \propto \theta^{\sum X_i + \alpha - 1}e^{-(n+\beta)\theta},$

which is $G(\sum_i X_i + \alpha, n + \beta)$. The mean is $E(\theta|X) = (\sum X_i + \alpha)/(n + \beta)$, and it can be represented as a weighted average of the MLE and the prior mean:

$E(\theta|X) = \frac{n}{n+\beta}\cdot\frac{\sum X_i}{n} + \frac{\beta}{n+\beta}\cdot\frac{\alpha}{\beta}.$

4.2.3 Conjugate Priors

We have seen two convenient examples for which the posterior distribution remained in the same family as the prior distribution. In such a case, the effect of likelihood is only to "update" the prior parameters and not to change the prior's functional form. We say that such priors are conjugate with the likelihood. Conjugacy is popular because of its mathematical convenience: once the conjugate pair likelihood/prior is found, the posterior is calculated with relative ease. In the years BC$^1$ and pre-MCMC era (see Chapter 18), conjugate priors have been extensively used (and overused and misused) precisely because of this computational convenience. Nowadays, the general agreement is that simple conjugate analysis is of limited practical value because, given the likelihood, the conjugate prior has limited modeling capability.

There are many univariate and multivariate instances of conjugacy. The following table provides several cases. For practice you may want to work out the posteriors in the table.

$^1$For some, the BC era signifies Before Christ, rather than Before Computers.
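The beta-binomial update is short enough to check numerically. The sketch below, in Python, reproduces the elicitation of Example 4.2 and the weighted-average form of the posterior mean from Example 4.3. The function names are ours and purely illustrative; only the formulas come from the text.

```python
def beta_from_moments(mu, var):
    """Hyper-parameters (alpha, beta) of a beta prior with mean mu and variance var."""
    common = (mu - mu**2 - var) / var
    return mu * common, (1.0 - mu) * common


def beta_binomial_update(alpha, beta, x, n):
    """Posterior Be(alpha + x, beta + n - x) after x successes in n trials."""
    return alpha + x, beta + n - x


# Example 4.2: expert mean 0.93 and variance 0.0016, then one failed test.
a0, b0 = beta_from_moments(0.93, 0.0016)
n, x = 1, 0
a1, b1 = beta_binomial_update(a0, b0, x, n)
post_mean = a1 / (a1 + b1)

# The same mean as a weighted average of the MLE x/n and the prior mean.
w = n / (n + a0 + b0)
weighted = w * (x / n) + (1.0 - w) * a0 / (a0 + b0)

print(round(a0, 2), round(b0, 2), round(post_mean, 4), round(weighted, 4))
```

With a single observation the weight on the MLE is only about 1/40, which is why the posterior mean barely moves from the prior mean.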
Table 4.2 Some conjugate pairs. Here X stands for a sample of size n, $X_1,\dots,X_n$.

Likelihood    Prior    Posterior
[...]

4.2.4 Interval Estimation: Credible Sets

Bayesians call interval estimators of model parameters credible sets. Naturally, the measure used to assess the credibility of an interval estimator is the posterior distribution. Students learning concepts of classical confidence intervals (CIs) often err by stating that "the probability that the CI interval [L, U] contains parameter $\theta$ is $1-\alpha$". The correct statement seems more convoluted: one needs to generate data from the underlying model many times and for each generated data set to calculate the CI. The proportion of CIs covering the unknown parameter "tends to" $1-\alpha$. The Bayesian interpretation of a credible set C is arguably more natural: The probability of a parameter belonging to the set C is $1-\alpha$. A formal definition follows.

Assume the set C is a subset of $\Theta$. Then, C is a credible set with credibility $(1-\alpha)100\%$ if

$P(\theta \in C|X) = E(I(\theta \in C)|X) = \int_C \pi(\theta|x)\,d\theta \ge 1-\alpha.$

If the posterior is discrete, then the integral is a sum (using the counting measure) and

$P(\theta \in C|X) = \sum_{\theta_i \in C} \pi(\theta_i|x) \ge 1-\alpha.$

This is the definition of a $(1-\alpha)100\%$ credible set, and for any given posterior distribution such a set is not unique.
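To see the non-uniqueness concretely, consider the Be(1, 5) posterior from the coin-flipping example of Section 4.2.2, whose CDF has the closed form $1-(1-\theta)^5$. The Python sketch below builds two different 95% credible sets for it: an equal-tail interval and a one-sided interval starting at zero. The helper names are ours; this is an illustration, not a general-purpose routine.

```python
def be15_cdf(t):
    """CDF of the Be(1, 5) posterior: 1 - (1 - t)^5 on [0, 1]."""
    return 1.0 - (1.0 - t) ** 5


def be15_quantile(u):
    """Inverse of be15_cdf."""
    return 1.0 - (1.0 - u) ** (1.0 / 5.0)


alpha = 0.05
equal_tail = (be15_quantile(alpha / 2), be15_quantile(1 - alpha / 2))
one_sided = (0.0, be15_quantile(1 - alpha))

for lo, hi in (equal_tail, one_sided):
    prob = be15_cdf(hi) - be15_cdf(lo)
    print(f"[{lo:.4f}, {hi:.4f}]  probability {prob:.3f}  length {hi - lo:.4f}")
```

Both sets carry posterior probability 0.95, but the one-sided set is shorter because the Be(1, 5) density is largest near zero.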
For a given credibility level $(1-\alpha)100\%$, the shortest credible set has obvious appeal. To minimize size, the sets should correspond to highest posterior probability density areas (HPDs).

Definition 4.1 The $(1-\alpha)100\%$ HPD credible set for parameter $\theta$ is a set C, subset of $\Theta$, of the form

$C = \{\theta \in \Theta \mid \pi(\theta|x) \ge k(\alpha)\},$

where $k(\alpha)$ is the largest constant for which $P(\theta \in C|X) \ge 1-\alpha$.

Geometrically, if the posterior density is cut by a horizontal line at the height $k(\alpha)$, the set C is the projection on the $\theta$ axis of the part of the line that lies below the density.

Example 4.6 Jeremy's IQ, Continued. Recall Jeremy, the enthusiastic Georgia Tech student from Example 4.4, who used Bayesian inference in modeling his IQ test scores. For a score $X|\theta$ he was using a $N(\theta, 80)$ likelihood, while the prior on $\theta$ was N(110, 120). After the score of X = 98 was recorded, the resulting posterior was normal N(102.8, 48).

Here, the MLE is $\hat\theta = 98$, and a 95% confidence interval is $[98 - 1.96\sqrt{80},\ 98 + 1.96\sqrt{80}] = [80.4692, 115.5308]$. The length of this interval is approximately 35. The Bayesian counterparts are $\hat\theta = 102.8$, and $[102.8 - 1.96\sqrt{48},\ 102.8 + 1.96\sqrt{48}] = [89.2207, 116.3793]$. The length of the 95% credible set is approximately 27. The Bayesian interval is shorter because the posterior variance is smaller than the likelihood variance; this is a consequence of the incorporation of information. The construction of the credible set is illustrated in Figure 4.3.

4.2.5 Bayesian Testing

Bayesian tests amount to comparison of posterior probabilities of the parameter regions defined by the two hypotheses.

Assume that $\Theta_0$ and $\Theta_1$ are two non-overlapping subsets of the parameter space $\Theta$. We assume that $\Theta_0$ and $\Theta_1$ partition $\Theta$, that is, $\Theta_1 = \Theta_0^c$, although cases in which $\Theta_1 \ne \Theta_0^c$ are easily formulated. Let $\theta \in \Theta_0$ signify the null hypothesis $H_0$ and let $\theta \in \Theta_1 = \Theta_0^c$ signify the alternative hypothesis $H_1$:

$H_0: \theta \in \Theta_0 \qquad H_1: \theta \in \Theta_1.$

Given the information from the posterior, the hypothesis with higher posterior probability is selected.
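The selection rule can be sketched in a few lines of Python for the case of a normal posterior and a one-sided pair of hypotheses. The example uses the N(102.8, 48) posterior from Example 4.4 and the threshold 105 that appears in Example 4.7. The function name is ours, and the normal-posterior assumption is specific to this illustration.

```python
from statistics import NormalDist


def posterior_probabilities(post_mean, post_var, threshold):
    """Posterior probabilities of H0: theta >= threshold and H1: theta < threshold
    when the posterior of theta is N(post_mean, post_var)."""
    posterior = NormalDist(mu=post_mean, sigma=post_var ** 0.5)
    p1 = posterior.cdf(threshold)
    return 1.0 - p1, p1


p0, p1 = posterior_probabilities(102.8, 48, 105)
selected = "H0" if p0 > p1 else "H1"
print(round(p0, 4), round(p1, 4), selected)
```

Because the two regions partition the parameter space, the two probabilities sum to one and the rule reduces to checking whether the posterior probability of the null region exceeds 1/2.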
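Fig. 4.3 Bayesian credible set based on N(102.8, 48) density.

Example 4.7 We return again to Jeremy (Examples 4.4 and 4.6) and consider the posterior for the parameter $\theta$, N(102.8, 48). Jeremy claims he had a bad day and his genuine IQ is at least 105. After all, he is at Georgia Tech! The posterior probability of $\theta \ge 105$ is

$p_0 = P^{\theta|X}(\theta \ge 105) = P\left(Z \ge \frac{105 - 102.8}{\sqrt{48}}\right) = 1 - \Phi(0.3175) = 0.3754,$

less than 38%, so his claim is rejected. Posterior odds in favor of $H_0$ are 0.3754/(1 - 0.3754) = 0.6010, less than 1.

We can represent the prior and posterior odds in favor of the hypothesis $H_0$, respectively, as

$\frac{\pi_0}{\pi_1} = \frac{P^{\theta}(\theta \in \Theta_0)}{P^{\theta}(\theta \in \Theta_1)} \quad \text{and} \quad \frac{p_0}{p_1} = \frac{P^{\theta|X}(\theta \in \Theta_0)}{P^{\theta|X}(\theta \in \Theta_1)}.$

The Bayes factor in favor of $H_0$ is the ratio of corresponding posterior to prior odds,

$B^{\pi}_{01}(x) = \frac{p_0/p_1}{\pi_0/\pi_1}.$

When the hypotheses are simple (i.e., $H_0: \theta = \theta_0$ vs. $H_1: \theta = \theta_1$), and the prior is just the two point distribution $\pi(\theta_0) = \pi_0$ and $\pi(\theta_1) = \pi_1 = 1 - \pi_0$, then the Bayes factor in favor of $H_0$ becomes the likelihood ratio:

$B^{\pi}_{01}(x) = \frac{f(x|\theta_0)}{f(x|\theta_1)}.$

Table 4.3 Treatment of $H_0$ According to the Value of log-Bayes Factor

$0 \le \log B_{10}(x) \le 0.5$      evidence against $H_0$ is poor
$0.5 \le \log B_{10}(x) \le 1$      evidence against $H_0$ is substantial
$1 \le \log B_{10}(x) \le 2$        evidence against $H_0$ is strong
$\log B_{10}(x) > 2$                evidence against $H_0$ is decisive

If the prior is a mixture of two priors, [...]

The data are

list(p.dice=c(0.1666666, 0.1666666, 0.1666667, 0.1666667, 0.1666667, 0.1666667))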
and the initial values are generated. After five million rolls, WinBUGS outputs is1111 = 0.0016 and is1112 = 0.0015, so the sum of 1111 is advantageous to the sum of 1112.

Example 4.10 Jeremy in WinBUGS. We will calculate a Bayes estimator for Jeremy's true IQ using BUGS. Recall, the model in Example 4.4 was $X \sim N(\theta, 80)$ and $\theta \sim N(110, 120)$. In WinBUGS we will use the precision parameters 1/120 = 0.00833 and 1/80 = 0.0125.

# Jeremy in WinBUGS
model{
  x ~ dnorm(theta, tau)
  theta ~ dnorm(110, 0.008333333)
}
# data
list(tau=0.0125, x=98)
# inits
list(theta=100)

Below is the summary of MCMC output.

node    mean    sd      MC error   2.5%    median   97.5%
theta   102.8   6.917   0.0214     89.17   102.8    116.3

Because this is a conjugate normal/normal model, the exact posterior distribution, N(102.8, 48), was easy to find (see Example 4.4). Note that in simulations, the MCMC approximation, when rounded, coincides with the exact posterior mean. The MCMC variance of $\theta$ is $6.917^2 = 47.84489$, close to the exact posterior variance of 48.

4.4 EXERCISES

4.1. A lifetime X (in years) of a particular machine is modeled by an exponential distribution with unknown failure rate parameter $\theta$. The lifetimes of $X_1 = 5$, $X_2 = 6$, and $X_3 = 4$ are observed, and assume that an expert believes that $\theta$ should have exponential distribution as well and that, on average, $\theta$ should be 1/3.
(i) Write down the MLE of $\theta$ for those observations.
(ii) Elicit a prior according to the expert's beliefs.
(iii) For the prior in (ii), find the posterior. Is the problem conjugate?
(iv) Find the Bayes estimator $\hat\theta_{Bayes}$, and compare it with the MLE estimator from (i). Discuss.
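4.2. Suppose $X = (X_1,\dots,X_n)$ is a sample from $U(0, \theta)$. Let $\theta$ have Pareto $Pa(\theta_0, \alpha)$ distribution. Show that the posterior distribution is $Pa(\max\{\theta_0, x_1,\dots,x_n\}, \alpha + n)$.

4.3. Let $X \sim G(n/2, 2\theta)$, so that $X/\theta$ is $\chi^2_n$. Let $\theta \sim IG(\alpha, \beta)$. Show that the posterior is $IG(n/2 + \alpha, (x/2 + \beta^{-1})^{-1})$.

4.4. If $X = (X_1,\dots,X_n)$ is a sample from $NB(m, \theta)$ and $\theta \sim Be(\alpha, \beta)$, show that the posterior for $\theta$ is beta $Be(\alpha + mn, \beta + \sum_{i=1}^n x_i)$.

4.5. In Example 4.5 on p. 54, show that the marginal distribution is negative binomial.

4.6. What is the Bayes factor $B^{\pi}_{01}$ in Jeremy's case (Example 4.7)? Test $H_0$ using the Bayes factor and wording from Table 4.3. Argue that the evidence against $H_0$ is poor.

4.7. Assume $X|\theta \sim N(\theta, \sigma^2)$ and $\theta \sim \pi(\theta) = 1$. Consider testing $H_0: \theta \le \theta_0$ vs. $H_1: \theta > \theta_0$. Show that $p_0 = P^{\theta|X}(\theta \le \theta_0)$ is equal to the classical p-value.

4.8. Show that the Bayes factor is $B^{\pi}_{01}(x) = f(x|\theta_0)/m_1(x)$.

[...] where $r(x) = \sup_{\theta \ne \theta_0} f(x|\theta)$. Usually, $r(x) = f(x|\hat\theta_{MLE})$, where $\hat\theta_{MLE}$ is the MLE estimator of $\theta$. The Bayes factor $B^{\pi}_{01}(x)$ is bounded from below: [...]

4.10. Suppose X = -2 was observed from the population distributed as $N(0, 1/\theta)$ and one wishes to estimate the parameter $\theta$. (Here $\theta$ is the reciprocal of variance $\sigma^2$ and is called the precision parameter. The precision parameter is used in WinBUGS to parameterize the normal distribution.) A classical estimator of $\theta$ (e.g., the MLE) does exist, but one may be disturbed to estimate $1/\sigma^2$ based on a single observation. Suppose the analyst believes that the prior on $\theta$ is Gamma(1/2, 3).
(i) What is the MLE of $\theta$?
(ii) Find the posterior distribution and the Bayes estimator of $\theta$. If the prior on $\theta$ is $Gamma(\alpha, \beta)$, represent the Bayes estimator as a weighted average (sum of weights = 1) of the prior mean and the MLE.
(iii) Find a 95% HPD credible set for $\theta$.
(iv) Test the hypothesis $H_0: \theta \le 1/4$ versus $H_1: \theta > 1/4$.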
4.11. The Lindley (1957) Paradox. Suppose $\bar{y}|\theta \sim N(\theta, 1/n)$. We wish to test $H_0: \theta = 0$ versus the two sided alternative. Suppose a Bayesian puts the prior $P(\theta = 0) = P(\theta \ne 0) = 1/2$, and in the case of the alternative, the 1/2 is uniformly spread over the interval $[-M/2, M/2]$. Suppose n = 40,000 and $\bar{y} = 0.01$ are observed, so $\sqrt{n}\,\bar{y} = 2$. The classical statistician rejects $H_0$ at level $\alpha = 0.05$. Show that posterior odds in favor of $H_0$ are 11 if M = 1, indicating that a Bayesian statistician strongly favors $H_0$, according to Table 4.3.

4.12. This exercise concerning Bayesian binary regression with a probit model using WinBUGS is borrowed from David Madigan's Bayesian Course Site. Finney (1947) describes a binary regression problem with data of size n = 39, two continuous predictors x1 and x2, and a binary response y. Here are the data in BUGS-ready format:

list(n=39,
x1=c(3.7,3.5,1.25,0.75,0.8,0.7,0.6,1.1,0.9,0.9,0.8,0.55,0.6,1.4,0.75,2.3,3.2,0.85,1.7,1.8,0.4,0.95,1.35,1.5,1.6,0.6,1.8,0.95,1.9,1.6,2.7,2.35,1.1,1.1,1.2,0.8,0.95,0.75,1.3),
x2=c(0.825,1.09,2.5,1.5,3.2,3.5,0.75,1.7,0.75,0.45,0.57,2.75,3.0,2.33,3.75,1.64,1.6,1.415,1.06,1.8,2.0,1.36,1.35,1.36,1.78,1.5,1.5,1.9,0.95,0.4,0.75,0.03,1.83,2.2,2.0,3.33,1.9,1.9,1.625),
y=c(1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,0,1,0,0,0,0,1,0,1,0,1,0,1,0,0,1,1,1,0,0,1))

The objective is to build a predictive model that predicts y from x1 and x2. The proposed approach is the probit model: $P(y = 1|x_1, x_2) = \Phi(\beta_0 + \beta_1 x_1 + \beta_2 x_2)$, where $\Phi$ is the standard normal CDF.
(i) Use WinBUGS to compute posterior distributions for $\beta_0$, $\beta_1$ and $\beta_2$ using diffuse normal priors for each.
(ii) Suppose instead of the diffuse normal prior for $\beta_i$, i = 0, 1, 2, you use a normal prior with mean zero and variance $v_i$, and assume the $v_i$s are independently exponentially distributed with some hyperparameter $\gamma$. Fit this model using BUGS. How different are the two posterior distributions from this exercise?

4.13. The following WinBUGS code flips a coin; the outcome H is coded by 1 and tails by 0. Mimic the following code to simulate a rolling of a fair die.

#coin.bug:
model coin;
{
  flip12 ~ dcat(p.coin[])
  coin <- flip12 - 1
}
#coin.dat:
list(p.coin=c(0.5, 0.5))
# just generate initials
4.14. The highly publicized (recent TV reports) in vitro fertilization success cases for women in their late fifties all involve a donor's egg. If the egg is the woman's own, the story is quite different.

In vitro fertilization (IVF), one of the assisted reproductive technology (ART) procedures, involves extracting a woman's eggs, fertilizing the eggs in the laboratory, and then transferring the resulting embryos into the woman's uterus through the cervix. Fertilization involves a specialized technique known as intracytoplasmic sperm injection (ICSI).

The table shows the live-birth success rate per transfer from the recipients' eggs, stratified by age of recipient. The data are for year 1999, published by US-Centers for Disease Control and Prevention (CDC): (http://www.cdc.gov/reproductivehealth/ART99/index99.htm)

Age (x)          24    25    26    27    28    29    30    31
Percentage (y)   38.7  38.6  38.9  41.4  39.7  41.1  38.7  37.6

Age (x)          32    33    34    35    36    37    38    39
Percentage (y)   36.3  36.9  35.7  33.8  33.2  30.1  27.8  22.7

Age (x)          40    41    42    43    44    45    46
Percentage (y)   21.3  15.4  11.2  9.2   5.4   3.0   1.6

Assume the change-point regression model

$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1,\dots,\tau$
$y_i = \gamma_0 + \gamma_1 x_i + \epsilon_i, \quad i = \tau+1,\dots,n$
$\epsilon_i \sim N(0, \sigma^2).$

(i) Propose priors (with possibly hyperpriors) on $\sigma^2$, $\beta_0$, $\beta_1$, $\gamma_0$, and $\gamma_1$.
(ii) Take discrete uniform prior on $\tau$ and write a WinBUGS program.

4.15. Is the cloning of humans moral? A recent Gallup Poll estimates that about 88% of Americans opposed cloning humans. Results are based on telephone interviews with a randomly selected national sample of n = 1000 adults, aged 18 and older, conducted May 2-4, 2004. In these 1000 interviews, 882 adults opposed cloning humans.
(i) Write a WinBUGS program to estimate the proportion p of people opposed to cloning humans. Use a non-informative prior for p.
(ii) Test the hypothesis that $p \le 0.87$.
(iii) Pretend that the original poll had n = 1062 adults, i.e., results for 62 adults are missing. Estimate the number of people opposed to cloning among the 62 missing in the poll. Hint:

model{
  anticlons ~ dbin(prob, npolled);
  lessthan87 <- step(prob - 0.87)
  anticlons.missing ~ dbin(prob, nmissing)
  prob ~ dbeta(1, 1)
}
Data
list(anticlons=882, npolled=1000, nmissing=62)

REFERENCES

Anscombe, F. J. (1962), "Tests of Goodness of Fit," Journal of the Royal Statistical Society (B), 25, 81-94.
Bayes, T. (1763), "An Essay Towards Solving a Problem in the Doctrine of Chances," Philosophical Transactions of the Royal Society, London, 53, 370-418.
Berger, J. O. (1985), Statistical Decision Theory and Bayesian Analysis, Second Edition, New York: Springer-Verlag.
Berger, J. O., and Delampady, M. (1987), "Testing Precise Hypothesis," Statistical Science, 2, 317-352.
Berger, J. O., and Sellke, T. (1987), "Testing a Point Null Hypothesis: The Irreconcilability of p-values and Evidence (with Discussion)," Journal of American Statistical Association, 82, 112-122.
Chen, M.-H., Shao, Q.-M., and Ibrahim, J. (2000), Monte Carlo Methods in Bayesian Computation, New York: Springer Verlag.
Congdon, P. (2001), Bayesian Statistical Modelling, Hoboken, NJ: Wiley.
Congdon, P. (2003), Applied Bayesian Models, Hoboken, NJ: Wiley.
Congdon, P. (2005), Bayesian Models for Categorical Data, Hoboken, NJ: Wiley.
Finney, D. J. (1947), "The Estimation from Individual Records of the Relationship Between Dose and Quantal Response," Biometrika, 34, 320-334.
Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-based Approaches to Calculating Marginal Densities," Journal of American Statistical Association, 85, 398-409.
Lindley, D. V. (1957), "A Statistical Paradox," Biometrika, 44, 187-192.
Madigan, D., http://stat.rutgers.edu/madigan/bayes02/. A Web Site for Course on Bayesian Statistics.
Martz, H., and Waller, R. (1985), Bayesian Reliability Analysis, New York: Wiley.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. (1953), "Equation of State Calculations by Fast Computing Machines," The Journal of Chemical Physics, 21, 1087-1092.
Robert, C. (2001), The Bayesian Choice: From Decision-Theoretic Motivations to Computational Implementation, Second Edition, New York: Springer Verlag.
Robert, C., and Casella, G. (2004), Monte Carlo Statistical Methods, Second Edition, New York: Springer Verlag.
Spiegelhalter, D. J., Thomas, A., Best, N. G., and Gilks, W. R. (1996), "BUGS Examples Volume 1," Version 0.5, Cambridge: Medical Research Council Biostatistics Unit (PDF).

Order Statistics

The early bird gets the worm, but the second mouse gets the cheese.
Steven Wright

Let $X_1, X_2, \dots, X_n$ be an independent sample from a population with absolutely continuous cumulative distribution function F and density f. The continuity of F implies that $P(X_i = X_j) = 0$ when $i \ne j$, and the sample could be ordered with strict inequalities, $X_{1:n} <$ [...]

$f_{i:n}(t) = i\binom{n}{i}F(t)^{i-1}(1 - F(t))^{n-i}f(t).$    (5.2)

Example 5.1 Recall that for any continuous distribution F, the transformed sample $F(X_1),\dots,F(X_n)$ is distributed U(0, 1). Similarly, from (5.2) the distribution of $F(X_{i:n})$ is $Be(i, n-i+1)$. Using the MATLAB code below, the densities are graphed in Figure 5.1.

>> x = 0:0.025:1;
>> for i = 1:5
>>   plot(x, betapdf(x, i, 6-i))
>>   hold all
>> end

Example 5.2 Reliability Systems. In reliability, series and parallel systems are building blocks for system analysis and design. A series system is one that works only if all of its components are working. A parallel system is one that fails only if all of its components fail. If the lifetimes of an n-component system $(X_1,\dots,X_n)$ are i.i.d. distributed, then if the system is in series, the system lifetime is $X_{1:n}$. On the other hand, for a parallel system, the lifetime is $X_{n:n}$.

5.1 JOINT DISTRIBUTIONS OF ORDER STATISTICS

Unlike the original sample $(X_1, X_2, \dots, X_n)$, the set of order statistics is inevitably dependent. If the vector $(X_1, X_2, \dots, X_n)$ has a joint density

$f_{1,2,\dots,n}(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} f(x_i),$

then the joint density for the order statistics, $f_{1,2,\dots,n:n}(x_1,\dots,x_n)$, is

$f_{1,2,\dots,n:n}(x_1,\dots,x_n) = n!\prod_{i=1}^{n} f(x_i)$ for $x_1 < x_2 < \cdots < x_n$, and 0 otherwise.    (5.3)

To understand why this is true, consider the conditional distribution of the order statistics $y = (x_{1:n}, x_{2:n}, \dots, x_{n:n})$ given $x = (x_1, x_2, \dots, x_n)$. Each one of the n! permutations of $(X_1, X_2, \dots, X_n)$ is equal in probability, so computing $f_y = \int f_{y|x}\,dF_x$ is incidental. The joint density can also be derived using a Jacobian transformation (see Exercise 5.3).
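The Beta claim of Example 5.1 lends itself to a quick simulation check. The Python sketch below draws repeated samples of five U(0, 1) variables, sorts each sample, and compares the average of the ith order statistic with $i/(n+1)$, the mean of $Be(i, n-i+1)$. The function name, the seed, and the number of repetitions are arbitrary choices of ours.

```python
import random


def mean_uniform_order_stats(n, reps, seed=0):
    """Monte Carlo means of the order statistics U_{1:n} <= ... <= U_{n:n}
    of n i.i.d. U(0, 1) draws, averaged over reps samples."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(reps):
        sample = sorted(rng.random() for _ in range(n))
        for i, u in enumerate(sample):
            totals[i] += u
    return [t / reps for t in totals]


n = 5
simulated = mean_uniform_order_stats(n, reps=20000)
theoretical = [i / (n + 1) for i in range(1, n + 1)]   # mean of Be(i, n - i + 1)
for i, (s, t) in enumerate(zip(simulated, theoretical), start=1):
    print(f"i={i}: simulated {s:.3f}, Be({i},{n - i + 1}) mean {t:.3f}")
```

In the language of Example 5.2, the first and last entries are the simulated mean lifetimes of a series and a parallel system of five components with U(0, 1) lifetimes.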
Fig. 5.1 Distribution of order statistics from a sample of five U(0, 1).

From (5.3) we can obtain the distribution of any subset of order statistics. The joint distribution of $X_{r:n}, X_{s:n}$, $1 \le r <$ [...]

[...] $p + \epsilon$ for some small $\epsilon > 0$. In this case, we can use a number that interpolates the value of $\hat{x}_p$ using the line between $(X_{r:n}, r/(n+1))$ and $(X_{(r+1):n}, (r+1)/(n+1))$:

$\hat{x}_p = (-p(n+1) + r + 1)X_{r:n} + (p(n+1) - r)X_{(r+1):n}.$    (5.7)

Note that if p = 1/2 and n is an even number, then r = n/2 and r + 1 = n/2 + 1, and $\hat{x}_{1/2} = (X_{\frac{n}{2}:n} + X_{(\frac{n}{2}+1):n})/2$. That is, the sample median is the average of the two middle sample order statistics. We note that there are alternative definitions of sample quantile in the literature, but they all have the same large sample properties.

5.3 TOLERANCE INTERVALS

Unlike the confidence interval, which is constructed to contain an unknown parameter with some specified degree of uncertainty (say, $1-\gamma$), a tolerance interval contains at least a proportion p of the population with probability $\gamma$. That is, a tolerance interval is a confidence interval for a distribution. Both p, the proportion of coverage, and $1-\gamma$, the uncertainty associated with the confidence statement, are predefined probabilities. For instance, we may be 95% confident that 90% of the population will fall within the range specified by a tolerance interval.

Order statistics play an important role in the construction of tolerance intervals. From a sample $X_1,\dots,X_n$ from (continuous) distribution F, two statistics $T_1 <$ [...]

2. At most r - 1 out of n observations are less than $x_p$
3. Let Y = number of observations less than $x_p$, so that $Y \sim Bin(n, p)$ if $x_p$ is the pth quantile
4. Find r large enough so that $P(Y \le r - 1) \ge \gamma$.

Example 5.4 A 90% upper confidence bound for the 75th percentile (or upper quartile) is found by assigning Y = number of observations less than $x_{0.75}$, where $Y \sim Bin(n, 0.75)$. Let n = 20. Note $P(Y \le 16) = 0.7748$ and $P(Y \le 17) = 0.9087$, so r - 1 = 17. The 90% upper bound for $x_{0.75}$, which is equivalent to a 90% upper tolerance bound for 75% of the population, is $x_{18:20}$ (the third largest observation out of 20).

For large samples, the normal approximation allows us to generate an upper bound more simply. For the upper bound $x_{r:n}$, r is approximated with

$\tilde{r} = np + z_\gamma\sqrt{np(1-p)}.$

In the example above, with n = 20 (of course, this is not exactly what we think of as "large"), $\tilde{r} = 20(0.75) + 1.28\sqrt{0.75(0.25)20} = 17.48$. According to this rule, $x_{17:20}$ is insufficient for the approximate interval, so $x_{18:20}$ is again the upper bound.

Example 5.5 Sample Range. From a sample of n, what is the probability that 100p% of the population lies within the sample range $(X_{1:n}, X_{n:n})$?

$P(F(X_{n:n}) - F(X_{1:n}) \ge p) = 1 - P(U_n <$ [...]

[...] $> 0$, then $x_{0.50} = \ln(2)/\theta$ and $\sqrt{n}(\hat{x}_{0.50} - x_{0.50}) \Longrightarrow N(0, \theta^{-2})$.

5.5 EXTREME VALUE THEORY

Earlier we equated a series system lifetime (of n i.i.d. components) with the sample minimum $X_{1:n}$. The limiting distributions of the minima or maxima are not so interesting, e.g., if X has distribution function F, $X_{1:n} \to x_0$, where $x_0 = \inf_x\{x : F(x) > 0\}$. However, the standardized limit is more interesting. For an example involving sample maxima, with $X_1,\dots,X_n$ from an exponential distribution with mean 1, consider the asymptotic distribution of $X_{n:n} - \log(n)$:

$P(X_{n:n} - \log(n) \le t) = P(X_{n:n} \le t + \log(n)) = [1 - \exp\{-t - \log(n)\}]^n = [1 - e^{-t}n^{-1}]^n \to \exp\{-e^{-t}\}.$

This is because $(1 + a/n)^n \to e^a$ as $n \to \infty$. This distribution, a special form of the Gumbel distribution, is also called the extreme-value distribution.

Extreme value theory states that the standardized series system lifetime converges to one of the three following distribution types $F^*$ (not including scale and location transformation) as the number of components increases to infinity:

Gumbel             $F^*(x) = \exp(-\exp(-x)), \quad -\infty < x < \infty$
Fréchet            $F^*(x) = \exp(-x^{-a}), \quad x > 0,\ a > 0$
Negative Weibull   $F^*(x) = \exp(-(-x)^{a}), \quad x < 0,\ a > 0$

5.6 RANKED SET SAMPLING

Suppose a researcher is sent out to Leech Lake, Minnesota, to ascertain the average weight of Walleye fish caught from that lake. She obtains her data by stopping the fishermen as they are returning to the dock after a day of fishing. In the time the researcher waited at the dock, three fishermen arrived, each with their daily limit of three Walleye. Because of limited time, she only has time to make one measurement with each fisherman, so at the end of her field study, she will get three measurements.

McIntyre (1952) discovered that with this forced limitation on measurements, there is an efficient way of getting information about the population mean. We might assume the researcher selected the fish to be measured randomly for each of the three fishermen that were returning to shore. McIntyre found that if she instead inspected the fish visually and selected them non-randomly, the data could beget a better estimator for the mean. Specifically, suppose the researcher examines the three Walleye from the first fisherman and selects the smallest one for measurement. She measures the second smallest from the next batch, and the largest from the third batch.

Opposed to a simple random sample (SRS), this ranked set sample (RSS) consists of independent order statistics which we will denote by $X_{[1:3]}, X_{[2:3]}, X_{[3:3]}$. If $\bar{X}$ is the sample mean from a SRS of size n, and $\bar{X}_{RSS}$ is the mean of a ranked set sample $X_{[1:n]},\dots,X_{[n:n]}$, it is easy to show that like $\bar{X}$, $\bar{X}_{RSS}$ is an unbiased estimator of the population mean. Moreover, it has smaller variance. That is, $Var(\bar{X}_{RSS}) \le Var(\bar{X})$.

This property is investigated further in the exercises. The key is that variances for order statistics are generally smaller than the variance of the i.i.d. measurements. If you think about the SRS estimator as a linear combination of order statistics, it differs from the linear combination of order statistics from a RSS by its covariance structure. It seems apparent, then, that the expected value of $\bar{X}_{RSS}$ must be the same as the expected value of the SRS mean $\bar{X}$.

The sampling aspect of RSS has received the most attention. Estimators of other parameters can be constructed to be more efficient than SRS estimators, including nonparametric estimators of the CDF (Stokes and Sager, 1988). The book by Chen, Bai, and Sinha (2003) is a comprehensive guide about basic results and recent findings in RSS theory.

5.7 EXERCISES

5.1. In MATLAB: Generate a sequence of 50 uniform random numbers and find their range. Repeat this procedure M = 1000 times; you will obtain 1000 ranges for 1000 sequences of 50 uniforms. Next, simulate 1000 percentiles from a beta Be(49, 2) distribution for p = (1:1000)/1001. Use M-file betainv(p, 49, 2). Produce a histogram for both sets of data, comparing the ordered ranges and percentiles of their theoretical distribution, Be(49, 2).

5.2. For a set of i.i.d. data from a continuous distribution F(x), derive the probability density function of the order statistic $X_{i:n}$ in (5.2).

5.3. For a sample of n = 3 observations, use a Jacobian transformation to derive the joint density of the order statistics
, $X_{1:3}, X_{2:3}, X_{3:3}$.

5.4. Consider a system that is composed of n identical components that have independent life distributions. In reliability, a k-out-of-n system is one for which at least k out of n components must work in order for the system to work. If the components have lifetime distribution F, find the distribution of the system lifetime and relate it to the order statistics of the component lifetimes.

5.5. In 2003, the lab of Human Computer Interaction and Health Care Informatics at the Georgia Institute of Technology conducted empirical research on the performance of patients with Diabetic Retinopathy. The experiment included 29 participants placed either in the control group (without Diabetic Retinopathy) or the treatment group (with Diabetic Retinopathy). The visual acuity data of all participants are listed below. Normal visual acuity is 20/20, and 20/60 means a person sees at 20 feet what a normal person sees at 60 feet.

20/20  20/20  20/20  20/25  20/15  20/30  20/25  20/20
20/25  20/80  20/30  20/25  20/30  20/50  20/30  20/20
20/15  20/20  20/25  20/16  20/30  20/15  20/15  20/25

The data of five participants were excluded from the table due to their failure to meet the requirement of the experiment, so 24 participants are counted in all. In order to verify if the data can represent the visual acuity of the general population, a 90% upper tolerance bound for 80% of the population is calculated.

5.6. In MATLAB, repeat the following M = 10000 times:
- Generate a normal sample of size n = 100, $X_1,\dots,X_{100}$.
- For a two-sided tolerance interval, fix the coverage probability as p = 0.8, and use the random interval $(X_{5:100}, X_{95:100})$. This interval will cover the proportion $F_X(X_{95:100}) - F_X(X_{5:100}) = U_{95:100} - U_{5:100}$ of the normal population.
- Count how many times in M runs $U_{95:100} - U_{5:100}$ exceeds the preassigned coverage p. Use this count to estimate $\gamma$.
- Compare the simulation estimator of $\gamma$ with the theory, $\gamma$ = 1 - betainc(p, s-r, (n+1)-(s-r)).
What if instead of a normal sample you used an exponentially distributed sample?

5.7. Suppose that components of a system are distributed i.i.d. U(0, 1) lifetime. By standardizing with n, where n is the number of components in the system, find the limiting lifetime distribution of a parallel system as the number of components increases to infinity.

5.8. How large of a sample is needed in order for the sample range to serve as a 99% tolerance interval that contains 90% of the population?

5.9. How large must the sample be in order to have 95% confidence that at least 90% of the population is less than $X_{(n-1):n}$?

5.10. For a large sample of i.i.d. randomly generated U(0, 1) variables, compare the asymptotic distribution of the sample mean with that of the sample median.

5.11. Prove that a ranked set sample mean is unbiased for estimating the population mean by showing that $\sum_{i=1}^{n} E(X_{[i:n]}) = n\mu$. In the case the underlying data are generated from U(0, 1), prove that the sample variance for the RSS mean is strictly less than that of the sample mean from a SRS.

5.12. Find a 90% upper tolerance interval for the 99th percentile of a sample of size n = 1000.

5.13. Suppose that N items, labeled by sequential integers as {1, 2, ..., N}, constitute the population. Let $X_1, X_2, \dots, X_n$ be a sample of size n (without repeating) from this population and let $X_{1:n},\dots,X_{n:n}$ be the order statistics. It is of interest to estimate the size of population, N.

This theoretical scenario is a basis for several interesting popular problems: tramcars in San Francisco, captured German tanks, maximal lottery number, etc. The most popular is the German tanks story, featured in The Guardian (2006). The full story is quite interesting, but the bottom line is to estimate total size of production if five German tanks with "serial numbers" 12, 33, 37, 78, and 103 have been captured by Allied forces.

(i) Show that the distribution of $X_{i:n}$ is

$P(X_{i:n} = k) = \frac{\binom{k-1}{i-1}\binom{N-k}{n-i}}{\binom{N}{n}}, \quad k = i, i+1, \dots, N-n+i.$

(ii) Using the identity $\sum_{k=i}^{N-n+i}\binom{k-1}{i-1}\binom{N-k}{n-i} = \binom{N}{n}$ and the distribution from (i), show that $E X_{i:n} = i(N+1)/(n+1)$.

(iii) Show that the estimator $Y_i = \frac{n+1}{i}X_{i:n} - 1$ is unbiased for estimating N for any i = 1, 2, ..., n. Estimate number of tanks N on basis of $Y_5$ from the observed sample {12, 33, 37, 78, 103}.
REFERENCES

Chen, Z., Bai, Z., and Sinha, B. K. (2003), Ranked Set Sampling: Theory and Applications, New York: Springer Verlag.
David, H. A., and Nagaraja, H. N. (2003), Order Statistics, Third Edition, New York: Wiley.
McIntyre, G. A. (1952), "A method for unbiased selective sampling using ranked sets," Australian Journal of Agricultural Research, 3, 385-390.
Stokes, S. L., and Sager, T. W. (1988), "Characterization of a Ranked-Set Sample with Application to Estimating Distribution Functions," Journal of the American Statistical Association, 83, 374-381.
The Guardian (2006), "Gavyn Davies Does the Maths: How a Statistical Formula Won the War," Thursday, July 20, 2006.

Goodness of Fit

Believe nothing just because a so-called wise person said it. Believe nothing just because a belief is generally held. Believe nothing just because it is said in ancient books. Believe nothing just because it is said to be of divine origin. Believe nothing just because someone else believes it. Believe only what you yourself test and judge to be true.
paraphrased from the Buddha

Modern experiments are plagued by well-meaning assumptions that the data are distributed according to some "textbook" CDF. This chapter introduces methods to test the merits of a hypothesized distribution in fitting the data. The term goodness of fit was coined by Pearson in 1902, and refers to statistical tests that check the quality of a model or a distribution's fit to a set of data. The first measure of goodness of fit for general distributions was derived by Kolmogorov (1933). Andrei Nikolaevich Kolmogorov (Figure 6.1(a)), perhaps the most accomplished and celebrated Soviet mathematician of all time, made fundamental contributions to probability theory, including test statistics for distribution functions, some of which bear his name. Nikolai Vasil'yevich Smirnov (Figure 6.1(b)), another Soviet mathematician, extended Kolmogorov's results to two samples.

In this section we emphasize objective tests (with p-values, etc.) and later we analyze graphical methods for testing goodness of fit. Recall the empirical distribution functions from p. 34. The Kolmogorov statistic (sometimes called
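the Kolmogorov-Smirnov test statistic) is a basis to many nonparametric goodness-of-fit tests for distributions, and this is where we will start.

Fig. 6.1 (a) Andrei Nikolaevich Kolmogorov (1903-1987); (b) Nikolai Vasil'yevich Smirnov (1900-1966)

6.1 KOLMOGOROV-SMIRNOV TEST STATISTIC

Let $X_1, X_2, \dots, X_n$ be a sample from a population with continuous, but unknown CDF F. As in (3.1), let $F_n(x)$ be the empirical CDF based on the sample. To test the hypothesis

$H_0: F(x) = F_0(x), \quad (\forall x)$

versus the alternative

$H_1: F(x) \ne F_0(x),$

we use the modified statistics $\sqrt{n}D_n = \sup_x \sqrt{n}|F_n(x) - F_0(x)|$, calculated from the sample as

$\sqrt{n}D_n = \sqrt{n}\max\{\max_i |F_n(X_i) - F_0(X_i)|,\ \max_i |F_n(X_i^-) - F_0(X_i)|\}.$

This is a simple discrete optimization problem because $F_n$ is a step function and $F_0$ is nondecreasing, so the maximum discrepancy between $F_n$ and $F_0$ occurs at the observation points or at their left limits. When the hypothesis $H_0$ is true, the statistic $\sqrt{n}D_n$ is distributed free of $F_0$. In fact, Kolmogorov (1933) showed that under $H_0$,

$P(\sqrt{n}D_n \le d) \Longrightarrow H(d) = 1 - 2\sum_{j=1}^{\infty}(-1)^{j-1}e^{-2j^2d^2}.$

In practice, most Kolmogorov-Smirnov (KS) tests are two sided, testing whether F is equal to $F_0$, the distribution postulated by $H_0$, or not. Alternatively, we might test to see if the distribution is larger or smaller than a hypothesized $F_0$. For example, to find out if X is stochastically smaller than Y ($F_X(x) \ge F_Y(x)$), the two one-sided alternatives that can be tested are

$H_{1,-}: F_X(x) \le F_0(x) \quad \text{or} \quad H_{1,+}: F_X(x) \ge F_0(x).$

Appropriate statistics for testing $H_{1,-}$ and $H_{1,+}$ are

$\sqrt{n}D_n^- \equiv \sup_x \sqrt{n}(F_0(x) - F_n(x)), \quad \sqrt{n}D_n^+ \equiv \sup_x \sqrt{n}(F_n(x) - F_0(x)),$

which are calculated at the sample values as

$\sqrt{n}D_n^- = \sqrt{n}\max\{\max_i (F_0(X_i) - F_n(X_i^-)),\ 0\}$ and
$\sqrt{n}D_n^+ = \sqrt{n}\max\{\max_i (F_n(X_i) - F_0(X_i)),\ 0\}.$

Obviously, $D_n = \max\{D_n^-, D_n^+\}$. In terms of order statistics,

$D_n^+ = \max\{\max_i (F_n(X_i) - F_0(X_i)),\ 0\} = \max\{\max_i (i/n - F_0(X_{i:n})),\ 0\}$ and
$D_n^- = \max\{\max_i (F_0(X_{i:n}) - (i-1)/n),\ 0\}.$

Under $H_0$, the distributions of $D_n^+$ and $D_n^-$ coincide. Although conceptually straightforward, the derivation of the distribution for $D_n^+$ is quite involved. Under $H_0$, for $c \in (0, 1)$, we have

$P(D_n^+ < c) = P(i/n - U_{i:n} < c,\ \text{for all } i = 1, 2, \dots, n) = P(U_{i:n} > i/n - c,\ \text{for all } i = 1, 2, \dots, n)$ [...]

where $f(u_1,\dots,u_n) = n!\,\mathbf{1}(0 < u_1 < \cdots < u_n < 1)$ [...]

[...] $> k_n(1-\alpha/2)$, where $k_n(1-\alpha)$ is the tabled quantile under $\alpha$. If n > 40, we can approximate these quantiles $k_n(\alpha)$ as

$k_n$      $1.07/\sqrt{n}$   $1.22/\sqrt{n}$   $1.36/\sqrt{n}$   $1.52/\sqrt{n}$   $1.63/\sqrt{n}$
$\alpha$   0.10              0.05              0.025             0.01              0.005

Later, we will discuss alternative tests for distribution goodness of fit.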
The KS test has advantages over tests based on the $\chi^2$ goodness-of-fit statistic (see Chapter 9), which depend on an adequate sample size and proper interval assignments for the approximations to be valid. The KS test has important limitations, too. Technically, it only applies to continuous distributions. The KS statistic tends to be more sensitive near the center of the distribution than at the tails. Perhaps the most serious limitation is that the distribution must be fully specified. That is, if location, scale, and shape parameters are estimated from the data, the critical region of the KS test is no longer valid. It typically must be determined by simulation.

Example 6.1 With 5 observations {0.1, 0.14, 0.2, 0.48, 0.58}, we wish to test $H_0$: Data are distributed $U(0,1)$ versus $H_1$: Data are not distributed $U(0,1)$. We check $F_n$ and $F_0(x) = x$ at the five points of data along with their left-hand limits. $|F_n(x_i) - F_0(x_i)|$ equals (0.1, 0.26, 0.4, 0.32, 0.42) at $i = 1, \ldots, 5$, and $|F_n(x_i^-) - F_0(x_i)|$ equals (0.1, 0.06, 0.2, 0.12, 0.22), so that $D_n = 0.42$. According to the table, $k_5(.10) = 0.44698$. This is a two-sided test, so the null hypothesis cannot be rejected at $\alpha = 0.20$. This is due more to the small sample size than to the evidence presented by the five observations.

Example 6.2 Galaxy velocity data, available on the book's website, was analyzed by Roeder (1990), and consists of the velocities of 82 distant galaxies, diverging from our own galaxy. A mixture model was applied to describe the underlying distribution. The first hypothesized fit is the normal distribution,
Table 6.4 Upper Quantiles for Kolmogorov-Smirnov Test Statistic.

     n   alpha=.10  alpha=.05  alpha=.025  alpha=.01  alpha=.005
     1    .90000     .95000     .97500      .99000     .99500
     2    .68377     .77639     .84189      .90000     .92929
     3    .56481     .63604     .70760      .78456     .82900
     4    .49265     .56522     .62394      .68887     .73424
     5    .44698     .50935     .56328      .62718     .66853
     6    .41037     .46799     .51926      .57741     .61661
     7    .38148     .43607     .48342      .53844     .57581
     8    .35831     .40962     .45427      .50654     .54179
     9    .33910     .38746     .43001      .47960     .51332
    10    .32260     .36866     .40925      .45662     .48893
    11    .30829     .35242     .39122      .43670     .46770
    12    .29577     .33815     .37543      .41918     .44905
    13    .28470     .32549     .36143      .40362     .43247
    14    .27481     .31417     .34890      .38970     .41762
    15    .26588     .30397     .33760      .37713     .40420
    16    .25778     .29472     .32733      .36571     .39201
    17    .25039     .28627     .31796      .35528     .38086
    18    .24360     .27851     .30936      .34569     .37062
    19    .23735     .27136     .30143      .33685     .36117
    20    .23156     .26473     .29408      .32866     .35241
    21    .22617     .25858     .28724      .32104     .34427
    22    .22115     .25283     .28087      .31394     .33666
    23    .21645     .24746     .27490      .30728     .32954
    24    .21205     .24242     .26931      .30104     .32286
    25    .20790     .23768     .26404      .29516     .31657
    26    .20399     .23320     .25907      .28962     .31064
    27    .20030     .22898     .25438      .28438     .30502
    28    .19680     .22497     .24993      .27942     .29971
    29    .19348     .22117     .24571      .27471     .29466
    30    .19032     .21756     .24170      .27023     .28987
    31    .18732     .21412     .23788      .26596     .28530
    32    .18445     .21085     .23424      .26189     .28094
    33    .18171     .20771     .23076      .25801     .27677
    34    .17909     .20472     .22743      .25429     .27279
    35    .17659     .20185     .22425      .25073     .26897
    36    .17418     .19910     .22119      .24732     .26532
    37    .17188     .19646     .21826      .24404     .26180
    38    .16966     .19392     .21544      .24089     .25843
    39    .16753     .19148     .21273      .23786     .25518
    40    .16547     .18913     .21012      .23494     .25205
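specifically $N(21, (\ldots)^2)$, and the KS distance is $\sqrt{n}\,D_n = 1.6224$ with p-value of 0.0103. The following mixture of normal distributions with five components was also fit to the data:

$$\hat{F} = 0.1\,\Phi(9, 0.5^2) + 0.02\,\Phi(17, (\ldots)^2) + 0.4\,\Phi(20, (\ldots)^2) + 0.4\,\Phi(23, (\ldots)^2) + 0.05\,\Phi(33, (\ldots)^2),$$

where $\Phi(\mu, \sigma)$ is the CDF for the normal distribution. The KS statistic is $\sqrt{n}\,D_n = 1.1734$ and the corresponding p-value is 0.1273. Figure 6.2 plots the CDF of the transformed variables $\hat{F}(X)$, so a good fit is indicated by a straight line. Recall, if $X \sim F$, then $F(X) \sim U(0,1)$ and the straight line is, in fact, the CDF of $U(0,1)$, $F(x) = x$, $0 \le x \le 1$. Panel (a) shows the fit for the $N(21, (\ldots)^2)$ model while panel (b) shows the fit for the mixture model. Although not perfect itself, the mixture model shows significant improvement over the single normal model.

Fig. 6.2 Fitted distributions: (a) $N(21, (\ldots)^2)$ and (b) Mixture of Normals.

6.2 SMIRNOV TEST TO COMPARE TWO DISTRIBUTIONS

Smirnov (1939a, 1939b) extended the KS test to compare two distributions based on independent samples from each population. Let $X_1, X_2, \ldots, X_m$ and $Y_1, Y_2, \ldots, Y_n$ be two independent samples from populations with unknown CDFs $F_X$ and $G_Y$. Let $F_m(x)$ and $G_n(x)$ be the corresponding empirical distribution functions. We would like to test

$$H_0: F_X(x) = G_Y(x) \ (\forall x) \quad \text{versus} \quad H_1: F_X(x) \neq G_Y(x) \ \text{for some } x.$$

We will use the analog of the KS statistic $D_n$:

$$D_{m,n} = \sup_x |F_m(x) - G_n(x)|, \qquad (6.1)$$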
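where $D_{m,n}$ can be simplified (in terms of programming convenience) to

$$D_{m,n} = \max_i \{|F_m(Z_i) - G_n(Z_i)|\}$$

and $Z = Z_1, \ldots, Z_{m+n}$ is the combined sample $X_1, \ldots, X_m, Y_1, \ldots, Y_n$. $D_{m,n}$ will be large if there is a cluster of values from one sample after the samples are combined. The imbalance can be equivalently measured in how the ranks of one sample compare to those of the other after they are joined together. That is, values from the samples are not directly relevant except for how they are ordered when combined. This is the essential nature of rank tests that we will investigate in the next chapter.

The two-distribution test extends simply from two-sided to one-sided. The one-sided test statistics are $D^+_{m,n} = \sup_x (F_m(x) - G_n(x))$ or $D^-_{m,n} = \sup_x (G_n(x) - F_m(x))$. Note that the ranks of the two groups of data determine the supremum difference in (6.1), and the values of the data determine only the position of the jumps for $G_n(x) - F_m(x)$.

Example 6.3 For the test of $H_1: F_X(x) > G_Y(x)$ with $n = m = 2$, there are $\binom{4}{2} = 6$ different sample representations (with equal probability):

    sample order         D+_{m,n}
    X < X < Y < Y        1
    X < Y < X < Y        1/2
    X < Y < Y < X        1/2
    Y < X < X < Y        1/2
    Y < X < Y < X        0
    Y < Y < X < X        0

so $D^+_{2,2}$ takes the values 0, 1/2 and 1 with probabilities 1/3, 1/2 and 1/6. If we reject $H_0$ in the case $D^+_{2,2} = 1$ (for $H_1: F_X(x) > G_Y(x)$) then our type-I error rate is $\alpha = 1/6$.

If $m = n$ in general, the null distribution of the test statistic simplifies, for an attainable value $d = k/n$, to

$$P(D^+_{n,n} \ge d) = P(D^-_{n,n} \ge d) = \frac{\binom{2n}{\lfloor n(d+1) \rfloor}}{\binom{2n}{n}},$$

where $\lfloor a \rfloor$ denotes the greatest integer $\le a$. For two sided tests, this is doubled to obtain the p-value. If $m$ and $n$ are large ($m, n > 30$) and of comparable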
size, then an approximate distribution can be used:

$$P\left(\sqrt{\frac{mn}{m+n}}\, D_{m,n} \le d\right) \approx 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2d^2}.$$

A simpler large sample approximation, given in Table 6.5, works effectively if $m$ and $n$ are both larger than, say, 50.

Table 6.5 Tail Probabilities for Smirnov Two-Sample Test.

    One-sided test   alpha = 0.05          alpha = 0.025         alpha = 0.01          alpha = 0.005
    Two-sided test   alpha = 0.10          alpha = 0.05          alpha = 0.02          alpha = 0.01
    Quantile         1.22 sqrt((m+n)/mn)   1.36 sqrt((m+n)/mn)   1.52 sqrt((m+n)/mn)   1.63 sqrt((m+n)/mn)

Example 6.4 Suppose we have $n = m = 4$ with data $(x_1, x_2, x_3, x_4) = (16, 4, 7, 21)$ and $(y_1, y_2, y_3, y_4) = (56, 31, 15, 19)$. For the Smirnov test of $H_1: F \neq G$, the only thing important about the data is how they are ranked within the group of eight combined observations:

    4 (x) < 7 (x) < 15 (y) < 16 (x) < 19 (y) < 21 (x) < 31 (y) < 56 (y)

$|F_m - G_n|$ is never larger than 1/2, achieved in intervals (7, 15), (16, 19), (21, 31). The p-value for the two-sided test is

$$2 \cdot \binom{8}{6} \Big/ \binom{8}{4} = 2 \cdot \frac{28}{70} = 0.8.$$

Example 6.5 Figure 6.3 shows the EDFs for two samples of size 100. One is generated from normal data, and the other from exponential data. They have identical mean ($\mu = 10$) and variance ($\sigma^2 = 100$). The MATLAB m-file kstest2 can be used for the two-sample test (kstest handles the one-sample case). The MATLAB code shows the p-value is 0.0018. If we compared the samples using a two-sample t-test, the significance value is 0.313 because the t-test is testing only the means, and not the distribution (which is assumed to be normal). Note that $\sup_x |F_m(x) - G_n(x)| = 0.26$, and according to Table 6.5, the 0.99 quantile for the two-sided test is 0.2305.

>> xn = randgauss(10,100,100);
>> ne = randexpo(.1,100)
>> cdfplot(xn)
>> hold on
Current plot held
>> cdfplot(ne)
>> [h,p,ks2] = kstest2(xn,ne)
h = 1
p = 0.0018
ks2 = 0.2600
>> [h,p,ci] = ttest2(ne,xn)
h = 0
p = 0.3130
ci = -3.8992 1.2551

Fig. 6.3 EDF for samples of n = m = 100 generated from normal and exponential with $\mu = 10$ and $\sigma^2 = 100$.

6.3 SPECIALIZED TESTS FOR GOODNESS OF FIT

In this section, we will go over some of the most important goodness-of-fit tests that were made specifically for certain distributions such as the normal or exponential. In general, there is not a clear ranking on which tests below are best and which are worst, but they all have clear advantages over the less-specific KS test.
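As a point of comparison for the specialized tests, the two-sample Smirnov statistic of Section 6.2 is simple enough to compute from scratch. The sketch below is a hedged Python illustration, not the book's MATLAB: the function names `smirnov_d` and `exact_p_value` are assumptions for this example. It evaluates $\sup_x |F_m(x) - G_n(x)|$ on the combined sample and obtains a two-sided p-value by brute-force enumeration of all $\binom{m+n}{m}$ relabellings, which is only feasible for very small samples such as the eight observations of Example 6.4.

```python
from itertools import combinations

def smirnov_d(x, y):
    """Two-sided Smirnov statistic sup |F_m - G_n|, evaluated at the combined sample points."""
    m, n = len(x), len(y)
    d = 0.0
    for z in sorted(set(x) | set(y)):
        fm = sum(v <= z for v in x) / m   # empirical CDF of x at z
        gn = sum(v <= z for v in y) / n   # empirical CDF of y at z
        d = max(d, abs(fm - gn))
    return d

def exact_p_value(x, y):
    """Return (observed D, number of relabellings with D >= observed, total relabellings)."""
    pooled = list(x) + list(y)
    m = len(x)
    observed = smirnov_d(x, y)
    count = total = 0
    for idx in combinations(range(len(pooled)), m):
        chosen = set(idx)
        xs = [pooled[i] for i in chosen]
        ys = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if smirnov_d(xs, ys) >= observed - 1e-12:
            count += 1
    return observed, count, total

x = [16, 4, 7, 21]
y = [56, 31, 15, 19]
observed, count, total = exact_p_value(x, y)
p_exact = count / total
```

For these data the enumeration finds 54 of the 70 equally likely arrangements with a distance of at least 1/2. Doubling the one-sided tail, as the text describes, gives a slightly larger and therefore conservative value, because arrangements that are extreme in both directions are counted twice.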
Table 6.6 Null Distribution of Anderson-Darling Test Statistic: Modifications of $A^2$ and Upper Tail Percentage Points.

                                                          Upper tail probability alpha
    Modification A*, A**                                  0.10     0.05     0.025    0.01
    (a) Case 0: Fully specified N(mu, sigma^2)            1.933    2.492    3.070    3.857
    (b) Case 1: N(mu, sigma^2), only sigma^2 known        0.894    1.087    1.285    1.551
        Case 2: sigma^2 estimated by s^2, mu known        1.743    2.308    2.898    3.702
        Case 3: mu and sigma^2 estimated, A*              0.631    0.752    0.873    1.035
    (c) Case 4: Exp(theta), A**                           1.062    1.321    1.591    1.959

6.3.1 Anderson-Darling Test

Anderson and Darling (1954) looked to improve upon the Kolmogorov-Smirnov statistic by modifying it for distributions of interest. The Anderson-Darling test is used to verify if a sample of data came from a population with a specific distribution. It is a modification of the KS test that accounts for the hypothesized distribution and gives more attention to the tails. As mentioned before, the KS test is distribution free, in the sense that the critical values do not depend on the specific distribution being tested. The Anderson-Darling test makes use of the specific distribution in calculating the critical values. The advantage is that this sharpens the test, but the disadvantage is that critical values must be calculated for each hypothesized distribution.

The statistic for testing $H_0: F(x) = F_0(x)$ versus the two sided alternative is $A^2 = -n - S$, where

$$S = \sum_{i=1}^{n} \frac{2i-1}{n} \Big[ \log F_0(X_{i:n}) + \log\big(1 - F_0(X_{n+1-i:n})\big) \Big].$$

Tabulated values and formulas have been published (Stephens, 1974, 1976) for the normal, lognormal, and exponential distributions. The hypothesis that the distribution is of a specific form is rejected if the test statistic $A^2$ (or modified $A^*$, $A^{**}$) is greater than the critical value given in Table 6.6. Cases 0, 1, and 2 do not need modification, i.e., the observed $A^2$ is directly compared to the values in the table. Case 3 and (c) compare a modified $A^2$ ($A^*$ or $A^{**}$) to the critical values in Table 6.6. In (b), $A^* = A^2(1 + 0.75/n + 2.25/n^2)$, and in (c), $A^{**} = A^2(1 + \ldots)$.

Example 6.6 The following example has been used extensively in testing for normality. The weights of 11 men (in pounds) are given: 148, 154, 158, 160, 161, 162, 166, 170, 182, 195, and 236. The sample mean is 172 and sample standard deviation is 24.952. Because mean and variance are estimated, this refers to Case 3 in Table 6.6. The standardized observations are $w_1 = (148 - 172)/24.952 = -0.9618, \ldots, w_{11} = 2.5649$, and
$z_1 = \Phi(w_1) = 0.1681, \ldots, z_{11} = 0.9948$. Next we calculate $A^2 = 0.9468$ and modify it as $A^* = A^2(1 + 0.75/11 + 2.25/121) = 1.029$. From the table we see that this is significant at all levels except for $\alpha = 0.01$, e.g., the null hypothesis of normality is rejected at level $\alpha = 0.05$. Here is the corresponding MATLAB code:

>> weights = [148, 154, 158, 160, 161, 162, 166, 170, 182, 195, 236];
>> n = length(weights); ws = (weights - mean(weights))/std(weights);
>> zs = 1/2 + 1/2*erf(ws/sqrt(2)); % transformation to uniform o.s.
>> % calculation of anderson-darling
>> s = 0;
>> for i = 1:n
       s = s + (2*i-1)/n * (log(zs(i)) + log(1-zs(n+1-i)));
   end
>> a2 = -n - s;
>> astar = a2 * (1 + 0.75/n + 2.25/n^2);

Example 6.7 Weight is one of the most important quality characteristics of the positive plate in storage batteries. Each positive plate consists of a metal frame inserted in an acid-resistant bag (called 'oxide holder') and the empty space in the bag is filled with active material, such as powdered lead oxide. About 75% of the weight of a positive plate consists of the filled oxide. It is also known from past experience that variations in frame and bag weights are negligible. The distribution of filled plate weights is, therefore, an indication of how good the filling process has been. If the process is perfectly controlled, the distribution should be normal, centered around the target, whereas departure from normality would indicate lack of control over the filling operation.

Weights of 97 filled plates (chosen at random from the lot produced in a shift) are measured in grams. The data are tested for normality using the Anderson-Darling test. The data and the MATLAB program written for this part are listed in Appendix A. The results in the MATLAB program list $A^2 = 0.8344$ and $A^* = 0.8410$.

6.3.2 Cramér-von Mises Test

The Cramér-von Mises test measures the weighted distance between the empirical CDF $F_n$ and postulated CDF $F_0$. Based on a squared-error function, the test statistic is

$$\omega_n^2(\psi(F_0)) = \int_{-\infty}^{\infty} \big(F_n(x) - F_0(x)\big)^2\, \psi(F_0(x))\, dF_0(x).$$

There are several popular choices for the (weight) functional $\psi$. When $\psi(x) = 1$, this is the "standard" Cramér-von Mises statistic $\omega_n^2(1) = \omega_n^2$, in which case
the test statistic becomes

$$n\,\omega_n^2 = \frac{1}{12n} + \sum_{i=1}^{n} \left( F_0(X_{i:n}) - \frac{2i-1}{2n} \right)^2.$$

Fig. 6.4 Harald Cramér (1893-1985); Richard von Mises (1883-1953).

When $\psi(x) = x^{-1}(1-x)^{-1}$, $\omega_n^2(1/(F_0(1-F_0))) = A^2/n$, and $A^2$ is the Anderson-Darling statistic. Under the hypothesis $H_0: F = F_0$, the asymptotic distribution of $\omega_n^2(\psi(F))$ is given by a series in $j$ whose terms involve $(4j+1)$ and the difference $J_{-1/4}(\cdot) - J_{1/4}(\cdot)$ …, where $J_k(z)$ is the modified Bessel function (in MATLAB: bessel(k,z)).

In MATLAB, the particular Cramér-von Mises test for normality can be applied to a sample x with the function mtest(x, alpha), where the weight function is one and alpha must be less than 0.10. The MATLAB code below shows how it works. Along with the simple "reject or not" output, the m-file also produces a graph (Figure 6.5) of the sample EDF along with the $N(0,1)$ CDF. Note: the data are assumed to be standardized. The output of 1 implies we do not reject the null hypothesis ($H_0: N(0,1)$) at the entered $\alpha$ level.

Fig. 6.5 Plots of EDF versus $N(0,1)$ CDF for n = 25 observations of (a) $N(0,1)$ data and (b) standardized Bin(100, 0.5) data.

>> x = rand_nor(0,1,25,1)
>> mtest(x', 0.05)
ans = 1
>> y = rand_bin(100,0.5,25)
>> y2 = (y - mean(y))/std(y)
>> mtest(y2, 0.05)
ans = 1

6.3.3 Shapiro-Wilk Test for Normality

The Shapiro-Wilk (Shapiro and Wilk, 1965) test calculates a statistic that tests whether a random sample $X_1, X_2, \ldots, X_n$ comes from a normal distribution. Because it is custom made for the normal, this test has done well in comparison studies with other goodness of fit tests (and far outperforms the Kolmogorov-Smirnov test) if normally distributed data are involved. The test statistic ($W$) is calculated as

$$W = \frac{\left(\sum_{i=1}^{n} a_i X_{i:n}\right)^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2},$$

where the $X_{1:n} < X_{2:n} < \cdots < X_{n:n}$ are the ordered sample values and the $a_i$ are constants …

…

>> [x] = randgauss(0,1,1000);
>> [y] = randgauss(0.1,1,100);
>> z = [x,y];
>> [gg] = mtest(z, .001)
>> probplot(z)

Fig. 6.7 (a) Plot of EDF vs. normal CDF, (b) normal probability plot.

Example 6.10 Thirty observations were generated from a normal distribution. The MATLAB function qqweib constructs a probability plot for Weibull data. The Weibull probability plot in Figure 6.8 shows a slight curvature which suggests the model is misfit. To linearize the Weibull CDF, if the CDF is expressed as $F(x) = 1 - \exp(-(x/\gamma)^\beta)$, then

$$\ln(x_p) = \frac{1}{\beta} \ln(-\ln(1-p)) + \ln(\gamma).$$

The plot of $\ln(x_p)$ versus $\ln(-\ln(1-p))$ is a straight line determined by the two parameters $\beta^{-1}$ and $\ln(\gamma)$. The MATLAB procedure qqweib also reports the scale parameter scale and the shape parameter shape, estimated by the method of least squares.

Fig. 6.8 Weibull probability plot of 30 observations generated from a normal distribution (horizontal axis: log(data)).

>> [x] = randgauss(10,1,30);
>> [shape,scale] = qqweib(x)
shape = 13.2094
scale = 9.9904

Example 6.11 Quantile-Quantile Plots. For testing the equality of two distributions, the graphical analog to the Smirnov test is the Quantile-Quantile Plot, or q-q plot. The MATLAB function qqplot(x, y, *) plots the empirical quantiles of the vector x versus those of y. The third argument is optional and represents the plotting symbol to use in the q-q plot. If the plotted points veer away from the 45° reference line, evidence suggests the data are generated by populations with different distributions. Although the q-q plot leads to subjective judgment, several aspects of the distributions can be compared graphically. For example, if the two distributions differ only by a location shift ($F(x) = G(x + \delta)$), the plot of points will be parallel to the reference line.

Many practitioners use the q-q plot as a probability plot by replacing the second sample with the quantiles of the hypothesized distribution. Three
other MATLAB functions for probability plotting are listed below, but they use the q-q plot moniker. The argument symbol is optional in all three.

    qqnorm(x,symbol)    Normal probability plot
    qqweib(x,symbol)    Weibull probability plot
    qqgamma(x,symbol)   Gamma probability plot

In Figure 6.9, the q-q plots are displayed for the randomly generated data in the MATLAB code below. The standard qqplot MATLAB outputs (scatterplot and dotted line fit) are enhanced by a dashed line $y = x$ representing identity of two distributions. In each case, a distribution is plotted against $N(100, 10^2)$ data. The first case (a) represents $N(120, 10^2)$ and the points appear parallel to the reference line because the only difference between the two distributions is a shift in the mean. In (b) the second distribution is $N(100, 40^2)$. The only difference is in variance, and this is reflected in the slope change in the plot. In the cases (c) and (d), the discrepancy is due to the lack of distribution fit; the data in (c) are generated from the t-distribution with 1 degree of freedom, so the tail behavior is much different than that of the normal distribution. This is evident in the left and right end of the q-q plot. In (d), the data are distributed gamma, and the illustrated difference between the two samples is more clear.

>> x = rand_nor(100,10,30,1);
>> y1 = rand_nor(120,10,30,1); qqplot(x,y1)
>> y2 = rand_nor(100,40,30,1); qqplot(x,y2)
>> y3 = 100 + 10*rand_t(1,30,1); qqplot(x,y3)
>> y4 = rand_gamma(200,2,30,1); qqplot(x,y4)

Fig. 6.9 Data from $N(100, 10^2)$ are plotted against data from (a) $N(120, 10^2)$, (b) $N(100, 40^2)$, (c) $t_1$ and (d) Gamma(200, 2). The standard qqplot MATLAB outputs (scatterplot and dotted line fit) are enhanced by a dashed line $y = x$ representing identity of two distributions.

6.5 RUNS TEST

A chief concern in the application of statistics is to find and understand patterns in data apart from the randomness (noise) that obscures them. While humans are good at deciphering and interpreting patterns, we are much less able to detect randomness. For example, if you ask any large group of people to randomly choose an integer from one to ten, the numbers seven and four are chosen nearly half the time, while the endpoints (one, ten) are rarely chosen. Someone trying to think of a random number in that range imagines something toward the middle, but not exactly in the middle. Anything else just doesn't look "random" to us.

In this section we use statistics to look for randomness in a simple string of dichotomous data. In many examples, the runs test will not be the most efficient statistical tool available, but the runs test is intuitive and easier to interpret than more computational tests.

Suppose items from the sample $X_1, X_2, \ldots, X_n$ could be classified as type 1 or type 2. If the sample is random, the 1's and 2's are well mixed, and any clustering or pattern in 1's and 2's violates the hypothesis of randomness. To decide whether or not the pattern is random, we consider the statistic $R$, defined as the number of homogeneous runs in a sequence of ones and twos. In other words, $R$ represents the number of times the symbols change in the sequence (including the first one). For example, $R = 5$ in this sequence of $n = 11$:

    1 2 2 2 1 1 2 2 1 1 1

Obviously if there were only two runs in that sequence, we could see the pattern where the symbols are separated right and left. On the other hand, if $R = 11$, the symbols are intermingling in a non-random way. If $R$ is too large, the sequence is showing anti-correlation, a repulsion of same symbols, and zig-zag behavior. If $R$ is too small, the sample is suggesting trends, clustering and groupings in the order of the dichotomous symbols. If the null hypothesis claims that the pattern of randomness exists, then if $R$ is either too big or too small, the alternative hypothesis of an existing trend is supported.

Assume that a dichotomous sequence has $n_1$ ones and $n_2$ twos, $n_1 + n_2 = n$. If $R$ is the number of runs, then if the hypothesis of randomness is true (the sequence is made by random selection of 1's and 2's from the set containing $n_1$ 1's and $n_2$ 2's), then

$$P(R = r) = \begin{cases} \dfrac{2\binom{n_1-1}{r/2-1}\binom{n_2-1}{r/2-1}}{\binom{n}{n_1}}, & r \text{ even}, \\[2ex] \dfrac{\binom{n_1-1}{(r-1)/2}\binom{n_2-1}{(r-3)/2} + \binom{n_1-1}{(r-3)/2}\binom{n_2-1}{(r-1)/2}}{\binom{n}{n_1}}, & r \text{ odd}, \end{cases}$$

for $r = 2, 3, \ldots, n$. Here is a hint for solving this: first note that the number of ways to put $n$ objects into $r$ groups with no cell being empty is $\binom{n-1}{r-1}$.

The null hypothesis is that the sequence is random, and alternatives could be one-sided and two sided. Also, under the hypothesis of randomness the symbols 1 and 2 are interchangeable and without loss of generality we assume that $n_1 \le n_2$. The mean and the second and third central moments of $R$ (under the hypothesis of randomness) are

$$\mu_R = 1 + \frac{2n_1n_2}{n}, \qquad \sigma_R^2 = \frac{2n_1n_2(2n_1n_2 - n)}{n^2(n-1)}, \qquad E(R - \mu_R)^3 = -\frac{2n_1n_2(n_2 - n_1)^2(4n_1n_2 - 3n)}{n^3(n-1)(n-2)},$$

and whenever $n_1 > 15$ and $n_2 > 15$ the normal distribution can be used to approximate lower and upper quantiles. Asymptotically, when $n_1 \to \infty$ and $\epsilon \le n_1/(n_1 + n_2) \le 1 - \epsilon$ (for some $0 < \epsilon$ …

…

>> cruz = [1 1 1 2 2 1 1 1 1 2 1 1 2 1 1 1 1 2 2 2 2];
>> [problow,probup,nruns,expectedruns] = runs_test(cruz)
runones = 4
runtwos = 4
trun = 8
n1 = 13
n2 = 8
n = 21
problow = 0.1278
probup = 0.0420
nruns = 8
expectedruns = 10.9048

If the observed number of runs is LESS than expected, problow is $P(R = 2) + \cdots + P(R = \text{nruns})$ and probup is $P(R = n - \text{nruns} + 2) + \cdots + P(R = n)$. Alternatively, if nruns is LARGER than expected, then problow is $P(R = 2) + \cdots + P(R = n - \text{nruns} + 2)$ and probup is $P(R = \text{nruns}) + \cdots + P(R = n)$. In this case, the number of runs (8) was less than expected (10.9048), and the probability of seeing 8 or fewer runs in a random scattering is 0.1278. But this is a two-sided test. This MATLAB test implies we should use $P(R \ge n - \text{nruns} + 2) = P(R \ge 15) = 0.0420$ as the "other tail" to include in the critical region (which would make the p-value equal to 0.1698). But using $P(R \ge 15)$ is slightly misleading, because there is no symmetry in the null distribution of $R$; instead, we suggest using 2*problow = 0.2556 as the p-value for a two-sided test.

Fig. 6.10 Probability distribution of runs under $H_0$.

Example 6.13 The following are 30 time lapses, measured in minutes, between eruptions of Old Faithful geyser in Yellowstone National Park. In the MATLAB code below, forruns stores 2 if the time lapse is above average, otherwise stores 1. The expected number of runs (15.9333) is larger than what was observed (13), and the p-value for the two-sided runs test is 2*0.1678 = 0.3356.

>> oldfaithful = [68 63 66 63 61 44 60 62 71 62 62 55 62 67 73 ...
                  72 55 67 68 65 60 61 71 60 68 67 72 69 65 66];
>> mean(oldfaithful)
ans = 64.1667
>> forruns = (oldfaithful - 64.1667 > 0) + 1
forruns = 2 1 2 1 1 1 1 1 2 1 1 1 1 2 2 2 1 2 2 2 1 1 2 1 2 2 2 2 2 2
>> [problow,probup,nruns,expectedruns] = runs_test(forruns)
runones = 6
runtwos = 7
trun = 13
n1 = 14
n2 = 16
n = 30
problow = 0.1804
probup = 0.1678
nruns = 13
expectedruns = 15.9333

Before we finish with the runs test, we are compelled to make note of its limitations. After its inception by Mood (1940), the runs test was used as a cure-all nonparametric procedure for a variety of problems, including two-sample comparisons. However, it is inferior to more modern tests we will discuss in Chapter 7. More recently, Mogull (1994) showed an anomaly of the one-sample runs test: it is unable to reject the null hypothesis for series of data with run length of two.

6.6 META ANALYSIS

Meta analysis is concerned with combining the inference from several studies performed under similar conditions and experimental design. From each study an "effect size" is derived before the effects are combined and their variability assessed. However, for optimal meta analysis, the analyst needs substantial information about the experiment such as sample sizes, values of the test statistics, the sampling scheme and the test design. Such information is often not provided in the published work. In many cases, only the p-values of particular studies are available to be combined.

Meta analysis based on p-values only is often called nonparametric or omnibus meta analysis because the combined inference does not depend on the form of data, test statistics, or distributions of the test statistics. There are many situations in which such combination of tests is needed. For example, one might be interested in

(i) multiple t tests in testing equality of two treatments versus a one sided alternative. Such tests often arise in function testing and estimation: fMRI, DNA comparison, etc.;

(ii) multiple F tests for equality of several treatment means. The test may not involve the same treatments and parametric meta analysis may not be appropriate; or

(iii) multiple $\chi^2$ tests for testing the independence in contingency tables (see Chapter 9). The table counts may not be given or the tables could be of different size (the same factor of interest could be given at different levels).
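Most of the methods for combining the tests on the basis of their p-values use the facts that (1) under $H_0$, and assuming the test statistics have a continuous distribution, the p-values are uniform and (2) if $G$ is a monotone CDF and $U \sim U(0,1)$, then $G^{-1}(U)$ has distribution $G$. A nice overview can be found in Folks (1984) and the monograph by Hedges and Olkin (1985).

Tippett-Wilkinson Method. If the p-values from $n$ studies, $p_1, p_2, \ldots, p_n$, are ordered in increasing order, $p_{1:n}, p_{2:n}, \ldots, p_{n:n}$, then, for a given $k$, $1 \le k \le n$, the $k$-th smallest p-value, $p_{k:n}$, is distributed $Be(k, n-k+1)$ and

$$p = P(X \le p_{k:n}), \quad X \sim Be(k, n-k+1).$$

Beta random variables are related to the F distribution via

$$P(V \le v) = P\left(W \ge \frac{\alpha}{\beta} \cdot \frac{1-v}{v}\right),$$

for $V \sim Be(\alpha, \beta)$ and $W \sim F(2\beta, 2\alpha)$. Thus, the combined significance level $p$ is

$$p = P\left(X \ge \frac{k}{n-k+1} \cdot \frac{1 - p_{k:n}}{p_{k:n}}\right),$$

where $X \sim F(2(n-k+1), 2k)$. This single $p$ represents a measure of the uniformity of $p_1, \ldots, p_n$ and can be thought of as a combined p-value of all $n$ tests. The nonparametric nature of this procedure is unmistakable. This method was proposed by Tippett (1931) with $k = 1$ and $k = n$, and later generalized by Wilkinson (1951) for arbitrary $k$ between 1 and $n$. For $k = 1$, the test of level $\alpha$ rejects $H_0$ if $p_{1:n} \le 1 - (1-\alpha)^{1/n}$.

Fisher's Inverse $\chi^2$ Method. Maybe the most popular method of combining the p-values is Fisher's inverse $\chi^2$ method (Fisher, 1932). Under $H_0$, the random variable $-2\log p_i$ has a $\chi^2$ distribution with 2 degrees of freedom, so that $\sum_i \chi^2_{k_i}$ is distributed as $\chi^2$ with $\sum_i k_i$ degrees of freedom. The combined p-value is

$$p = P\left(\chi^2_{2n} \ge -2\sum_{i=1}^{n} \log p_i\right).$$

This test is, in fact, based on the product of all p-values due to the fact that

$$-2\sum_i \log p_i = -2\log\prod_i p_i.$$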
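Both combination rules above reduce to a few lines of arithmetic. The following Python sketch is an illustration under stated assumptions rather than the book's code: the function names and the four p-values are hypothetical. It uses two standard identities in place of library calls: a $\chi^2$ variable with even degrees of freedom $2n$ has the closed-form tail $e^{-x/2}\sum_{j=0}^{n-1}(x/2)^j/j!$, and $P(Be(k, n-k+1) \le p)$ equals the probability that at least $k$ of $n$ independent uniforms fall below $p$.

```python
import math

def fisher_combined(pvalues):
    """Fisher's inverse chi-square method: returns (statistic, combined p-value)."""
    n = len(pvalues)
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    half = stat / 2.0
    # Closed-form chi-square tail for even df = 2n:
    # P(chi2_{2n} >= x) = exp(-x/2) * sum_{j=0}^{n-1} (x/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, n):
        term *= half / j
        total += term
    return stat, math.exp(-half) * total

def tippett_wilkinson(pvalues, k=1):
    """Combined p-value P(Be(k, n-k+1) <= p_(k)), computed as a binomial tail."""
    n = len(pvalues)
    pk = sorted(pvalues)[k - 1]          # k-th smallest p-value
    return sum(math.comb(n, j) * pk ** j * (1 - pk) ** (n - j)
               for j in range(k, n + 1))

pvals = [0.04, 0.20, 0.11, 0.35]         # hypothetical p-values from four studies
stat, p_fisher = fisher_combined(pvals)
p_tippett = tippett_wilkinson(pvals, k=1)
```

With $k = 1$ the binomial tail collapses to $1 - (1 - p_{1:n})^n$, which matches the rejection rule $p_{1:n} \le 1 - (1-\alpha)^{1/n}$ stated for Tippett's original proposal. With a single study, Fisher's combined p-value returns the original p-value unchanged.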
Averaging p-Values by Inverse Normals. The following method for combining p-values is based on the fact that if $Z_1, Z_2, \ldots, Z_n$ are i.i.d. $N(0,1)$, then $(Z_1 + Z_2 + \cdots + Z_n)/\sqrt{n}$ is distributed $N(0,1)$, as well. Let $\Phi^{-1}$ denote the inverse function to the standard normal CDF $\Phi$, and let $p_1, p_2, \ldots, p_n$ be the p-values to be averaged. Then the averaged p-value is

$$p = P\left(Z > \frac{\Phi^{-1}(1-p_1) + \cdots + \Phi^{-1}(1-p_n)}{\sqrt{n}}\right),$$

where $Z \sim N(0,1)$. This procedure can be extended by using weighted sums:

$$p = P\left(Z > \frac{\lambda_1\Phi^{-1}(1-p_1) + \cdots + \lambda_n\Phi^{-1}(1-p_n)}{\sqrt{\lambda_1^2 + \cdots + \lambda_n^2}}\right).$$

There are several more approaches to combining the p-values. Good (1955) suggested use of the weighted product

$$-2\sum_i \lambda_i \log p_i = -2\log\prod_i p_i^{\lambda_i},$$

but the distributional theory behind this statistic is complex. Mudholkar and George (1979) suggest transforming the p-values into logits, that is, $\mathrm{logit}(p) = \log(p/(1-p))$. The combined p-value is then obtained from the sum of the logits. As an alternative, Lancaster (1961) proposes a method based on inverse gamma distributions.

Example 6.14 This example is adapted from a presentation by Jessica Utts from University of California, Davis. Two scientists, Professors A and B, each have a theory they would like to demonstrate. Each plans to run a fixed number of Bernoulli trials and then test $H_0: p = 0.25$ versus $H_1: p > 0.25$.

Professor A has access to large numbers of students each semester to use as subjects. He runs the first experiment with 100 subjects, and there are 33 successes (p-value = 0.04). Knowing the importance of replication, Professor A then runs an additional experiment with 100 subjects. He finds 36 successes (p-value = 0.009).

Professor B only teaches small classes. Each quarter, she runs an experiment on her students to test her theory. Results of her ten studies are given in the table below.

At first glance Professor A's theory has much stronger support. After all, the p-values are 0.04 and 0.009. None of the ten experiments of Professor
B was found significant. However, if the results of the experiment for each professor are aggregated, Professor B actually demonstrated a higher level of success than Professor A, with 71 out of 200 as opposed to 69 out of 200 successful trials. The p-values for the combined trials are 0.0017 for Professor A and 0.0006 for Professor B.

     n    # of successes    p-value
    10          4            0.22
    15          6            0.15
    17          6            0.23
    25          8            0.17
    30         10            0.20
    40         13            0.18
    18          7            0.14
    10          5            0.08
    15          5            0.31
    20          7            0.21

Now suppose that reports of the studies have been incomplete and only p-values are supplied. Nonparametric meta analysis performed on the 10 studies of Professor B reveals that the overall omnibus test is significant. The MATLAB code for Fisher's and inverse-normal methods is below; the combined p-values for Professor B are 0.0235 and 0.0021.

>> pvals = [0.22, 0.15, 0.23, 0.17, 0.20, 0.18, 0.14, 0.08, 0.31, 0.21];
>> fisherstat = -2*sum(log(pvals))
fisherstat = 34.4016
>> 1 - chi2cdf(fisherstat, 2*10)
ans = 0.0235
>> 1 - normcdf(sum(norminv(1-pvals))/sqrt(length(pvals)))
ans = 0.0021

6.7 EXERCISES

6.1. Derive the exact distribution of the Kolmogorov test statistic $D_n$ for the case $n = 1$.

6.2. Go to the NIST link below to download 31 measurements of polished window strength data for a glass airplane window. In reliability tests such as this one, researchers rely on parametric distributions to characterize the observed lifetimes, but the normal distribution is not commonly
used. Does this data follow any well-known distribution? Use probability plotting to make your point.
http://www.itl.nist.gov/div898/handbook/eda/section4/eda4291.htm

6.3. Go to the NIST link below to download 100 measurements of the speed of light in air. This classic experiment was carried out by a U.S. Naval Academy teacher, Albert Michelson, in 1879. Do the data appear to be normally distributed? Use three tests (Kolmogorov, Anderson-Darling, Shapiro-Wilk) and compare answers.
http://www.itl.nist.gov/div898/strd/univ/data/Michelso.dat

6.4. Do those little peanut bags handed out during airline flights actually contain as many peanuts as they claim? From a box of peanut bags that have 14g label weights, fifteen bags are sampled and weighed: 16.4, 14.4, 15.5, 14.7, 15.6, 15.2, 15.2, 15.2, 15.3, 15.4, 14.6, 15.6, 14.7, 15.9, 13.9. Are the data approximately normal so that a t-test has validity?

6.5. Generate a sample $S_0$ of size $m = 47$ from the population with normal $N(3,1)$ distribution. Test the hypothesis that the sample is standard normal $H_0: F = F_0 = N(0,1)$ (not at $\mu = 3$) versus the alternative $H_1: F$ …

… $G(x)$.

6.9. Let $X_1, X_2, \ldots, X_{n_1}$ be a sample from a population with distribution $F_X$ and $Y_1, Y_2, \ldots, Y_{n_2}$ be a sample from distribution $F_Y$. If we are interested in testing $H_0: F_X = F_Y$, one possibility is to use the runs test in the following way. Combine the two samples and let $Z_1, Z_2, \ldots, Z_{n_1+n_2}$ denote the respective order statistics. Let dichotomous variables 1 and 2 signify if $Z$ is from the first or the second sample. Generate 50 $U(0,1)$ numbers and 50 $N(0,1)$ numbers. Concatenate and sort them. Keep track of each number's source by assigning 1 if the number came from the uniform distribution and 2 otherwise. Test the hypothesis that the distributions are the same.

6.10. Combine the p-values for Professor B from the meta-analysis example using the Tippett-Wilkinson method with the smallest p-value and Lancaster's Method.

6.11. Derive the exact distribution of the number of runs for $n = 4$ when there are $n_1 = n_2 = 2$ observations of ones and twos. Base your derivation on exhausting all $\binom{4}{2}$ possible outcomes.

6.12. The link below connects you to the Dow-Jones Industrial Average (DJIA) closing values from 1900 to 1993. First column contains the date (yymmdd), second column contains the value. Use the runs test to see
if there is a non-random pattern in the increases and decreases in the sequence of closing values. Consult
http://lib.stat.cmu.edu/datasets/djdc0093

6.13. Recall Exercise 5.1. Repeat the simulation and make a comparison between the two populations using qqplot. Because the sample range has a beta Be(49, 2) distribution, this should be verified with a straight line in the plot.

6.14. The table below displays the accuracy of meteorological forecasts for the city of Marietta, Georgia. Results are supplied for the month of February, 2005. If the forecast differed from the real temperature by more than 3°F, the symbol 1 was assigned. If the forecast was within the error limits (< 3°F), the symbol 2 was assigned. Is it possible to claim that correct and wrong forecasts group at random?

    2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 2 2 1 1 2 2 2 2 2 1 2 2

6.15. Previous records have indicated that the total points of Olympic dives are normally distributed. Here are the records for Men 10-meter Platform Preliminary in 2004. Test the normality of the point distribution. For a computational exercise, generate 1000 sets of 33 normal observations with the same mean and variance as the diving point data. Use the Smirnov test to see how often the p-value corresponding to the test of equal distributions exceeds 0.05. Comment on your results.

    Rank  Name                         Country  Points   Lag
     1    HELM, Mathew                 AUS      513.06
     2    DESPATIE, Alexandre          CAN      500.55    12.51
     3    TIAN, Liang                  CHN      481.47    31.59
     4    WATERFIELD, Peter            GBR      474.03    39.03
     5    PACHECO, Rommel              MEX      463.47    49.59
     6    HU, Jia                      CHN      463.44    49.62
     7    NEWBERY, Robert              AUS      461.91    51.15
     8    DOBROSKOK, Dmitry            RUS      445.68    67.38
     9    MEYER, Heiko                 GER      440.85    72.21
    10    URAN-SALAZAR, Juan G.        COL      439.77    73.29
    11    TAYLOR, Leon                 GBR      433.38    79.68
    12    KALEC, Christopher           CAN      429.72    83.34
    13    GALPERIN, Gleb               RUS      427.68    85.38
    14    DELL'UOMO, Francesco         ITA      426.12    86.94
    15    ZAKHAROV, Anton              UKR      420.30    92.76
    16    CHOE, Hyong Gil              PRK      419.58    93.48
    17    PAK, Yong Ryong              PRK      414.33    98.73
    18    ADAM, Tony                   GER      411.30   101.76
    19    BRYAN, Nickson               MAS      407.13   105.93
    20    MAZZUCCHI, Massimiliano      ITA      405.18   107.88
    21    VOLODKOV, Roman              UKR      403.59   109.47
    22    GAVRIILIDIS, Ioannis         GRE      395.34   117.72
    23    GARCIA, Caesar               USA      388.77   124.29
    24    DURAN, Cassius               BRA      387.75   125.31
    25    GUERRA-OLIVA, Jose Antonio   CUB      375.87   137.19
    26    TRAKAS, Sotirios             GRE      361.56   151.50
    27    VARLAMOV, Aliaksandr         BLR      361.41   151.65
    28    FORNARIS ALVAREZ, Erick      CUB      351.75   161.31
    29    PRANDI, Kyle                 USA      346.53   166.53
    30    MAMONTOV, Andrei             BLR      338.55   174.51
    31    DELALOYE, Jean Romain        SUI      326.82   186.24
    32    PARISI, Hugo                 BRA      325.08   187.98
    33    HAJNAL, Andras               HUN      305.79   207.27

6.16. Consider the Cramér-von Mises test statistic with $\psi(x) = 1$. With a sample of $n = 1$, derive the test statistic distribution and show that it is maximized at $X = 1/2$.

6.17. Generate two samples $S_1$ and $S_2$, of sizes $m = 30$ and $n = 40$, from the uniform distribution. Square the observations in the second sample. What is the theoretical distribution of the squared uniforms? Next, "forget" that you squared the second sample and test equality of the distributions. Repeat this testing procedure (with new samples, of course) 1000 times. What proportion of p-values exceeded 5%?

6.18. Recall the Gumbel distribution (or extreme value distribution) from Chapter 5. Linearize the CDF of the Gumbel distribution to show how a probability plot could be constructed.

REFERENCES

Anderson, T. W., and Darling, D. A. (1954), "A Test of Goodness of Fit," Journal of the American Statistical Association, 49, 765-769.

Birnbaum, Z. W., and Tingey, F. (1951), "One-sided Confidence Contours for Probability Distribution Functions," Annals of Mathematical Statistics, 22, 592-596.

D'Agostino, R. B., and Stephens, M. A. (1986), Goodness-of-Fit Techniques, New York: Marcel Dekker.

Feller, W. (1948), "On the Kolmogorov-Smirnov Theorems," Annals of Mathematical Statistics, 19, 177-189.

Fisher, R. A. (1932), Statistical Methods for Research Workers, 4th ed., Edinburgh, UK: Oliver and Boyd.

Folks, J. L. (1984), "Combination of Independent Tests," in Handbook of Statistics 4, Nonparametric Methods, Eds. P. R. Krishnaiah and P. K. Sen, Amsterdam, North-Holland: Elsevier Science, pp. 113-121.

Good, I. J. (1955), "On the Weighted Combination of Significance Tests," Journal of the Royal Statistical Society (B), 17, 264-265.

Hedges, L. V., and Olkin, I. (1985), Statistical Methods for Meta-Analysis, New York: Academic Press.

Kolmogorov, A. N. (1933), "Sulla Determinazione Empirica di Una Legge di Distribuzione," Giornio Instituto Italia Attuari, 4, 83-91.

Lancaster, H. O. (1961), "The Combination of Probabilities: An Application of Orthonormal Functions," Australian Journal of Statistics, 3, 20-33.

Miller, L. H. (1956), "Table of percentage points of Kolmogorov Statistics," Journal of the American Statistical Association, 51, 111-121.

Mogull, R. G. (1994), "The one-sample runs test: A category of exception," Journal of Educational and Behavioral Statistics, 19, 296-303.

Mood, A. (1940), "The distribution theory of runs," Annals of Mathematical Statistics, 11, 367-392.

Mudholkar, G. S., and George, E. O. (1979), "The Logit Method for Combining Probabilities," in Symposium on Optimizing Methods in Statistics, ed. J. Rustagi, New York: Academic Press, pp. 343-366.

Pearson, K. (1902), "On the Systematic Fitting of Curves to Observations and Measurements," Biometrika, 1, 265-303.

Roeder, K. (1990), "Density Estimation with Confidence Sets Exemplified by Superclusters and Voids in the Galaxies," Journal of the American Statistical Association, 85, 617-624.

Shapiro, S. S., and Wilk, M. B. (1965), "An Analysis of Variance Test for Normality (Complete Samples)," Biometrika, 52, 591-611.

Smirnov, N. V. (1939a), "On the Derivations of the Empirical Distribution Curve," Matematicheskii Sbornik, 6, 2-26.

Smirnov, N. V. (1939b), "On the Estimation of the Discrepancy Between Empirical Curves of Distribution for Two Independent Samples," Bulletin Moscow University, 2, 3-16.

Stephens, M. A. (1974), "EDF Statistics for Goodness of Fit and Some Comparisons," Journal of the American Statistical Association, 69, 730-737.

Stephens, M. A. (1976), "Asymptotic Results for Goodness-of-Fit Statistics with Unknown Parameters," Annals of Statistics, 4, 357-369.

Tippett, L. H. C. (1931), The Method of Statistics, 1st ed., London: Williams and Norgate.

Wilkinson, B. (1951), "A Statistical Consideration in Psychological Research," Psychological Bulletin, 48, 156-158.
7

Rank Tests

Each of us has been doing statistics all his life, in the sense that each of us has been busily reaching conclusions based on empirical observations ever since birth.

William Kruskal

All those old basic statistical procedures (the t-test, the correlation coefficient, the analysis of variance (ANOVA)) depend strongly on the assumption that the sampled data (or the sufficient statistics) are distributed according to a well-known distribution. Hardly the fodder for a nonparametrics textbook. But for every classical test, there is a nonparametric alternative that does the same job with fewer assumptions made of the data. Even if the assumptions from a parametric model are modest and relatively non-constraining, they will undoubtedly be false in the most pure sense. Life, along with your experimental data, is too complicated to fit perfectly into a framework of i.i.d. errors and exact normal distributions.

Mathematicians have been researching ranks and order statistics since ages ago, but it wasn't until the 1940s that the idea of rank tests gained prominence in the statistics literature. Hotelling and Pabst (1936) wrote one of the first papers on the subject, focusing on rank correlations.

There are nonparametric procedures for one sample, for comparing two or more samples, matched samples, bivariate correlation, and more. The key to evaluating data in a nonparametric framework is to compare observations based on their ranks within the sample rather than entrusting the
actual data measurements to your analytical verdicts.

Fig. 7.1 Frank Wilcoxon (1892-1965), Henry Berthold Mann (1905-2000), and Professor Emeritus Donald Ransom Whitney

The following table shows nonparametric counterparts to the well known parametric procedures (WSiRT/WSuRT stands for Wilcoxon Signed/Sum Rank Test).

PARAMETRIC                             NON-PARAMETRIC
Pearson coefficient of correlation     Spearman coefficient of correlation
One sample t-test for the location     sign test, WSiRT
paired t-test                          sign test, WSiRT
two sample t-test                      WSuRT, Mann-Whitney
ANOVA                                  Kruskal-Wallis Test
Block Design ANOVA                     Friedman Test

To be fair, it should be said that many of these nonparametric procedures come with their own set of assumptions. We will see, in fact, that some of them are rather obtrusive on an experimental design. Others are much less so. Keep this in mind when a nonparametric test is touted as "assumption free". Nothing in life is free.

In addition to properties of ranks and the basic sign test, in this chapter we will present the following nonparametric procedures:

- Spearman Coefficient: Two-sample correlation statistic.
- Wilcoxon Test: One-sample median test (also see Sign Test).
- Wilcoxon Sum Rank Test: Two-sample test of distributions.
- Mann-Whitney Test: Two-sample test of medians.
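7.1 PROPERTIES OF RANKS

Let $X_1, X_2, \dots, X_n$ be a sample from a population with continuous CDF $F_X$. The nonparametric procedures are based on how observations within the sample are ranked, whether in terms of a parameter $\mu$ or another sample. The ranks connected with the sample $X_1, X_2, \dots, X_n$, denoted as $r(X_1), r(X_2), \dots, r(X_n)$, are defined as

$$r(X_i) = \#\{X_j \mid X_j \le X_i,\ j = 1, \dots, n\}.$$

Equivalently, ranks can be defined via the order statistics of the sample, $r(X_{i:n}) = i$, or

$$r(X_i) = \sum_{j=1}^{n} I(X_i \ge X_j).$$

Since $X_1, \dots, X_n$ is a random sample, it is true that $X_1, \dots, X_n \overset{d}{=} X_{\pi_1}, \dots, X_{\pi_n}$, where $\pi_1, \dots, \pi_n$ is a permutation of $1, 2, \dots, n$ and $\overset{d}{=}$ denotes equality in distribution. Consequently, $P(r(X_i) = j) = 1/n$, $1 \le j \le n$, i.e., ranks in an i.i.d. sample are distributed as discrete uniform random variables. Corresponding to the data $r_i$, let $R_i = r(X_i)$, the rank of the random variable $X_i$.

From Chapter 2, the properties of integer sums lead to the following properties for ranks:

(i) $\mathbb{E}(R_i) = \sum_{j=1}^{n} \frac{j}{n} = \frac{n+1}{2}$
(ii) $\mathbb{E}(R_i^2) = \sum_{j=1}^{n} \frac{j^2}{n} = \frac{(n+1)(2n+1)}{6}$
(iii) $\mathrm{Var}(R_i) = \frac{n^2 - 1}{12}$
(iv) $\mathbb{E}(X_i R_i) = \frac{1}{n}\sum_{i=1}^{n} i\,\mathbb{E}(X_{i:n})$          (7.1)

where

$$\mathbb{E}(X_i R_i) = \mathbb{E}\big(\mathbb{E}(R_i X_i \mid R_i = k)\big) = \mathbb{E}\big(\mathbb{E}(k X_{k:n})\big) = \frac{1}{n}\sum_{i=1}^{n} i\,\mathbb{E}(X_{i:n}).$$

In the case of ties, it is customary to average the tied rank values. The MATLAB procedure ranks does just that:

>> ranks([3 1 4 1 5 9 2 6 5 3 5 8 9])
ans =
  Columns 1 through 7
    4.5000    1.5000    6.0000    1.5000    8.0000   12.5000    3.0000
  Columns 8 through 13
   10.0000    8.0000    4.5000    8.0000   11.0000   12.5000

Property (iv) can be used to find the correlation between observations and their ranks. Such correlation depends on the sample size and the underlying distribution. For example, for $X \sim U(0,1)$, $\mathbb{E}(X_i R_i) = (2n+1)/6$, which gives $\mathrm{Cov}(X_i, R_i) = (n-1)/12$ and $\mathrm{Corr}(X_i, R_i) = \sqrt{(n-1)/(n+1)}$.

With two samples, comparisons between populations can be made in a nonparametric way by comparing ranks for the combined ordered samples. Rank statistics that are made up of sums of indicator variables comparing items from one sample with those of the other are called linear rank statistics.

7.2 SIGN TEST

Suppose we are interested in testing the hypothesis $H_0$ that a population with continuous CDF has a median $m_0$ against one of the alternatives $H_1: m > m_0$, $H_1: m < m_0$, or $H_1: m \ne m_0$. Designate the sign + when $X_i > m_0$ (i.e., when the difference $X_i - m_0$ is positive), and the sign - otherwise. For continuous distributions, the case $X_i = m_0$ (a tie) is theoretically impossible, although in practice ties are often possible, and this feature can be accommodated. For now, we assume the ideal situation in which the ties are not present.

Assumptions: Actually, no assumptions are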
necessary for the sign test other than the data are at least ordinal.

If $m_0$ is the median, i.e., if $H_0$ is true, then by definition of the median, $P(X_i > m_0) = P(X_i < m_0) = 1/2$. If we let $T$ be the total number of + signs, that is,

$$T = \sum_{i=1}^{n} I(X_i > m_0),$$

then $T \sim \mathrm{Bin}(n, 1/2)$.

Let the level of test, $\alpha$, be specified. When the alternative is $H_1: m > m_0$, the critical values of $T$ are integers larger than or equal to $t_\alpha$, which is defined as the smallest integer for which

$$P(T \ge t_\alpha \mid H_0) = \sum_{t = t_\alpha}^{n} \binom{n}{t}\left(\frac{1}{2}\right)^{n} < \alpha.$$

Likewise, if the alternative is $H_1: m < m_0$, [...]

If the alternative is $H_1: m > m_0$, large values of $T$ serve as evidence against $H_0$ and the p-value is

$$\sum_{i=T}^{n} \binom{n}{i} 2^{-n}.$$

When testing against the alternative $H_1: m < m_0$, [...] which is the total number of strictly positive differences.

For two population means it is true that the hypothesis of equality of means is equivalent to the hypothesis that the mean of the population differences is equal to zero. This is not always true for the test of medians. That is, if $D = X - Y$, then it is quite possible that $m_D \ne m_X - m_Y$. With the sign test we are not testing the equality of two medians, but whether the median of the difference is 0.

Under $H_0$: equal population medians, $\mathbb{E}(T) = \sum_i P(X_i > Y_i) = n/2$ and $\mathrm{Var}(T) = n\,\mathrm{Var}(I(X > Y)) = n/4$. With large enough $n$, $T$ is approximately normal, so for the statistical test of $H_1$: the medians are not equal, we would reject $H_0$ if $T$ is far enough away from $n/2$; that is,

$$z_0 = \frac{T - n/2}{\sqrt{n}/2}: \quad \text{reject } H_0 \text{ if } |z_0| > z_{\alpha/2}.$$

Example 7.1 According to The Rothstein Catalog on Disaster Recovery, the median number of violent crimes per state dropped from the year 1999 to 2000. Of 50 states, if $X_i$ is the number of violent crimes in state $i$ in 1999 and $Y_i$ is the number for 2000, the sample differences are $X_i - Y_i$. The number of violent crimes decreased in 38 out of 50 states in one year. With $T = 38$ and $n = 50$, we find $z_0 = 3.67$, which has a p-value of 0.00012 for the one-sided test (medians decreased over the year) or 0.00024 for the two-sided test.

Example 7.2 Let $X_1$ and $X_2$ be independent random variables distributed as Poisson with parameters $\lambda_1$ and $\lambda_2$. We would like to test the hypothesis $H_0: \lambda_1 = \lambda_2\ (= \lambda)$. If $H_0$ is true, and if we observe $X_1$ and $X_2$ with $X_1 + X_2 = n$, then testing $H_0$ is exactly the sign test, with $T = X_1$. Indeed,

$$P(X_1 = k \mid X_1 + X_2 = n) = \frac{\frac{\lambda^{k} e^{-\lambda}}{k!} \cdot \frac{\lambda^{n-k} e^{-\lambda}}{(n-k)!}}{\frac{(2\lambda)^{n} e^{-2\lambda}}{n!}} = \binom{n}{k}\left(\frac{1}{2}\right)^{n}.$$

For instance, if $X_1 = 10$ and $X_2 = 20$ are observed, then the p-value for the two-sided alternative $H_1: \lambda_1 \ne \lambda_2$ is $2\sum_{i=0}^{10}\binom{30}{i}\left(\frac{1}{2}\right)^{30} = 2 \cdot 0.0494 = 0.0987$.

Example 7.3 Hogmanay Celebration [1]. Roger van Gompel and Shona Falconer at the University of Dundee conducted an experiment to examine the drinking patterns of Members of the Scottish Parliament over the festive holiday season.

[1] Hogmanay is the Scottish New Year, celebrated on 31st December every year. The night involves a celebratory drink or two, fireworks and kissing complete strangers (not necessarily in that order).

Being elected to the Scottish Parliament is likely to have created in members a sense of stereotypical conformity so that they appear to fit in with the traditional ways of Scotland, pleasing the tabloid newspapers and ensuring popular support. One stereotype of the Scottish people is that they drink a lot of whisky, and that they enjoy celebrating both Christmas and Hogmanay. However, it is possible that members of parliament tend to drink more whisky at one of these times compared to the other, and an investigation into this was carried out.

The measure used to investigate any such bias was the number of units of single malt scotch whisky ("drams") consumed over two 48-hour periods: Christmas Eve/Christmas Day and Hogmanay/New Year's Day. The hypothesis is that Members of the Scottish Parliament drink a significantly different amount of whisky over Christmas than over Hogmanay (either consistently more or consistently less). The following data were collected.

MSP                   1   2   3   4   5   6   7   8   9
Drams at Christmas    2   3   3   2   4   0   3   6   2
Drams at Hogmanay     5   1   5   6   4   7   5   9   0

MSP                  10  11  12  13  14  15  16  17  18
Drams at Christmas    2   5   4   3   6   0   3   3   0
Drams at Hogmanay     4  15   6   8   9   0   6   5  12

The MATLAB function sign_test1 lists five summary statistics from the data for the sign test. The first is a p-value based on randomly assigning a '+' or '-' to tied values (see next subsection), and the second is the p-value based on the normal approximation, where ties are counted as half. n is the number of non-tied observations, plus is the number of plusses in x - y, and tie is the number of tied observations.

>> x = [2 3 3 2 4 0 3 6 2 2 5 4 3 6 0 3 3 0];
>> y = [5 1 5 6 4 7 5 9 0 4 15 6 8 9 0 6 5 12];
>> [p1 p2 n plus tie] = sign_test1(x', y')
p1 =
    0.0021
p2 =
    0.0030
n =
    16
plus =
     2
tie =
     2

7.2.2 Treatments of Ties

Tied data present numerous problems in derivations of nonparametric methods, and are frequently encountered in real-world data. Even when observations are generated from a continuous distribution, due to limited precision on measurement and application, ties may appear. To deal with ties, MATLAB does one of three things via the third input in sign_test1:

R   Randomly assigns '+' or '-' to tied values
C   Uses least favorable assignment in terms of $H_0$
I   Ignores tied values in test statistic computation

The preferable way to deal with ties is the first option (to randomize). Another equivalent way to deal with ties is to add a slight bit of "noise" to the data. That is, complete the sign test after modifying $D$ by adding a small enough random variable that will not affect the ranking of the differences; i.e., $D_i' = D_i + \epsilon_i$, where $\epsilon_i \sim N(0, 0.0001)$. Using the second or third options in sign_test1 will lead to biased or misleading results, in general.

7.3 SPEARMAN COEFFICIENT OF RANK CORRELATION

Charles Edward Spearman (Figure 7.2) was a late bloomer, academically. He received his Ph.D. at the age of 48, after serving as an officer in the British army for 15 years. He is most famous in the field of psychology, where he theorized that "general intelligence" was a function of a comprehensive mental competence rather than a collection of multi-faceted mental abilities. His theories eventually led to the development of factor analysis.

Spearman (1904) proposed the rank correlation coefficient long before statistics became a scientific discipline. For bivariate data, an observation has two coupled components $(X, Y)$ that may or may not be related to each other. Let $\rho = \mathrm{Corr}(X, Y)$ represent the unknown correlation between the two components. In a sample of $n$, let $R_1, \dots, R_n$ denote the ranks for the first component $X$ and $S_1, \dots, S_n$ denote the ranks for $Y$. For example, if $x_1 = x_{n:n}$ is the largest value from $x_1, \dots, x_n$ and $y_1 = y_{1:n}$ is the smallest
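value from $y_1, \dots, y_n$, then $(r_1, s_1) = (n, 1)$.

Fig. 7.2 Charles Edward Spearman (1863-1945) and Maurice George Kendall (1907-1983)

Corresponding to Pearson's (parametric) coefficient of correlation, the Spearman coefficient of correlation is defined as

$$\hat\rho = \frac{\sum_{i=1}^{n} (R_i - \bar R)(S_i - \bar S)}{\sqrt{\sum_{i=1}^{n} (R_i - \bar R)^2 \cdot \sum_{i=1}^{n} (S_i - \bar S)^2}}.$$

This expression can be simplified. From (7.1), $\bar R = \bar S = (n+1)/2$, and $\sum (R_i - \bar R)^2 = \sum (S_i - \bar S)^2 = n\,\mathrm{Var}(R_i) = n(n^2 - 1)/12$. Define $D$ as the difference between ranks, i.e., $D_i = R_i - S_i$. With $\bar R = \bar S$, we can see that

$$D_i = (R_i - \bar R) - (S_i - \bar S),$$

and

$$\sum_{i=1}^{n} D_i^2 = \sum_{i=1}^{n} (R_i - \bar R)^2 + \sum_{i=1}^{n} (S_i - \bar S)^2 - 2\sum_{i=1}^{n} (R_i - \bar R)(S_i - \bar S),$$

that is,

$$\sum_{i=1}^{n} (R_i - \bar R)(S_i - \bar S) = \frac{n(n^2 - 1)}{12} - \frac{1}{2}\sum_{i=1}^{n} D_i^2.$$

By dividing both sides of the equation with $\sqrt{\sum_{i=1}^{n} (R_i - \bar R)^2 \cdot \sum_{i=1}^{n} (S_i - \bar S)^2} = \sum_{i=1}^{n} (R_i - \bar R)^2 = n(n^2 - 1)/12$, we obtain

$$\hat\rho = 1 - \frac{6\sum_{i=1}^{n} D_i^2}{n(n^2 - 1)}.$$

Consistent with Pearson's coefficient of correlation (the standard parametric measure of covariance), the Spearman coefficient of correlation ranges between -1 and 1. If there is perfect agreement, that is, all the differences are 0, then $\hat\rho = 1$. The scenario that maximizes $\sum D_i^2$ occurs when ranks are perfectly opposite: $r_i = n - s_i + 1$.

If the sample is large enough, the Spearman statistic can be approximated using the normal distribution. It was shown that if $n > 10$,

$$Z = (\hat\rho - \rho)\sqrt{n - 1} \sim N(0, 1).$$

Assumptions: Actually, no assumptions are necessary for testing $\rho$ other than the data are at least ordinal.

Example 7.4 Stichler, Richey, and Mandel (1953) list tread wear for tires (see table below), each tire measured by two methods based on (a) weight loss and (b) groove wear. In MATLAB, the function spear(x,y) computes the Spearman coefficient. For this example, $\hat\rho = 0.9265$. Note that if we opt for the parametric measure of correlation, the Pearson coefficient is 0.948.

Weight   Groove   Weight   Groove
45.9     35.7     41.9     39.2
37.5     31.1     33.4     28.1
31.0     24.0     30.5     28.7
30.9     25.9     31.9     23.3
30.4     23.1     27.3     23.7
20.4     20.9     24.5     16.1
20.9     19.9     18.9     15.2
13.7     11.5     11.4     11.2

Ties in the data: The statistics in (7.1) and (7.2) are not designed for paired data that include tied measurements. If ties exist in the data, a simple adjustment should be made. Define $u' = \sum u(u^2 - 1)/12$ and $v' = \sum v(v^2 - 1)/12$, where the $u$'s and $v$'s are the numbers of observations that share each tied (e.g., averaged) rank for $X$ and $Y$, respectively. Then,

$$\hat\rho' = \frac{n(n^2 - 1) - 6\sum_{i=1}^{n} D_i^2 - 6(u' + v')}{\left\{[n(n^2 - 1) - 12u'][n(n^2 - 1) - 12v']\right\}^{1/2}}$$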
and it holds that, for large $n$,

$$Z = (\hat\rho' - \rho)\sqrt{n - 1} \sim N(0, 1).$$

7.3.1 Kendall's Tau

Kendall (1938) derived an alternative measure of bivariate dependence by finding out how many pairs in the sample are "concordant", which means the signs between $X$ and $Y$ agree in the pairs. That is, out of $\binom{n}{2}$ pairs such as $(X_i, Y_i)$ and $(X_j, Y_j)$, we compare the sign of $(X_i - X_j)$ to that of $(Y_i - Y_j)$. Pairs for which one sign is plus and the other is minus are "discordant".

The Kendall's $\tau$ statistic is defined as

$$\tau = \frac{2 S_\tau}{n(n-1)}, \qquad S_\tau = \sum_{i=1}^{n}\sum_{j=i+1}^{n} \mathrm{sign}\{r_j - r_i\},$$

where the $r_i$s are defined via ranks of the second sample corresponding to the ordered ranks of the first sample, $\{1, 2, \dots, n\}$, that is,

$$\begin{pmatrix} 1 & 2 & \dots & n \\ r_1 & r_2 & \dots & r_n \end{pmatrix}.$$

In this notation $\sum D_i^2$ from the Spearman's coefficient of correlation becomes $\sum_{i=1}^{n} (r_i - i)^2$. In terms of the number of concordant ($n_c$) and discordant ($n_d = \binom{n}{2} - n_c$) pairs,

$$\tau = \frac{n_c - n_d}{\binom{n}{2}},$$

and in the case of ties, use

$$\tau = \frac{n_c - n_d}{n_c + n_d}.$$

Example 7.5 Trends in Indiana's water use from 1986 to 1996 were reported by Arvin and Spaeth (1997) for Indiana Department of Natural Resources. About 95% of the surface water taken annually is accounted for by two categories: surface water withdrawal and ground-water withdrawal. Kendall's tau statistic showed no apparent trend in total surface water withdrawal over time (p-value ≈ 0.59), but ground-water withdrawal increased slightly over the 10 year span (p-value ≈ 0.13).

>> x = (1986:1996);
>> y1 = [2.96, 3.00, 3.12, 3.22, 3.21, 2.96, 2.89, 3.04, 2.99, 3.08, 3.12];
>> y2 = [0.175, 0.173, 0.197, 0.182, 0.176, 0.205, 0.188, 0.186, 0.202, ...
         0.208, 0.213];
>> y1_rank = ranks(y1); y2_rank = ranks(y2);
>> n = length(x); S1 = 0; S2 = 0;
>> for i = 1:n-1
     for j = i+1:n
       % x is already sorted, so the later rank minus the earlier rank
       % is positive for a concordant pair
       S1 = S1 + sign(y1_rank(j) - y1_rank(i));
       S2 = S2 + sign(y2_rank(j) - y2_rank(i));
     end
   end
>> ktau1 = 2*S1/(n*(n-1))
ktau1 =
    0.0909
>> ktau2 = 2*S2/(n*(n-1))
ktau2 =
    0.6364

With large sample size $n$, we can use the following z-statistic as a normal approximation:

$$z_\tau = \frac{3\tau\sqrt{n(n-1)}}{\sqrt{2(2n+5)}}.$$

This can be used to test the null hypothesis of zero correlation between the populations. Kendall's tau is a natural measure of the relationship between $X$ and $Y$. We can describe it as an odds-ratio by noting that

$$\frac{1 + \tau}{1 - \tau} = \frac{P(C)}{P(D)},$$

where $C$ is the event that any pair in the population is concordant, and $D$ is the event any pair is discordant. Spearman's coefficient, on the other hand, cannot be explained this way. For example, in a population with $\tau = 1/3$, any two sets of observations are twice as likely to be concordant than discordant. On the other hand, computations for $\tau$ grow as $O(n^2)$, compared to the Spearman coefficient, that grows as $O(n \ln n)$.

7.4 WILCOXON SIGNED RANK TEST

Recall that the sign test can be used to test differences in medians for two independent samples. A major shortcoming of the sign test is that only the sign of $D_i = X_i - m_0$, or $D_i = X_i - Y_i$ (depending if we have a one- or two-sample problem) contributes to the test statistics. Frank Wilcoxon suggested that, in addition to the sign, the absolute value of the discrepancy between
the pairs should matter as well, and it could increase the efficiency of the sign test.

Suppose that, as in the sign test, we are interested in testing the hypothesis that a median of the unknown distribution is $m_0$. We make an important assumption of the data.

Assumption: The differences $D_i$, $i = 1, \dots, n$ are symmetrically distributed about 0.

This implies that positive and negative differences are equally likely. For this test, the absolute values of the differences $(|D_1|, |D_2|, \dots, |D_n|)$ are ranked. The idea is to use $(|D_1|, |D_2|, \dots, |D_n|)$ as a set of weights for comparing the differences between $(S_1, \dots, S_n)$.

Under $H_0$ (the median of distribution is $m_0$), the expectation of the sum of positive differences should be equal to the expectation of the sum of the negative differences. Define

$$T^+ = \sum_{i=1}^{n} S_i\, r(|D_i|), \qquad T^- = \sum_{i=1}^{n} (1 - S_i)\, r(|D_i|),$$

where $S_i \equiv S(D_i) = I(D_i > 0)$. Thus $T^+ + T^- = \sum_{i=1}^{n} i = n(n+1)/2$ and

$$T = T^+ - T^- = 2\sum_{i=1}^{n} r(|D_i|)\,S_i - n(n+1)/2. \qquad (7.3)$$

Under $H_0$, $(S_1, \dots, S_n)$ are i.i.d. Bernoulli random variables with $p = 1/2$, independent of the corresponding magnitudes. Thus, when $H_0$ is true, $\mathbb{E}(T^+) = n(n+1)/4$ and $\mathrm{Var}(T^+) = n(n+1)(2n+1)/24$. Quantiles for $T^+$ are listed in Table 7.9. In MATLAB, the signed rank test based on $T^+$ is wilcoxon_signed2.

Large sample tests are typically based on a normal approximation of the test statistic, which is even more effective if there are ties in the data.

Rule: For the Wilcoxon signed-rank test, it is suggested to use $T$ from (7.3) instead of $T^+$ in the case of large-sample approximation.

In this case, $\mathbb{E}(T) = 0$ and $\mathrm{Var}(T) = \sum_i R(|D_i|)^2 = n(n+1)(2n+1)/6$ under $H_0$. Normal quantiles
can be used to evaluate p-values of the observed statistics $T$ with respect to a particular alternative (see the m-file wilcoxon_signed).

Table 7.9 Quantiles of $T^+$ for the Wilcoxon signed rank test

 n   0.01  0.025  0.05      n   0.01  0.025  0.05
 8     2     4     6       24    70    82    92
 9     4     6     9       25    77    90   101
10     6     9    11       26    85    99   111
11     8    11    14       27    94   108   120
12    10    14    18       28   102   117   131
13    13    18    22       29   111   127   141
14    16    22    26       30   121   138   152
15    20    26    31       31   131   148   164
16    24    30    36       32   141   160   176
17    28    35    42       33   152   171   188
18    33    41    48       34   163   183   201
19    38    47    54       35   175   196   214
20    44    53    61       36   187   209   228
21    50    59    68       37   199   222   242
22    56    67    76       38   212   236   257
23    63    74    84       39   225   250   272

Example 7.6 Twelve sets of identical twins underwent psychological tests to measure the amount of aggressiveness in each person's personality. We are interested in comparing the twins to each other to see if the first born twin tends to be more aggressive than the other. The results are as follows; the higher score indicates more aggressiveness.

first born $X_i$:    86  71  77  68  91  72  77  91  70  71  88  87
second twin $Y_i$:   88  77  76  64  96  72  65  90  65  80  81  72

The hypotheses are: $H_0$: the first twin does not tend to be more aggressive than the other, that is, $\mathbb{E}(X_i) \le \mathbb{E}(Y_i)$, and $H_1$: the first twin tends to be more aggressive than the other, i.e., $\mathbb{E}(X_i) > \mathbb{E}(Y_i)$. The Wilcoxon signed-rank test is appropriate if we assume that $D_i = X_i - Y_i$ are independent, symmetric, and have the same mean. Below is the output of wilcoxon_signed, where $T$ statistics have been used.

>> fb = [86 71 77 68 91 72 77 91 70 71 88 87];
>> sb = [88 77 76 64 96 72 65 90 65 80 81 72];
>> [t1, z1, p] = wilcoxon_signed(fb, sb, 1)
t1 =
    17        % value of T
z1 =
    0.7565    % value of Z
p =
    0.2382    % p-value of the test

The following is the output of wilcoxon_signed2 where $T^+$ statistics have been used. The p-values are identical, and there is insufficient evidence to conclude the first twin is more aggressive than the next.

>> [t2, z2, p] = wilcoxon_signed2(fb, sb, 1)
t2 =
   41.5000    % value of T^+
z2 =
    0.7565
p =
    0.2382

7.5 WILCOXON (TWO-SAMPLE) SUM RANK TEST

The Wilcoxon Sum Rank Test (WSuRT) is often used in place of a two sample t-test when the populations being compared are not normally distributed. It requires independent random samples of sizes $n_1$ and $n_2$.

Assumption: Actually, no additional assumptions are needed for the Wilcoxon two-sample test.

An example of the sort of data for which this test could be used is responses on a Likert scale (e.g., 1 = much worse, 2 = worse, 3 = no change, 4 = better, 5 = much better). It would be inappropriate to use the t-test for such data because it is only of an ordinal nature. The Wilcoxon rank sum test tells us more generally whether the groups are homogeneous or one group is "better" than the other. More generally, the basic null hypothesis of the Wilcoxon sum rank test is that the two populations are equal. That is, $H_0: F_X(x) = F_Y(x)$. This test assumes that the shapes of the distributions are similar.

Let $X = X_1, \dots, X_{n_1}$ and $Y = Y_1, \dots, Y_{n_2}$ be two samples from populations that we want to compare. The $n = n_1 + n_2$ ranks are assigned as they were in the sign test. The test statistic $W_n$ is the sum of ranks (1 to $n$) for $X$. For example, if $X_1 = 1$, $X_2 = 13$, $X_3 = 7$, $X_4 = 9$, and $Y_1 = 2$, $Y_2 = 0$, $Y_3 = 18$, then the value of $W_n$ is 2 + 4 + 5 + 6 = 17.

If the two populations have the same distribution then the sum of the ranks of the first sample and those in the second sample should be the same relative to their sample sizes. Our test statistic is

$$W_n = \sum_{i=1}^{n} i\,S_i(X, Y),$$

where $S_i(X, Y)$ is an indicator function defined as 1 if the $i$th ranked observation is from the first sample and as 0 if the observation is from the second sample. If there are no ties, then under $H_0$,

$$\mathbb{E}(W_n) = \frac{n_1(n+1)}{2}, \qquad \mathrm{Var}(W_n) = \frac{n_1 n_2 (n+1)}{12}.$$
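The statistic $W_n$ achieves its minimum when the first sample is entirely smaller than the second, and its maximum when the opposite occurs:

$$\min W_n = \sum_{i=1}^{n_1} i = \frac{n_1(n_1+1)}{2}, \qquad \max W_n = \sum_{i=n-n_1+1}^{n} i = \frac{n_1(2n - n_1 + 1)}{2}.$$

The exact distribution of $W_n$ is computed in a tedious but straightforward manner. The probabilities for $W_n$ are symmetric about the value of $\mathbb{E}(W_n) = n_1(n+1)/2$.

Example 7.7 Suppose $n_1 = 2$, $n_2 = 3$, and of course $n = 5$. There are $\binom{5}{2} = \binom{5}{3} = 10$ distinguishable configurations of the vector $(S_1, S_2, \dots, S_5)$. The minimum of $W_5$ is 3 and the maximum is 9. Table 7.10 gives the values for $W_5$ in this example, along with the configurations of ones in the vector $(S_1, S_2, \dots, S_5)$ and the probability under $H_0$. Notice the symmetry in probabilities about $\mathbb{E}(W_5)$.

Table 7.10 Distribution of $W_5$ when $n_1 = 2$ and $n_2 = 3$.

W_5   configuration    probability
3     (1,2)            1/10
4     (1,3)            1/10
5     (1,4), (2,3)     2/10
6     (1,5), (2,4)     2/10
7     (2,5), (3,4)     2/10
8     (3,5)            1/10
9     (4,5)            1/10

Let $k_{n,n_1}(m)$ be the number of all arrangements of zeroes and ones in $(S_1(X,Y), \dots, S_n(X,Y))$ such that $W_n = \sum_{i=1}^{n} i\,S_i(X,Y) = m$. Then the probability distribution

$$P(W_n = m) = \frac{k_{n,n_1}(m)}{\binom{n}{n_1}}, \qquad \frac{n_1(n_1+1)}{2} \le m \le \frac{n_1(2n - n_1 + 1)}{2},$$

can be used to perform an exact test. Deriving this distribution is no trivial matter, mind you. When $n$ is large, the calculation of exact distribution of $W_n$ is cumbersome.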
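For small samples the exact null distribution of $W_n$ can be tabulated by brute force, which gives a quick check of Table 7.10. The sketch below is not one of the m-files that accompany this text; it assumes only base MATLAB (nchoosek, arrayfun), and all variable names are ours. It lists every way the $n_1$ ranks of the first sample can be chosen from $\{1, \dots, n\}$ and counts how often each rank sum occurs.

```matlab
% Exact null distribution of the Wilcoxon rank sum W_n by enumeration.
% Illustrative sketch; sample sizes are those of Example 7.7.
n1 = 2;  n2 = 3;  n = n1 + n2;

C = nchoosek(1:n, n1);     % each row: the ranks occupied by the first sample
W = sum(C, 2);             % rank sum for each configuration

w      = (min(W):max(W))';                 % attainable values of W_n
counts = arrayfun(@(v) sum(W == v), w);    % number of configurations per value
prob   = counts / size(C, 1);              % all configurations equally likely under H0

disp([w counts prob]);     % columns: value, count, probability

EW = sum(w .* prob);       % should equal n1*(n+1)/2
fprintf('E(W) = %g\n', EW);
```

For $n_1 = 2$, $n_2 = 3$ the counts are 1, 1, 2, 2, 2, 1, 1 for $W_5 = 3, \dots, 9$, as in Table 7.10. The number of configurations grows as $\binom{n}{n_1}$, so enumeration is only practical for small $n$.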
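The statistic $W_n$ in WSuRT is an example of a linear rank statistic (see section on Properties of Ranks) for which the normal approximation holds,

$$W_n \sim N\!\left(\frac{n_1(n+1)}{2},\ \frac{n_1 n_2 (n+1)}{12}\right).$$

A better approximation is

$$P(W_n \le w) \approx \Phi(z) + \phi(z)(z^3 - 3z)\,\frac{n_1^2 + n_2^2 + n_1 n_2 + n}{20\, n_1 n_2 (n+1)},$$

where $\phi(z)$ and $\Phi(\cdot)$ are the PDF and CDF of a standard normal distribution and $z = (w - \mathbb{E}(W_n) + 0.5)/\sqrt{\mathrm{Var}(W_n)}$. This approximation is satisfactory for $n_1 > 5$ and $n_2 > 5$ if there are no ties.

Ties in the Data: If ties are present, let $t_1, \dots, t_k$ be the numbers of observations in each of the $k$ groups of equal values in the combined sample. The adjustment for ties is needed only in $\mathrm{Var}(W_n)$, because $\mathbb{E}(W_n)$ does not change. The variance decreases to

$$\mathrm{Var}(W_n) = \frac{n_1 n_2 (n+1)}{12} - \frac{n_1 n_2 \sum_{i=1}^{k} (t_i^3 - t_i)}{12\, n (n-1)}. \qquad (7.4)$$

For a proof of (7.4) and more details, see Lehmann (1998).

Example 7.8 Let the combined sample be {2, 2, 3, 4, 4, 4, 5}, where the first sample contributes one of the 2s, the 3, and one of the 4s. Then $n = 7$, $n_1 = 3$, $n_2 = 4$, and the ranks are (1.5, 1.5, 3, 5, 5, 5, 7). The statistic $w = 1.5 + 3 + 5 = 9.5$ has mean $\mathbb{E}(W_n) = n_1(n+1)/2 = 12$. To adjust the variance for the ties first note that there are $k = 4$ different groups of observations, with $t_1 = 2$, $t_2 = 1$, $t_3 = 3$, and $t_4 = 1$. With $t_i = 1$, $t_i^3 - t_i = 0$, so only the values of $t_i > 1$ (genuine ties) contribute to the adjusting factor in the variance. In this case,

$$\mathrm{Var}(W_7) = \frac{3 \cdot 4 \cdot 8}{12} - \frac{3 \cdot 4 \cdot ((8 - 2) + (27 - 3))}{12 \cdot 7 \cdot 6} = 8 - 0.7143 = 7.2857.$$

7.6 MANN-WHITNEY U TEST

Like the Wilcoxon test above, the Mann-Whitney test is applied to find differences in two populations, and does not assume that the populations are normally distributed. However, if we extend the method to tests involving population means (instead of just $\mathbb{E}(D_{ij}) = P(Y < X)$) [...]

>> [w, z, p] = wmw([1 2 3 4 5], [2 4 2 11 1], 0)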
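w =
    27
z =
   -0.1057
p =
    0.8740

7.7 TEST OF VARIANCES

Compared to parametric tests of the mean, statistical tests on population variances based on the assumption of normally distributed populations are less robust. That is, the parametric tests for variances are known to perform quite poorly if the normal assumptions are wrong.

Suppose we have two populations with CDFs $F$ and $G$, and we collect random samples $X_1, \dots, X_{n_1} \sim F$ and $Y_1, \dots, Y_{n_2} \sim G$ (the same set-up used in the Mann-Whitney test). This time, our null hypothesis is

$$H_0: \sigma_X^2 = \sigma_Y^2$$

versus one of three alternative hypotheses ($H_1$): $\sigma_X^2 \ne \sigma_Y^2$, $\sigma_X^2 < \sigma_Y^2$, $\sigma_X^2 > \sigma_Y^2$. If $\bar x$ and $\bar y$ are the respective sample means, the test statistic is based on

$R(x_i)$ = rank of $(x_i - \bar x)^2$ among all $n = n_1 + n_2$ squared differences
$R(y_i)$ = rank of $(y_i - \bar y)^2$ among all $n = n_1 + n_2$ squared differences

with test statistic

$$T = \sum_{i=1}^{n_1} R(x_i)^2.$$

Assumption: The measurement scale needs to be interval (at least).

Ties in the Data: If there are ties in the data, it is better to use

$$T^* = \frac{T - n_1 \overline{R^2}}{\sqrt{\dfrac{n_1 n_2}{n(n-1)} \sum_{i=1}^{n} R_i^4 - \dfrac{n_1 n_2}{n-1}\left(\overline{R^2}\right)^2}},$$

where

$$\overline{R^2} = \frac{1}{n}\left[\sum_{i=1}^{n_1} R(x_i)^2 + \sum_{i=1}^{n_2} R(y_i)^2\right]$$

and

$$\sum_{i=1}^{n} R_i^4 = \sum_{i=1}^{n_1} R(x_i)^4 + \sum_{i=1}^{n_2} R(y_i)^4.$$

The critical region for the test corresponds to the direction of the alternative hypothesis. This is called the Conover test of equal variances, and tabled quantiles for the null distribution of $T$ can be found in Conover and Iman (1978). If we have larger samples ($n_1 \ge 10$, $n_2 \ge 10$), the following normal approximation for $T$ can be used:

$$T \sim N(\mu_T, \sigma_T^2), \quad \text{with } \mu_T = \frac{n_1(n+1)(2n+1)}{6}, \quad \sigma_T^2 = \frac{n_1 n_2 (n+1)(2n+1)(8n+11)}{180}.$$

For example, with an $\alpha$-level test, if $H_1: \sigma_X^2 > \sigma_Y^2$, we reject $H_0$ if $z_0 = (T - \mu_T)/\sigma_T > z_\alpha$, where $z_\alpha$ is the $1 - \alpha$ quantile of the normal distribution. The test for three or more variances is discussed in Chapter 8, after the Kruskal-Wallis test for testing differences in three or more population medians.

Use the MATLAB function SquaredRanksTest(x,y,p,side,data) for the test of two variances, where x and y are the samples, p is the sought-after quantile from the null distribution of $T$, side = 1 for the test of $H_1: \sigma_X^2 > \sigma_Y^2$ (use p/2 for the two-sided test), side = -1 for the test of $H_1: \sigma_X^2 < \sigma_Y^2$ [...]

> data = [1 3 4 3 4 5 4 4 4 6 5];
> belong = [1 1 1 2 2 2 3 3 3 3 3];
> [H, p] = kruskal_wallis(data, belong)
[H, p] =
    3.8923    0.1428

Example 8.1 The following data are from a classic agricultural experiment measuring crop yield in four different plots. For simplicity, we identify the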
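treatment (plot) using the integers {1, 2, 3, 4}. The third treatment mean measures far above the rest, and the null hypothesis (the treatment means are equal) is rejected with a p-value less than 0.0002.

> data = [83 91 94 89 89 96 91 92 90 84 91 90 81 83 84 83 ...
          88 91 89 101 100 91 93 96 95 94 81 78 82 81 77 79 81 80];
> belong = [1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 ...
            3 3 3 3 3 3 3 3 4 4 4 4 4 4 4];
>> [H, p] = kruskal_wallis(data, belong)
H =
   20.3371
p =
   1.4451e-004

Kruskal-Wallis Pairwise Comparisons. If the KW test detects treatment differences, we can determine if two particular treatment groups (say $i$ and $j$) are different at level $\alpha$ if

$$\left|\frac{R_i}{n_i} - \frac{R_j}{n_j}\right| > t_{n-k,\,1-\alpha/2}\sqrt{\frac{S^2(n - 1 - H')}{n - k}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}.$$

Example 8.2 We decided the four crop treatments were statistically different, and it would be natural to find out which ones seem better and which ones seem worse. In the table below, we compute the statistic

$$\frac{\left|\dfrac{R_i}{n_i} - \dfrac{R_j}{n_j}\right|}{\sqrt{\dfrac{S^2(n - 1 - H')}{n - k}\left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}}$$

for every combination of $1 \le i \ne j \le 4$, and compare it to $t_{30, 0.975} = 2.042$.

       1       2       3       4
1      0       1.856   1.859   5.169
2      1.856   0       3.570   3.363
3      1.859   3.570   0       6.626
4      5.169   3.363   6.626   0

This shows that the third treatment is the best, but not significantly different from the first treatment, which is second best. Treatment 2, which is third best, is not significantly different from Treatment 1, but is different from Treatment 4 and Treatment 3.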
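The numbers in Examples 8.1 and 8.2 can be retraced without the kruskal_wallis m-file. The following is a minimal sketch that assumes only base MATLAB; the variable names are ours, and we assume the tie-adjusted form $H' = \frac{1}{S^2}\left[\sum_i R_i^2/n_i - n(n+1)^2/4\right]$ with $S^2 = \frac{1}{n-1}\left[\sum r(X_{ij})^2 - n(n+1)^2/4\right]$, which may be organized differently from the book's own implementation. The chi-square tail probability is obtained from the regularized incomplete gamma function gammainc.

```matlab
% Kruskal-Wallis statistic with tie adjustment and pairwise statistics.
% Illustrative sketch; data and group labels are those of Example 8.1.
data = [83 91 94 89 89 96 91 92 90 84 91 90 81 83 84 83 ...
        88 91 89 101 100 91 93 96 95 94 81 78 82 81 77 79 81 80];
belong = [ones(1,10) 2*ones(1,9) 3*ones(1,8) 4*ones(1,7)];

n = numel(data);
k = max(belong);

% Midranks: tied observations receive the average of the ranks they occupy
r = zeros(1, n);
for i = 1:n
    r(i) = sum(data < data(i)) + (sum(data == data(i)) + 1) / 2;
end

% Rank sums and sample sizes per treatment
R  = zeros(1, k);
ni = zeros(1, k);
for j = 1:k
    R(j)  = sum(r(belong == j));
    ni(j) = sum(belong == j);
end

% Tie-adjusted statistic H'
c  = n * (n + 1)^2 / 4;
S2 = (sum(r.^2) - c) / (n - 1);
H  = (sum(R.^2 ./ ni) - c) / S2;

% Upper tail of chi-square with k-1 degrees of freedom
p = gammainc(H / 2, (k - 1) / 2, 'upper');

% Pairwise statistics, to be compared with the quantile t_{n-k, 1-alpha/2}
Tpair = zeros(k);
for i = 1:k
    for j = 1:k
        if i ~= j
            se = sqrt(S2 * (n - 1 - H) / (n - k) * (1/ni(i) + 1/ni(j)));
            Tpair(i, j) = abs(R(i)/ni(i) - R(j)/ni(j)) / se;
        end
    end
end

fprintf('H = %.4f, p = %.4e\n', H, p);
disp(Tpair);
```

Run on the crop data, this should give $H' \approx 20.34$ and a (1,2) entry of about 1.856, in line with the output of Example 8.1 and the table of Example 8.2. The t quantile is left as a table lookup, since computing it would need a toolbox function.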
8.2 FRIEDMAN TEST

The Friedman Test is a nonparametric alternative to the randomized block design (RBD) in regular ANOVA. It replaces the RBD when the assumptions of normality are in question or when variances are possibly different from population to population. This test uses the ranks of the data rather than their raw values to calculate the test statistic. Because the Friedman test does not make distribution assumptions, it is not as powerful as the standard test if the populations are indeed normal.

Fig. 8.2 Milton Friedman (1912-2006)

Milton Friedman published the first results for this test, which was eventually named after him. He received the Nobel Prize for Economics in 1976 and one of the listed breakthrough publications was his article "The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance", published in 1937.

Recall that the RBD design requires repeated measures for each block at each level of treatment. Let $X_{ij}$ represent the experimental outcome of subject (or "block") $i$ with treatment $j$, where $i = 1, \dots, b$, and $j = 1, \dots, k$.

                Treatments
Blocks    1       2       ...     k
1         X_11    X_12    ...     X_1k
2         X_21    X_22    ...     X_2k
...
b         X_b1    X_b2    ...     X_bk

To form the test statistic, we assign ranks {1, 2, ..., k} to each row in the table of observations. Thus the expected rank of any observation under $H_0$ is $(k+1)/2$. We next sum all the ranks by columns (by treatments) to obtain $R_j = \sum_{i=1}^{b} r(X_{ij})$, $1 \le j \le k$. If $H_0$ is true, the expected value for $R_j$ is
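$\mathbb{E}(R_j) = b(k+1)/2$. The statistic

$$\sum_{j=1}^{k}\left(R_j - \frac{b(k+1)}{2}\right)^2$$

is an intuitive formula to reveal treatment differences. It has expectation $bk(k^2 - 1)/12$ and variance $k^2 b(b-1)(k-1)(k+1)^2/72$. Once normalized to

$$S = \frac{12}{bk(k+1)}\sum_{j=1}^{k}\left(R_j - \frac{b(k+1)}{2}\right)^2,$$

it has moments $\mathbb{E}(S) = k - 1$ and $\mathrm{Var}(S) = 2(k-1)(b-1)/b \approx 2(k-1)$, which coincide with the first two moments of $\chi^2_{k-1}$. Higher moments of $S$ also approximate well those of $\chi^2_{k-1}$ when $b$ is large.

In the case of ties, a modification to $S$ is needed. Let $C = bk(k+1)^2/4$ and $R^* = \sum_{i=1}^{b}\sum_{j=1}^{k} r(X_{ij})^2$. Then,

$$S' = \frac{(k-1)\left(\sum_{j=1}^{k} R_j^2 - bC\right)}{R^* - C}$$

is also approximately distributed as $\chi^2_{k-1}$.

Although the Friedman statistic makes for a sensible, intuitive test, it turns out there is a better one to use. As an alternative to $S$ (or $S'$), the test statistic

$$F = \frac{(b-1)S}{b(k-1) - S}$$

is approximately distributed as $F_{k-1,(b-1)(k-1)}$, and tests based on this approximation are generally superior to those based on chi-square tests that use $S$. For details on the comparison between $S$ and $F$, see Iman and Davenport (1980).

Example 8.3 In an evaluation of vehicle performance, six professional drivers (labelled I, II, III, IV, V, VI) evaluated three cars (A, B, and C) in a randomized order. Their grades concern only the performance of the vehicles and supposedly are not influenced by the vehicle brand name or similar exogenous information. Here are their rankings on the scale 1-10:

Car    I    II   III   IV   V    VI
A      7    6    6     7    7    8
B      8    10   8     9    10   8
C      9    7    8     8    9    9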
To use the MATLAB procedure friedman(data), the rows of the input matrix represent blocks (drivers) and the columns represent treatments (cars).

> data = [7 8 9; 6 10 7; 6 8 8; ...
          7 9 8; 7 10 9; 8 8 9];
> [S, F, pS, pF] = friedman(data)
S =
    8.2727
F =
   11.0976
pS =
    0.0160
pF =
    0.0029    % this p-value is more reliable

Friedman Pairwise Comparisons. If the p-value is small enough to warrant multiple comparisons of treatments, we consider two treatments $i$ and $j$ to be different at level $\alpha$ if

$$|R_i - R_j| > t_{(b-1)(k-1),\,1-\alpha/2}\sqrt{\frac{2\left(bR^* - \sum_{j=1}^{k} R_j^2\right)}{(b-1)(k-1)}}.$$

Example 8.4 From Example 8.3, the three cars (A, B, C) are considered significantly different at test level $\alpha = 0.01$ (if we use the F-statistic). We can use the MATLAB procedure friedman_pairwise_comparison(x,i,j,a) to make a pairwise comparison between treatment i and treatment j at level a. The output = 1 if the treatments i and j are different, otherwise it is 0. The Friedman pairwise comparison reveals that car A is rated significantly lower than both car B and car C, but car B and car C are not considered to be different.

An alternative test for $k$ matched populations is the test by Quade (1966), which is an extension of the Wilcoxon signed-rank test. In general, the Quade test performs no better than Friedman's test, but slightly better in the case $k = 3$. For that reason, we reference it but will not go over it in any detail.
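The statistics reported for Example 8.3 can be retraced with a few lines of base MATLAB. The sketch below is not the friedman m-file itself; the variable names are ours, and we assume the tie-adjusted form $S' = (k-1)\left(\sum_j R_j^2 - bC\right)/(R^* - C)$ together with $F = (b-1)S/(b(k-1) - S)$. The tail probabilities use gammainc (chi-square) and betainc (F distribution), so no toolbox is assumed.

```matlab
% Friedman statistic (tie-adjusted) and its F-version.
% Illustrative sketch; ratings are those of Example 8.3:
% rows are drivers (blocks), columns are cars A, B, C (treatments).
data = [7 8 9; 6 10 7; 6 8 8; 7 9 8; 7 10 9; 8 8 9];
[b, k] = size(data);

% Rank within each row, averaging tied ranks
r = zeros(b, k);
for i = 1:b
    for j = 1:k
        r(i, j) = sum(data(i, :) < data(i, j)) + ...
                  (sum(data(i, :) == data(i, j)) + 1) / 2;
    end
end

Rj    = sum(r, 1);               % rank sums per treatment
C     = b * k * (k + 1)^2 / 4;
Rstar = sum(r(:).^2);

S = (k - 1) * (sum(Rj.^2) - b * C) / (Rstar - C);   % tie-adjusted S'
F = (b - 1) * S / (b * (k - 1) - S);

d1 = k - 1;
d2 = (b - 1) * (k - 1);
pS = gammainc(S / 2, d1 / 2, 'upper');              % P(chi2_{d1} > S)
pF = betainc(d2 / (d2 + d1 * F), d2 / 2, d1 / 2);   % P(F_{d1,d2} > F)

fprintf('S = %.4f, F = %.4f, pS = %.4f, pF = %.4f\n', S, F, pS, pF);
```

For the car ratings the rank sums are 6.5, 15, and 14.5, and the printed values should agree with the output above (S = 8.2727, F = 11.0976, pS = 0.0160, pF = 0.0029).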
8.3 VARIANCE TEST FOR SEVERAL POPULATIONS

In the last chapter, the test for variances from two populations was achieved with the nonparametric Conover Test. In this section, the test is extended to three or more populations using a set-up similar to that of the Kruskal-Wallis test. For the hypotheses

$H_0$: $k$ variances are equal versus $H_1$: some of the variances are different,

let $n_i$ = the number of observations sampled from each population and $X_{ij}$ is the $j$th observation from population $i$. We denote the following:

- $n = n_1 + \dots + n_k$
- $\bar x_i$ = sample average for $i$th population
- $R(x_{ij})$ = rank of $(x_{ij} - \bar x_i)^2$ among $n$ items
- $T_i = \sum_{j=1}^{n_i} R(x_{ij})^2$
- $\bar T = \frac{1}{n}\sum_{j=1}^{k} T_j$
- $V_T = \frac{1}{n-1}\left(\sum_i \sum_j R(x_{ij})^4 - n\bar T^2\right)$

Then the test statistic is

$$T = \frac{\sum_{j=1}^{k} T_j^2/n_j - n\bar T^2}{V_T}.$$

Under $H_0$, $T$ has an approximate $\chi^2$ distribution with $k - 1$ degrees of freedom, so we can test for equal variances at level $\alpha$ by rejecting $H_0$ if $T > \chi^2_{k-1}(1 - \alpha)$. Conover (1999) notes that the asymptotic relative efficiency, relative to the regular test for different variances, is 0.76 (when the data are actually distributed normally). If the data are distributed as double-exponential, the A.R.E. is over 1.08.

Example 8.5 For the crop data in Example 8.1, we can apply the variance test and obtain $n = 34$, $T_1 = 3845$, $T_2 = 4631$, $T_3 = 4032$, $T_4 = 1174.5$, and $\bar T = 402.51$. The variance term

$$V_T = \left(\sum_i \sum_j R(x_{ij})^4 - 34(402.51)^2\right)/33 = 129{,}090$$

leads to the test statistic

$$T = \frac{\sum_{j=1}^{4} (T_j^2/n_j) - 34(402.51)^2}{V_T} = 4.5086.$$

Using the approximation that $T \sim \chi^2_3$ under the null hypothesis of equal variances, the p-value associated with this test is $P(T > 4.5086) = 0.2115$. There is no strong evidence to conclude the underlying variances for crop yields are significantly different.
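The intermediate quantities of Example 8.5 can be retraced with the following sketch, which assumes only base MATLAB and uses our own variable names. It ranks the absolute deviations from each group mean, which yields the same ranks as ranking the squared deviations, and then forms $T_j$, $\bar T$, $V_T$ and $T$ as written in Example 8.5.

```matlab
% Squared ranks test for equality of k variances.
% Illustrative sketch; data and group labels are those of Example 8.1.
data = [83 91 94 89 89 96 91 92 90 84 91 90 81 83 84 83 ...
        88 91 89 101 100 91 93 96 95 94 81 78 82 81 77 79 81 80];
belong = [ones(1,10) 2*ones(1,9) 3*ones(1,8) 4*ones(1,7)];

n = numel(data);
k = max(belong);

% Absolute deviation of each observation from its own group mean
d  = zeros(1, n);
nj = zeros(1, k);
for j = 1:k
    idx    = (belong == j);
    nj(j)  = sum(idx);
    d(idx) = abs(data(idx) - mean(data(idx)));
end

% Midranks of the deviations over the combined sample
r = zeros(1, n);
for i = 1:n
    r(i) = sum(d < d(i)) + (sum(d == d(i)) + 1) / 2;
end

% Sums of squared ranks per group, their overall average, and V_T
Tj = zeros(1, k);
for j = 1:k
    Tj(j) = sum(r(belong == j).^2);
end
Tbar = sum(Tj) / n;
VT   = (sum(r.^4) - n * Tbar^2) / (n - 1);

T = (sum(Tj.^2 ./ nj) - n * Tbar^2) / VT;
p = gammainc(T / 2, (k - 1) / 2, 'upper');   % P(chi2_{k-1} > T)

disp(Tj);
fprintf('Tbar = %.2f, VT = %.0f, T = %.4f, p = %.4f\n', Tbar, VT, T, p);
```

The group sums $T_j$ printed for the crop data should be 3845, 4631, 4032, and 1174.5, as in Example 8.5. The final statistic depends on how the intermediate average $\bar T$ is rounded, so it can differ from the value in the example in the second decimal place. Ties are detected here by exact floating-point equality of the deviations, which is adequate for this data set but may need a tolerance in general.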
Multiple Comparisons. If $H_0$ is rejected, we can determine which populations have unequal variances using the following paired comparisons:

$$\left|\frac{T_i}{n_i} - \frac{T_j}{n_j}\right| > t_{n-k}(1 - \alpha/2)\sqrt{V_T\,\frac{n - 1 - T}{n - k}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)},$$

where $t_{n-k}(\alpha)$ is the $\alpha$ quantile of the t distribution with $n - k$ degrees of freedom. If there are no ties, $\bar T$ and $V_T$ are simple constants: $\bar T = (n+1)(2n+1)/6$ and $V_T = n(n+1)(2n+1)(8n+11)/180$.

8.4 EXERCISES

8.1. Show that, when ties are not present, the Kruskal-Wallis statistic $H'$ in (8.2) coincides with $H$ in (8.3).

8.2. Generate three samples of size 10 from an exponential distribution with $\lambda = 0.10$. Perform both the F-test and the Kruskal-Wallis test to see if there are treatment differences in the three groups. Repeat this 1000 times, recording the p-value for both tests. Compare the simulation results by comparing the two histograms made from these p-values. What do the results mean?

8.3. The data set Hypnosis contains data from a study investigating whether hypnosis has the same effect on skin potential (measured in millivolts) for four emotions (Lehmann, p. 264). Eight subjects are asked to display fear, joy, sadness, and calmness under hypnosis. The data are recorded as one observation per subject for each emotion.

Subject   fear   joy    sadness   calmness
1         23.1   22.7   22.5      22.6
2         57.6   53.2   53.7      53.1
3         10.5   9.7    10.8      8.3
4         23.6   19.6   21.1      21.6
5         11.9   13.8   13.7      13.3
6         54.6   47.1   39.2      37.0
7         21.0   13.6   13.7      14.8
8         20.3   23.6   16.3      14.8

8.4. The points-per-game statistics from the 1993 NBA season were analyzed for basketball players who went to college in four particular ACC schools: Duke, North Carolina, North Carolina State, and Georgia Tech. We want to find out if scoring is different for the players from different schools. Can this be analyzed with a parametric procedure? Why or why not? The classical F-test that assumes normality of the populations yields F = 0.41 and $H_0$ is not rejected. What about the nonparametric procedure?
Duke   UNC    NCSU   GT
7.5    5.5    16.9   7.9
8.7    6.2    4.5    7.8
7.1    13.0   10.5   14.5
18.2   9.7    4.4    6.1
12.9   4.6   4.0   5.9   18.7   14.0   1.9   8.7   15.8

8.5. Some varieties of nematodes (roundworms that live in the soil and are frequently so small they are invisible to the naked eye) feed on the roots of lawn grasses and crops such as strawberries and tomatoes. This pest, which is particularly troublesome in warm climates, can be treated by the application of nematocides. However, because of the size of the worms, it is difficult to measure the effectiveness of these pesticides directly. To compare four nematocides, the yields of equal-size plots of one variety of tomatoes were collected. The data (yields in pounds per plot) are shown in the table. Use a nonparametric test to find out which nematocides are different.

Nematocide A   Nematocide B   Nematocide C   Nematocide D
18.6           18.7           19.4           19.0
18.4           19.0           18.9           18.8
18.4           18.9           19.5           18.6
18.5           18.5           19.1           18.7
17.9   18.5

8.6. An experiment was run to determine whether four specific firing temperatures affect the density of a certain type of brick. The experiment led to the following data. Does the firing temperature affect the density of the bricks?

Temperature   Density
100           21.8   21.9   21.7   21.7   21.6   21.7
125           21.7   21.4   21.5   21.4
150           21.9   21.8   21.8   21.8   21.6   21.5
175           21.9   21.7   21.8   21.4

8.7. A chemist wishes to test the effect of four chemical agents on the strength of a particular type of cloth. Because there might be variability from one bolt to another, the chemist decides to use a randomized block design,
with the bolts of cloth considered as blocks. She selects five bolts and applies all four chemicals in random order to each bolt. The resulting tensile strengths follow. How do the effects of the chemical agents differ?

            Bolt    Bolt    Bolt    Bolt    Bolt
Chemical    No. 1   No. 2   No. 3   No. 4   No. 5
1           73      68      74      71      67
2           73      67      75      72      70
3           75      68      78      73      68
4           73      71      75      75      69

8.8. The venerable auction house of Snootly & Snobs will soon be putting three fine 17th- and 18th-century violins, A, B, and C, up for bidding. A certain musical arts foundation, wishing to determine which of these instruments to add to its collection, arranges to have them played by each of 10 concert violinists. The players are blindfolded, so that they cannot tell which violin is which; and each plays the violins in a randomly determined sequence (BCA, ACB, etc.).

The violinists are not informed that the instruments are classic masterworks; all they know is that they are playing three different violins. After each violin is played, the player rates the instrument on a 10-point scale of overall excellence (1 = lowest, 10 = highest). The players are told that they can also give fractional ratings, such as 6.2 or 4.5, if they wish. The results are shown in the table below. For the sake of consistency, the n = 10 players are listed as "subjects."

                        Subject
Violin   1    2     3    4     5     6     7    8     9     10
A        9    9.5   5    7.5   9.5   7.5   8    7     8.5   6
B        7    6.5   7    7.5   5     8     6    6.5   7     7
C        6    8     4    6     7     6.5   6    4     6.5   3

8.9. From Exercise 8.5, test to see if the underlying variances for the four plot yields are the same. Use a test level of $\alpha = 0.05$.
REFERENCES

Friedman, M. (1937), "The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance," Journal of the American Statistical Association, 32, 675-701.
Iman, R. L., and Davenport, J. M. (1980), "Approximations of the Critical Region of the Friedman Statistic," Communications in Statistics A: Theory and Methods, 9, 571-595.
Kruskal, W. H. (1952), "A Nonparametric Test for the Several Sample Problem," Annals of Mathematical Statistics, 23, 525-540.
Kruskal, W. H., and Wallis, W. A. (1952), "Use of Ranks in One-Criterion Variance Analysis," Journal of the American Statistical Association, 47, 583-621.
Lehmann, E. L. (1975), Testing Statistical Hypotheses, New York: Wiley.
Quade, D. (1966), "On the Analysis of Variance for the k-sample Population," Annals of Mathematical Statistics, 37, 1747-1785.

9 Categorical Data

Statistically speaking, U.S. soldiers have less of a chance of dying from all causes in Iraq than citizens have of being murdered in California, which is roughly the same geographical size. California has more than 2300 homicides each year, which means about 6.6 murders each day. Meanwhile, U.S. troops have been in Iraq for 160 days, which means they're incurring about 1.7 deaths, including illness and accidents each day.1
Brit Hume, Fox News, August 2003.

1 By not taking the total population of each group into account, Hume failed to note the relative risk of death (Section 9.2) to a soldier in Iraq was 65 times higher than the murder rate in California.

A categorical variable is a variable which is nominal or ordinal in scale. Ordinal variables have more information than nominal ones because their levels can be ordered. For example, an automobile could be categorized in an ordinal scale (compact, mid-size, large) or a nominal scale (Honda, Buick, Audi). As opposed to interval data, which are quantitative, nominal data are qualitative, so comparisons between the variables cannot be described mathematically. Ordinal variables are more useful than nominal ones because they can possibly be ranked, yet they are not quite quantitative. Categorical data analysis is seemingly ubiquitous in statistical practice, and we encourage readers who are interested in a more comprehensive coverage to consult monographs by
Agresti (1996) and Simonoff (2003).

At the turn of the 19th century, while probabilists in Russia, France and other parts of the world were hastening the development of statistical theory through probability, British academics made great methodological developments in statistics through applications in the biological sciences. This was due in part to the surge of research following Charles Darwin's publication of The Origin of Species in 1859. Darwin's theories helped to catalyze research in the variations of traits within species, and this strongly affected the growth of applied statistics and biometrics. Soon after, Gregor Mendel's previous findings in genetics (published more than a generation earlier) were "rediscovered" in light of these new theories of evolution.

Fig. 9.1 Charles Darwin (1809-1882), Gregor Mendel (1822-1884)

When it comes to the development of statistical methods, two individuals are dominant from this era: Karl Pearson and R. A. Fisher. Both were cantankerous researchers influenced by William S. Gosset, the man who derived the (Student's) t distribution. Karl Pearson, in particular, contributed seminal results to the study of categorical data, including the chi-square test of statistical significance (Pearson, 1900). Fisher used Mendel's theories as a framework for the research of biological inheritance.2

2 Actually, Fisher showed statistically that Mendel's data were probably fudged a little in order to support the theory for his new genetic model. See Section 9.2.

Both researchers were motivated by problems in heredity, and both played an interesting role in its promotion. Fisher, an upper-class British conservative and intellectual, theorized the decline of western civilization due to the diminished fertility of the upper classes. Pearson, his rival, was a staunch socialist, yet ironically advocated a "war on inferior races", which he often associated with the working class. Pearson said, "no degenerate and feeble stock will ever be converted into
healthy and sound stock by the accumulated effects of education, good laws and sanitary surroundings." Although their research was undoubtedly brilliant, racial bigotry strongly prevailed in western society during this colonial period, and scientists were hardly exceptional in this regard.

Fig. 9.2 Karl Pearson (1857-1936), William Sealy Gosset (a.k.a. Student) (1876-1937), and Ronald Fisher (1890-1962)

9.1 CHI-SQUARE AND GOODNESS-OF-FIT

Pearson's chi-square statistic found immediate applications in biometry, genetics and other life sciences. It is introduced in the most rudimentary science courses. For instance, if you are at a party and you meet a college graduate of the social sciences, it's likely one of the few things they remember about the required statistics class they suffered through in college is the term "chi-square".

To motivate the chi-square statistic, let $X_1, X_2, \dots, X_n$ be a sample from any distribution. As in Chapter 6, we would like to test the goodness-of-fit hypothesis $H_0: F_X(x) = F_0(x)$. Let the domain of the distribution $D = (a, b)$ be split into $r$ non-overlapping intervals, $I_1 = (a, x_1]$, $I_2 = (x_1, x_2]$, ..., $I_r = (x_{r-1}, b)$. Such intervals have (theoretical) probabilities $p_1 = F_0(x_1) - F_0(a)$, $p_2 = F_0(x_2) - F_0(x_1)$, ..., $p_r = F_0(b) - F_0(x_{r-1})$, under $H_0$. Let $n_1, n_2, \dots, n_r$ be observed frequencies of intervals $I_1, I_2, \dots, I_r$. In this notation, $n_1$ is the number of elements of the sample $X_1, \dots, X_n$ that fall into the interval $I_1$. Of course, $n_1 + \dots + n_r = n$ because the intervals are a partition of the domain of the sample. The discrepancy between observed frequencies $n_i$ and theoretical frequencies $np_i$ is the rationale for forming the statistic

$$X^2 = \sum_{i=1}^{r} \frac{(n_i - np_i)^2}{np_i} \qquad (9.1)$$

that has, approximately for large $n$, a chi-square ($\chi^2$) distribution with $r - 1$ degrees of freedom. Large values of $X^2$ are critical for $H_0$. Alternative representations include

$$X^2 = \sum_{i=1}^{r} \frac{n_i^2}{np_i} - n \quad \text{and} \quad X^2 = n\left[\sum_{i=1}^{r} \left(\frac{\hat{p}_i}{p_i}\right)\hat{p}_i - 1\right],$$

where $\hat{p}_i = n_i/n$.

In some experiments, the distribution under $H_0$ cannot be fully specified; for example, one might conjecture the data are generated from a normal distribution without knowing the exact values of $\mu$ or $\sigma^2$. In this case, the unknown parameters are estimated using the sample. Suppose that $k$ parameters are estimated in order to fully specify $F_0$. Then, the resulting statistic in (9.1) has a $\chi^2$ distribution with $r - k - 1$ degrees of freedom. A degree of freedom is lost with the estimation of a parameter. In fairness, if we estimated a parameter and then inserted it into the hypothesis without further acknowledgment, the hypothesis will undoubtedly fit the data at least as well as any alternative hypothesis we could construct with a known parameter. So the lost degree of freedom represents a form of handicapping.

There is no orthodoxy in selecting the categories or even the number of categories to use. If possible, make the categories approximately equal in probability. Practitioners may want to arrange interval selection so that all $np_i > 1$ and that at least 80% of the $np_i$'s exceed 5. The rule-of-thumb is: $n \ge 10$, $r \ge 3$, $n^2/r \ge 10$, $np_i \ge 0.25$.

As mentioned in Chapter 6, the chi-square test is not altogether efficient for testing known continuous distributions, especially compared to individualized tests such as Shapiro-Wilk or Anderson-Darling. Its advantage is manifest with discrete data and special distributions that cannot be fit in a Kolmogorov-type statistical test.

Example 9.1 Mendel's Data. In 1865, Mendel discovered a basic genetic code by breeding green and yellow peas in an experiment. Because the yellow pea gene is dominant, the first generation hybrids all appeared yellow, but the second generation hybrids were about 75% yellow and 25% green. The green color reappears in the second generation because there is a 25% chance that two peas, both having a yellow and green gene, will contribute the green gene to the next hybrid seed. In another pea experiment3 that considered both color and texture traits, the outcomes from repeated experiments came out as in Table 9.11.

3 See Section 16.1 for more detail on probability models in basic genetics.
Table 9.11 Mendel's Data

Type of Pea        Observed Number   Expected Number
Smooth Yellow      315               313
Wrinkled Yellow    101               104
Smooth Green       108               104
Wrinkled Green     32                35

The statistical analysis shows a strong agreement with the hypothesized outcome with a p-value of 0.9166. While this, by itself, is not sufficient proof to consider foul play, Fisher noted this kind of result in a sequence of several experiments. His "meta-analysis" (see Chapter 6) revealed a p-value around 0.00013.

>> o = [315 101 108 32];
>> th = [313 104 104 35];
>> sum((o-th).^2 ./ th)
ans = 0.5103
>> 1 - chi2cdf(0.5103, 4-1)
ans = 0.9166

Example 9.2 Horse-Kick Fatalities. During the latter part of the nineteenth century, Prussian officials collected information on the hazards that horses posed to cavalry soldiers. A total of 10 cavalry corps were monitored over a period of 20 years. Recorded for each year and each corps was $X$, the number of fatalities due to kicks. Table 9.12 shows the distribution of $X$ for these 200 "corps-years". Altogether there were 122 fatalities (109(0) + 65(1) + 22(2) + 3(3) + 1(4)), meaning that the observed fatality rate was 122/200 = 0.61 fatalities per corps-year. A Poisson model for $X$ with a mean of $\mu = 0.61$ was proposed by von Bortkiewicz (1898). Table 9.12 shows the expected frequency corresponding to $x = 0, 1, \dots$, etc., assuming the Poisson model for $X$ was correct. The agreement between the observed and the expected frequencies is remarkable. The MATLAB procedure below shows that the resulting $X^2$ statistic = 0.6104. If the Poisson distribution is correct, the statistic is distributed $\chi^2$ with 3 degrees of freedom, so the p-value is computed $P(W > 0.6104) = 0.8940$.

>> o = [109 65 22 3 1];
>> th = [108.7 66.3 20.2 4.1 0.7];
>> sum((o-th).^2 ./ th)
ans = 0.6104
>> 1 - chi2cdf(0.6104, 5-1-1)
ans = 0.8940

Table 9.12 Horse-kick fatalities data

      Observed Number    Expected Number
x     of Corps-Years     of Corps-Years
0     109                108.7
1     65                 66.3
2     22                 20.2
3     3                  4.1
4     1                  0.7
      200                200

Example 9.3 Benford's Law. Benford's law (Benford, 1938; Hill, 1998) concerns relative frequencies of leading digits of various data sets, numerical tables, accounting data, etc. Benford's law, also called the first digit law, states that in numbers from many sources, the leading digit 1 occurs much more often than the others (namely about 30% of the time). Furthermore, the higher the digit, the less likely it is to occur as the leading digit of a number. This applies to figures related to the natural world or of social significance, be it numbers taken from electricity bills, newspaper articles, street addresses, stock prices, population numbers, death rates, areas or lengths of rivers or physical and mathematical constants.

To be precise, Benford's law states that the leading digit $n$ ($n = 1, \dots, 9$) occurs with probability $P(n) = \log_{10}(n+1) - \log_{10}(n)$, approximated to three digits in the table below.

Digit n   1      2      3      4      5      6      7      8      9
P(n)      0.301  0.176  0.125  0.097  0.079  0.067  0.058  0.051  0.046

The table below lists the distribution of the leading digit for all 307 numbers appearing in a particular issue of Reader's Digest. With p-value of 0.8719, the support for $H_0$ (the first digits in Reader's Digest are distributed according to Benford's Law) is strong.
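The three-digit probabilities $P(n)$ tabulated above follow directly from the formula $P(n) = \log_{10}(n+1) - \log_{10}(n)$. A minimal MATLAB sketch of that calculation is given here; the variable names `d`, `P`, `P3` and `total` are illustrative choices, not taken from the text.

```matlab
% Benford probabilities P(n) = log10(n+1) - log10(n) for n = 1,...,9.
% Variable names are illustrative assumptions, not from the text.
d = 1:9;
P = log10(d + 1) - log10(d);

% The terms telescope, so the probabilities sum to log10(10) = 1.
total = sum(P);

% Round to three decimals for comparison with the tabulated values.
P3 = round(P * 1000) / 1000;
disp([d' P3']);
```

Multiplying `P3` by the sample size gives the theoretical frequencies that enter the chi-square statistic.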
Digit   1    2   3   4   5   6   7   8   9
Count   103  57  38  23  20  21  17  15  13

The agreement between the observed digit frequencies and Benford's distribution is good. The MATLAB calculation shows that the resulting $X^2$ statistic is 3.8322. Under $H_0$, $X^2$ is distributed as $\chi^2_8$ and more extreme values of $X^2$ are quite likely. The p-value is almost 90%.

>> x = [103 57 38 23 20 21 17 15 13];
>> e = 307*[0.301 0.176 0.125 0.097 0.079 ...
0.067 0.058 0.051 0.046];
>> sum((x-e).^2 ./ e)
ans = 3.8322
>> 1 - chi2cdf(3.8322, 8)
ans = 0.8719

9.2 CONTINGENCY TABLES: TESTING FOR HOMOGENEITY AND INDEPENDENCE

Suppose there are $m$ populations (more specifically, $m$ levels of factor A: $(R_1, \dots, R_m)$) under consideration. Furthermore, each observation can be classified in different ways, according to another factor B, which has $k$ levels $(C_1, \dots, C_k)$. Let $n_{ij}$ be the number of all observations at the $i$th level of A and $j$th level of B. We seek to find out if the populations (from A) and treatments (from B) are independent. If we treat the levels of A as population groups and the levels of B as treatment groups, there are $n_{i\cdot} = \sum_{j=1}^{k} n_{ij}$ observations in population $i$, where $i = 1, \dots, m$. Each of the treatment groups is represented $n_{\cdot j} = \sum_{i=1}^{m} n_{ij}$ times, and the total number of observations is $n_{1\cdot} + \dots + n_{m\cdot} = n_{\cdot\cdot}$. The following table summarizes the above description.
          C_1      C_2      ...   C_k      Total
R_1       n_11     n_12     ...   n_1k     n_1.
R_2       n_21     n_22     ...   n_2k     n_2.
...
R_m       n_m1     n_m2     ...   n_mk     n_m.
Total     n_.1     n_.2     ...   n_.k     n_..

We are interested in testing independence of factors A and B, represented by their respective levels $R_1, \dots, R_m$ and $C_1, \dots, C_k$, on the basis of observed frequencies $n_{ij}$. Recall the definition of independence of component random variables $X$ and $Y$ in the random vector $(X, Y)$,

$$P(X = x_i, Y = y_j) = P(X = x_i) \cdot P(Y = y_j).$$

Assume that the random variable …

… $b + c \ge 20$, the McNemar statistic is calculated as

$$X^2 = \frac{(b - c)^2}{b + c},$$

which has a $\chi^2$ distribution with 1 degree of freedom. Some authors recommend a version of the McNemar test with a correction for discontinuity, calculated as $X^2 = (|b - c| - 1)^2/(b + c)$, but there is no consensus among experts that this statistic is better. If $b + c < 20$, a simple statistic $T = b$ can be used. If $H_0$ is true, $T \sim Bin(b + c, 1/2)$ and testing is as in the sign-test. In some sense, what the standard two-sample paired t-test is for normally distributed responses, the McNemar test is for paired binary responses.

Example 9.5 A study by Johnson and Johnson (1972) involved 85 patients with Hodgkin's disease. Hodgkin's disease is a cancer of the lymphatic system; it is known also as a lymphoma. Each patient in the study had a sibling who did not have the disease. In 26 of these pairs, both individuals had a tonsillectomy (T). In 37 pairs, neither of the siblings had a tonsillectomy (N). In 15 pairs, only the individual with Hodgkin's had a tonsillectomy and in 7 pairs, only the non-Hodgkin's disease sibling had a tonsillectomy.

             Sibling/T   Sibling/N   Total
Patient/T    26          15          41
Patient/N    7           37          44
Total        33          52          85

The pairs $(X_i, Y_i)$, $i = 1, \dots, 85$ represent siblings, one of which is a patient with Hodgkin's disease ($X$) and the second without the disease ($Y$). Each of the siblings is also classified (as T = 1 or N = 0) with respect to having a tonsillectomy.

         Y = 1   Y = 0
X = 1    26      15
X = 0    7       37

The test we are interested in is based on $H_0: P(X = 1) = P(Y = 1)$, i.e., that the probabilities of siblings having a tonsillectomy are the same with and without the disease. Because $b + c > 20$, the statistic of choice is

$$X^2 = \frac{(b - c)^2}{b + c} = \frac{8^2}{7 + 15} = 2.9091.$$

The p-value is $p = P(W \ge 2.9091) = 0.0881$, where $W \sim \chi^2_1$. Under $H_0$, $T = 15$ is a realization of a binomial $Bin(22, 0.5)$ random variable and the p-value is $2 \cdot P(T \ge 15) = 2 \cdot P(T > 14) = 0.1338$, that is,
>> 2*(1 - binocdf(14, 22, 0.5))
ans = 0.1338

With such a high p-value, there is scant evidence to reject the null hypothesis of homogeneity of the two groups of patients with respect to having a tonsillectomy.

9.5 COCHRAN'S TEST

Cochran's (1950) test is essentially a randomized block design (RBD), as described in Chapter 8, but the responses are dichotomous. That is, each treatment-block combination receives a 0 or 1 response. If there are only two treatments, the experimental outcome is equivalent to McNemar's test with marginal totals equaling the number of blocks. To see this, consider the last example as a collection of dichotomous outcomes: each of the 85 patients is initially classified into two blocks depending on whether the patient had or had not received a tonsillectomy. The response is 0 if the patient's sibling did not have a tonsillectomy and 1 if they did.

Example 9.6 Consider the software debugging data in Table 9.14. Here the software reviewers (A, B, C, D, E) represent five blocks, and the 27 bugs are considered to be treatments. Let the column totals be denoted $\{C_1, \dots, C_5\}$ and denote row totals as $\{R_1, \dots, R_{27}\}$. We are essentially testing $H_0$: treatments (software bugs) have an equal chance of being discovered, versus $H_a$: some software bugs are more prevalent (or easily found) than others. The test statistic is … where $n = \sum C_j = \sum R_i$, $m = 5$ (blocks) and $k = 27$ treatments (software bugs). Under $H_0$, $T_C$ has an approximate chi-square distribution with $m - 1$ degrees of freedom. In this example, $T_C = 17.647$, corresponding to a test p-value of 0.00145.

9.6 MANTEL-HAENSZEL TEST

Suppose that $k$ independent classifications into a 2x2 table are observed. We could denote the $i$th such table by
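
          Column 1       Column 2                   Total
Row 1     x_i            r_i - x_i                  r_i
Row 2     c_i - x_i      n_i - r_i - c_i + x_i      n_i - r_i
Total     c_i            n_i - c_i                  n_i

It is assumed that the marginal totals ($r_i$, $c_i$, or just $n_i$) are fixed in advance and that the sampling was carried out until such fixed marginal totals are satisfied. If each of the $k$ tables represents an independent study of the same classifications, the Mantel-Haenszel Test essentially pools the studies together in a "meta-analysis" that combines all experimental outcomes into a single statistic. For more about non-parametric approaches to this kind of problem, see the section on meta-analysis in Chapter 6.

Table 9.14 Five Reviewers Found 27 Issues in Software Example as in Gilb and Graham (1993)
111110010010101001001110100010101111110110111001011011110000111110100011111101110010000000101000000001000010011001110000101011000000101

Fig. 9.3 Quinn McNemar (1900-1986), William Gemmell Cochran (1909-1980), and Nathan Mantel (1919-2002)

For the $i$th table, $p_{1i}$ is the proportion of subjects from the first row falling in the first column, and likewise, $p_{2i}$ is the proportion of subjects from the 2nd row falling in the first column. The hypothesis of interest here is if the population proportions $p_{1i}$ and $p_{2i}$ coincide over all $k$ experiments.

Suppose that in experiment $i$ there are $n_i$ observations. All items can be categorized as type 1 ($r_i$ of them) or type 2 ($n_i - r_i$ of them). If $c_i$ items are selected from the total of $n_i$ items, the probability that exactly $x_i$ of the selected items are of the type 1 is

$$\frac{\binom{r_i}{x_i}\binom{n_i - r_i}{c_i - x_i}}{\binom{n_i}{c_i}}. \qquad (9.5)$$

Likewise, all items can be categorized as type A ($c_i$ of them) or type B ($n_i - c_i$ of them). If $r_i$ items are selected from the total of $n_i$ items, the probability that exactly $x_i$ of the selected are of the type A is

$$\frac{\binom{c_i}{x_i}\binom{n_i - c_i}{r_i - x_i}}{\binom{n_i}{r_i}}. \qquad (9.6)$$

Of course these two probabilities are equal, i.e.,

$$\frac{\binom{r_i}{x_i}\binom{n_i - r_i}{c_i - x_i}}{\binom{n_i}{c_i}} = \frac{\binom{c_i}{x_i}\binom{n_i - c_i}{r_i - x_i}}{\binom{n_i}{r_i}}.$$

These are hypergeometric probabilities with mean and variance

$$\frac{r_i c_i}{n_i} \quad \text{and} \quad \frac{r_i c_i (n_i - r_i)(n_i - c_i)}{n_i^2 (n_i - 1)},$$

respectively. The $k$ experiments are independent and the statistic

$$T = \frac{\sum_{i=1}^{k} x_i - \sum_{i=1}^{k} \frac{r_i c_i}{n_i}}{\sqrt{\sum_{i=1}^{k} \frac{r_i c_i (n_i - r_i)(n_i - c_i)}{n_i^2 (n_i - 1)}}} \qquad (9.7)$$

is approximately normal (if $n_i$ is large, the distributions of the $x_i$'s are close to binomial and thus the normal approximation holds. In addition, summing over $k$ independent experiments makes the normal approximation more accurate.) Large values of $|T|$ indicate that the proportions change across the $k$ experiments.

Example 9.7 The three 2x2 tables provide classification of people from 3 Chinese cities, Zhengzhou, Taiyuan, and Nanchang with respect to smoking habits and incidence of lung cancer (Liu, 1992).

                     Zhengzhou             Taiyuan               Nanchang
Cancer Diagnosis:    yes   no    total     yes   no    total     yes   no    total
Smoker               182   156   338       60    99    159       104   89    193
Non-Smoker           72    98    170       11    43    54        21    36    57
Total                254   254   508       71    142   213       125   125   250

We can apply the Mantel-Haenszel Test to decide if the proportions of cancer incidence for smokers and non-smokers coincide for the three cities, i.e., $H_0: p_{1i} = p_{2i}$ where $p_{1i}$ is the proportion of incidence of cancer among smokers in the city $i$, and $p_{2i}$ is the proportion of incidence of cancer among nonsmokers in the city $i$, $i = 1, 2, 3$. We use the two-sided alternative, $H_1: p_{1i} \ne p_{2i}$ for some $i \in \{1, 2, 3\}$ and fix the type-I error rate at $\alpha = 0.10$.

From the tables, $\sum_i x_i = 182 + 60 + 104 = 346$. Also, $\sum_i r_i c_i / n_i = 338 \cdot 254/508 + 159 \cdot 71/213 + 193 \cdot 125/250 = 169 + 53 + 96.5 = 318.5$. To compute $T$ in (9.7),

$$\sum_i \frac{r_i c_i (n_i - r_i)(n_i - c_i)}{n_i^2 (n_i - 1)} = \frac{338 \cdot 254 \cdot 170 \cdot 254}{508^2 \cdot 507} + \frac{159 \cdot 71 \cdot 54 \cdot 142}{213^2 \cdot 212} + \frac{193 \cdot 125 \cdot 57 \cdot 125}{250^2 \cdot 249} = 28.33333 + 9 + 11.04518 = 48.37851.$$

Therefore,

$$T = \frac{346 - 318.5}{\sqrt{48.37851}} \approx 3.95.$$

Because $T$ is approximately $N(0, 1)$, the p-value (via MATLAB) is

>> [st, p] = mantel_haenszel([182 156; 72 98; 60 99; 11 43; 104 89; 21 36])
st = 3.9537
p = 7.6944e-005

In this case, there is clear evidence that the cancer proportions for smokers and non-smokers do not coincide across the three cities.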
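The arithmetic of Example 9.7 can be retraced step by step with a short script. The sketch below evaluates the statistic in (9.7) directly from the table margins; the array name `tabs` and its row layout are assumptions made for this illustration, and the two-sided p-value is obtained from the normal approximation through the base MATLAB function `erfc` rather than through the `mantel_haenszel` function called above, whose source is not shown here.

```matlab
% Mantel-Haenszel statistic (9.7) for the three city tables of Example 9.7.
% Each row holds [x r c n]: smokers with cancer, number of smokers,
% number of cancer cases, and table total. The name "tabs" and this
% layout are assumptions for illustration.
tabs = [182 338 254 508;    % Zhengzhou
         60 159  71 213;    % Taiyuan
        104 193 125 250];   % Nanchang
x = tabs(:,1);  r = tabs(:,2);  c = tabs(:,3);  n = tabs(:,4);

m = r .* c ./ n;                                        % hypergeometric means
v = r .* c .* (n - r) .* (n - c) ./ (n.^2 .* (n - 1));  % hypergeometric variances

T = (sum(x) - sum(m)) / sqrt(sum(v));

% Two-sided p-value under the N(0,1) approximation:
% 2*(1 - Phi(|T|)) equals erfc(|T|/sqrt(2)).
p = erfc(abs(T) / sqrt(2));
fprintf('T = %.4f, p = %.4e\n', T, p);
```

The intermediate vectors `m` and `v` correspond to the terms 169, 53, 96.5 and 28.33333, 9, 11.04518 computed by hand in the example, so each step can be checked against the text.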
9.7 CENTRAL LIMIT THEOREM FOR MULTINOMIAL PROBABILITIES

Let $E_1, E_2, \dots, E_r$ be events that have probabilities $p_1, p_2, \dots, p_r$; $\sum_i p_i = 1$. Suppose that in $n$ independent trials the event $E_i$ appears $n_i$ times ($n_1 + \dots + n_r = n$). Consider

$$\zeta^{(n)} = \left( \frac{n_1 - np_1}{\sqrt{np_1}}, \dots, \frac{n_r - np_r}{\sqrt{np_r}} \right)'.$$

The vector $\zeta^{(n)}$ can be represented as

$$\zeta^{(n)} = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} \xi^{(j)},$$

where components $\xi_i^{(j)}$ are given by $p_i^{-1/2}[\mathbf{1}(E_i) - p_i]$, $i = 1, \dots, r$. Vectors $\xi^{(j)}$ are i.i.d., with $E(\xi_i^{(j)}) = p_i^{-1/2}(E\,\mathbf{1}(E_i) - p_i) = 0$, $E(\xi_i^{(j)})^2 = (p_i^{-1})\,p_i(1 - p_i) = 1 - p_i$, and $E(\xi_i^{(j)}$ …

… and $\eta_1 = 0$, w.p. 1; $\eta_2, \dots, \eta_r$ are i.i.d. $N(0, 1)$. The orthogonal transformation preserves the $L_2$ norm, …

9.8 SIMPSON'S PARADOX

Simpson's Paradox is an example of changing the favorability of marginal proportions in a set of contingency tables due to aggregation of classes. In this case the manner of classification can be thought of as a "lurking variable" causing seemingly paradoxical reversal of the inequalities in the marginal proportions when they are aggregated. Mathematically, there is no paradox: the set of vectors cannot be ordered in the traditional fashion.

As an example of Simpson's Paradox, Radelet (1981) investigated the relationship between race and whether criminals (convicted of homicide) receive the death penalty (versus a lesser sentence) for regional Florida court cases during 1976-1977. Out of 326 defendants who were Caucasian or African-American, the table below shows that a higher percentage of Caucasian defendants (11.88%) received a death sentence than African-American defendants (10.24%).

Race of Defendant    Death Penalty   Lesser Sentence
Caucasian            19              141
African-American     17              149
Total                36              290

What the table doesn't show you is the real story behind these statistics. The next 2x2x2 table lists the death sentence frequencies categorized by the defendant's race and the (murder) victim's race. The table above is constructed by aggregating over this new category. Once the full table is shown, we see the importance of the victim's race in death penalty decisions. African-Americans were sentenced to death more often if the victim was Caucasian (17.5% versus 12.6%) or African-American (5.8% to 0.0%). Why is this so? Because of the dramatic difference in marginal frequencies (i.e., 9 Caucasian defendants with African-American victims versus 103 African-American defendants with African-American victims). When both marginal associations point to a single conclusion (as in the table below) but that conclusion is contradicted when aggregating over a category, this is Simpson's paradox.4

Race of             Race of             Death     Lesser
Defendant           Victim              Penalty   Sentence
Caucasian           Caucasian           19        132
Caucasian           African-American    0         9
African-American    Caucasian           11        52
African-American    African-American    6         97

4 Note that other covariate information about the defendant and victim, such as income or wealth, might have led to similar results.

9.9 EXERCISES

9.1. Duke University has always been known for its great school spirit, especially when it comes to Men's basketball. One way that school enthusiasm is shown is by donning Duke paraphernalia including shirts, hats, shorts and sweat-shirts. A class of Duke students explored possible links between school spirit (measured by the number of students wearing paraphernalia) and some other attributes. It was hypothesized that males would wear Duke clothes more frequently than females. The data were collected on the Bryan Center walkway starting at 12:00 pm on ten different days. Each day 50 men and 50 women were tallied. Do the data bear out this claim?

          Duke Paraphernalia   No Duke Paraphernalia   Total
Male      131                  369                     500
Female    52                   448                     500
Total     183                  817                     1000

9.2. Gene Siskel and Roger Ebert hosted the most famous movie review shows in history. Below are their respective judgments on 43 films that were released in 1995. Each critic gives his judgment with a "thumbs
up" or "thumbs down." Do they have the same likelihood of giving a movie a positive rating?

                              Ebert's Review
                              Thumbs Up   Thumbs Down
Siskel's      Thumbs Up       18          6
Review        Thumbs Down     9           10

9.3. Bickel, Hammel, and O'Connell (1975) investigated whether there was any evidence of gender bias in graduate admissions at the University of California at Berkeley. The table below comes from their cross-classification of 4,526 applications to graduate programs in 1973 by gender (male or female), admission (whether or not the applicant was admitted to the program) and program (A, B, C, D, E or F). What does the data reveal?

Program   Admit       Male   Female
A         Admitted    512    89
A         Rejected    313    19
B         Admitted    353    17
B         Rejected    207    8
C         Admitted    120    202
C         Rejected    205    391
D         Admitted    138    131
D         Rejected    279    244
E         Admitted    53     94
E         Rejected    138    299
F         Admitted    22     24
F         Rejected    351    317

9.4. When an epidemic of severe intestinal disease occurred among workers in a plant in South Bend, Indiana, doctors said that the illness resulted from infection with the amoeba Entamoeba histolytica.5

5 Source: J. E. Cohen (1973). Independence of Amoebas. In Statistics by Example: Weighing Chances, edited by F. Mosteller, R. S. Pieters, W. H. Kruskal, G. R. Rising, and R. F. Link, with the assistance of R. Carlson and M. Zelinka, p. 72. Addison-Wesley: Reading, MA.

There are actually two races of these amoebas, large and small, and the large ones were
believed to be causing the disease. Doctors suspected that the presence of the small ones might help people resist infection by the large ones. To check on this, public health officials chose a random sample of 138 apparently healthy workers and determined if they were infected with either the large or small amoebas. The table below gives the resulting data. Is the presence of the large race independent of the presence of the small one?

               Large Race
Small Race     Present   Absent   Total
Present        12
Absent         35        68
Total          47        91       138

9.5. A study was designed to test whether or not aggression is a function of anonymity. The study was conducted as a field experiment on Halloween; 300 children were observed unobtrusively as they made their rounds. Of these 300 children, 173 wore masks that completely covered their faces, while 127 wore no masks. It was found that 101 children in the masked group displayed aggressive or antisocial behavior versus 36 children in the unmasked group. What conclusion can be drawn? State your conclusion in terminology of the problem, using $\alpha = 0.01$.

9.6. Deathbed scenes in which a dying mother or father holds to life until after the long-absent son returns home and dies immediately after are all too familiar in movies. Do such things happen in everyday life? Are some people able to postpone their death until after an anticipated event takes place? It is believed that famous people do so with respect to their birthdays to which they attach some importance. A study by David P. Phillips (in Tanur, 1972, pp. 52-65) seems to be consistent with the notion. Phillips obtained data6 on months of birth and death of 1251 famous Americans; the deaths were classified by the time period between the birth dates and death dates as shown in the table below. What do the data suggest?

before                               Birth    after
6    5    4    3    2    1           Month    1    2    3    4    5
90   100  87   96   101  86          119      118  121  114  113  106

6 348 were people listed in Four Hundred Notable Americans and 903 are listed as foremost families in three volumes of Who Was Who for the years 1951-60, 1943-50 and 1897-1942.
9.7. Using a calculator, mimic the MATLAB results for $X^2$ from the Benford's law example (Example 9.3). Here are some theoretical frequencies rounded to 2 decimal places:

92.41   54.06   29.75   24.31   15.72   14.06

Use $\chi^2$ tables and compare $X^2$ with the critical $\chi^2$ quantile at $\alpha = 0.05$.

9.8. Assume that a contingency table has two rows and two columns with frequencies of $a$ and $b$ in the first row and frequencies of $c$ and $d$ in the second row.
(a) Verify that the $\chi^2$ test statistic can be expressed as

$$X^2 = \frac{(a + b + c + d)(ad - bc)^2}{(a + b)(c + d)(b + d)(a + c)}.$$

(b) Let $\hat{p}_1 = a/(a + c)$ and $\hat{p}_2 = b/(b + d)$. Show that the square of the test statistic

$$z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\frac{\bar{p}\,\bar{q}}{a + c} + \frac{\bar{p}\,\bar{q}}{b + d}}}, \quad \text{where} \quad \bar{p} = \frac{a + b}{a + b + c + d} \quad \text{and} \quad \bar{q} = 1 - \bar{p},$$

coincides with $X^2$ from (a).

9.9. Generate a sample of size $n = 216$ from $N(0, 1)$. Select intervals by partitioning $\mathbb{R}$ at points -2.7, -2.2, -2, -1.7, -1.5, -1.2, -1, -0.8, -0.5, -0.3, 0, 0.2, 0.4, 0.9, 1, 1.4, 1.6, 1.9, 2, 2.5, and 2.8. Using a $\chi^2$-test, confirm the normality of the sample. Repeat this procedure using a sample contaminated by the Cauchy distribution in the following way: 0.95*normal_sample + 0.05*cauchy_sample.

9.10. It is well known that when the arrival times of customers constitute a Poisson process with the rate $\lambda t$, the inter-arrival times follow an exponential distribution with density $f(t) = \lambda e^{-\lambda t}$, $t \ge 0$, $\lambda > 0$. It is often of interest to establish that the process is Poisson because many theoretical results are available for such processes, ubiquitous in the domain of Industrial Engineering. In the following example, $n = 109$ inter-arrival times of an arrival process were recorded, averaged ($\bar{x} = 2.5$) and categorized into time intervals as follows:

Frequency   34   20   16   15   9   7   8
Test the hypothesis that the process described with the above inter-arrival times is Poisson, at level $\alpha = 0.05$. You must first estimate $\lambda$ from the data.

9.11. In a long study of heart disease, the day of the week on which 63 seemingly healthy men died was recorded. These men had no history of disease and died suddenly.

Day of Week      Mon.   Tues.   Weds.   Thurs.   Fri.   Sat.   Sun.
No. of Deaths    22     7       6       13       5      4      6

(i) Test the hypothesis that these men were just as likely to die on one day as on any other. Use $\alpha = 0.05$.
(ii) Explain in words what constitutes Type II error in the above testing.

9.12. Write a MATLAB function mcnemar.m. If $b + c \ge 20$, use the $\chi^2$ approximation. If $b + c < 20$ use exact binomial p-values. You will need chi2cdf and binocdf. Use your program to solve exercise 9.4.

9.13. Doucet et al. (1999) compared applications to different primary care programs at Tulane University. The "Medicine/Pediatrics" program students are trained in both primary care specialties. The results for 148 survey responses, in the table below, are broken down by race. Does ethnicity seem to be a factor in program choice?

             Medical School Applicants
Ethnicity    Medicine   Pediatrics   Medicine/Pediatrics
White        30         35           19
Black        11         6            9
Hispanic     3          9            6
Asian        9          3            8

9.14. The Donner party is the name given to a group of emigrants, including the families of George Donner and his brother Jacob, who became trapped in the Sierra Nevada mountains during the winter of 1846-47. Nearly half of the party died. The experience has become legendary as one of the most spectacular episodes in the record of Western migration in the United States. In total, of the 89 men, women and children in the Donner party, 48 survived, 41 died. The following table gives the numbers of males/females according to their survival status:

            Male   Female
Died        32     9
Survived    23     25
Test the hypothesis that in the population consisting of members of Donner's Party the gender and survival status were independent. Use $\alpha = 0.05$. The following table gives the numbers of males/females who survived according to their age (children/adults). Test the hypothesis that in the population consisting of surviving members of Donner's Party the gender and age were independent. Use $\alpha = 0.05$.

          Adult   Children
Male              16
Female            15

Fig. 9.4 Surviving daughters of George Donner, Georgia (4 y.o.) and Eliza (3 y.o.) with their adoptive mother Mary Brunner.

Interesting facts (not needed for the solution):
Two-thirds of the women survived; two-thirds of the men died.
Four girls aged three and under died; two survived. No girls between the ages of 4 and 16 died.
Four boys aged three and under died; none survived. Six boys between the ages of 4 and 16 died.
All the adult males who survived the entrapment (Breen, Eddy, Foster, Keseberg) were fathers.
All the bachelors (single males over age 21) who were trapped in the Sierra died. Jean-Baptiste Trudeau and Noah James survived the entrapment, but were only about 16 years old and are not considered bachelors.

9.15. West of Tokyo lies a large alluvial plain, dotted by a network of farming villages. Matui (1968) analyzed the position of the 911 houses making up one of those villages. The area studied was a rectangle, 3 km by 4 km. A grid was superimposed over a map of the village, dividing its
12 square kilometers into 1200 plots, each 100 meters on a side. The number of houses on each of those plots was recorded in a 30 by 40 matrix of data. Test the hypothesis that the distribution of number of houses per plot is Poisson. Use $\alpha = 0.05$.

Number       0     1     2     3    4   ≥5
Frequency    584   398   168   35   9   6

Hint: Assume that parameter $\lambda = 0.76$ (approximately the ratio 911/1200). Find theoretical frequencies first. For example, the theoretical frequency for Number = 2 is $np_2 = 1200 \times 0.76^2/2! \times \exp\{-0.76\} = 162.0745$, while the observed frequency is 168. Subtract an additional degree of freedom because $\lambda$ is estimated from the data.

Fig. 9.5 (a) Matrix of 1200 plots (30 x 40). Lighter color corresponds to higher number of houses; (b) Histogram of number of houses per plot.

9.16. A poll was conducted to determine if perceptions of the hazards of smoking were dependent on whether or not the person smoked. One hundred people were randomly selected and surveyed. The results are given below.

              Very                       Somewhat     Not
              Dangerous    Dangerous    Dangerous    Dangerous
              [code 0]     [code 1]     [code 2]     [code 3]
Smokers       11 (18.13)   15 (15.19)   14 (9.80)    9 ( )
Nonsmokers    26 (18.87)   16 ( )       6 ( )        3 (6.12)

(a) Test the hypothesis that smoking status does not affect perception of the dangers of smoking at $\alpha = 0.05$ (five theoretical/expected frequencies are given in the parentheses).
(b) Observed frequencies of perceptions of danger [codes] for smokers are

[code 0]   [code 1]   [code 2]   [code 3]
11         15         14         9

Are the codes coming from a discrete uniform distribution (i.e., each code is equally likely)? Use $\alpha = 0.01$.

REFERENCES

Agresti, A. (1992), Categorical Data Analysis, 2nd ed., New York: Wiley.
Benford, F. (1938), "The Law of Anomalous Numbers," Proceedings of the American Philosophical Society, 78, 551.
Bickel, P. J., Hammel, E. A., and O'Connell, J. W. (1975), "Sex Bias in Graduate Admissions: Data from Berkeley," Science, 187, 398-404.
Cochran, W. G. (1950), "The Comparison of Percentages in Matched Samples," Biometrika, 37, 256-266.
Darwin, C. (1859), The Origin of Species by Means of Natural Selection, 1st ed., London, UK: Murray.
Deonier, R. C., Tavare, S., and Waterman, M. S. (2005), Computational Genome Analysis: An Introduction, New York: Springer Verlag.
Doucet, H., Shah, M. K., Cummings, T. L., and Kahm, M. J. (1999), "Comparison of Internal Medicine, Pediatric and
Medicine/Pediatrics Applicants and Factors Influencing Career Choices," Southern Medical Journal, 92, 296-299.
Fisher, R. A. (1918), "The Correlation Between Relatives on the Supposition of Mendelian Inheritance," Philosophical Transactions of the Royal Society of Edinburgh, 52, 399-433.
Fisher, R. A. (1922), "On the Interpretation of Chi-square from Contingency Tables, and the Calculation of P," Journal of the Royal Statistical Society, 85, 87-94.
Fisher, R. A. (1966), The Design of Experiments, 8th ed., Edinburgh, UK: Oliver and Boyd.
Gilb, T., and Graham, D. (1993), Software Inspection, Reading, MA: Addison-Wesley.
Hill, T. (1998), "The First Digit Phenomenon," American Scientist, 86, 358.
Johnson, S., and Johnson, R. (1972), "Tonsillectomy History in Hodgkin's Disease," New England Journal of Medicine, 287, 1122-1125.
Liu, Z. (1992), "Smoking and Lung Cancer in China: Combined Analysis of Eight Case-Control Studies," International Journal of Epidemiology, 21, 197-201.
Mantel, N., and Haenszel, W. (1959), "Statistical Aspects of the Analysis of Data from Retrospective Studies of Disease," Journal of the National Cancer Institute, 22, 719-729.
Matui, I. (1968), "Statistical Study of the Distribution of Scattered Villages in Two Regions of the Tonami Plain, Toyama Prefecture," in Spatial Patterns, Eds. Berry and Marble, Englewood Cliffs, NJ: Prentice-Hall.
McNemar, Q. (1947), "A Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages," Psychometrika, 12, 153-157.
McNemar, Q. (1960), "At Random: Sense and Nonsense," American Psychologist, 15, 295-300.
McNemar, Q. (1969), Psychological Statistics, 4th Edition, New York: Wiley.
McWilliams, W. C., and Piotrowski, H. (2005), The World Since 1945: A History of International Relations, Lynne Rienner Publishers.
Pearson, K. (1900), "On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is such that it can be Reasonably Supposed to have Arisen from Random Sampling," Philosophical Magazine, 50, 157-175.
Radelet, M. (1981), "Racial Characteristics and the Imposition of the Death Penalty," American Sociological Review, 46, 918-927.
Rasmussen, M. H., and Miller, L. A. (2004), "Echolocation and Social Signals from White-beaked Dolphins, Lagenorhyncus albirostris, recorded in Icelandic waters," in Echolocation in Bats and Dolphins, ed. J. A. Thomas, et al., Chicago: University of Chicago Press.
Simonoff, J. S. (2003), Analyzing Categorical Data, New York: Springer Verlag.
Tanur, J. M., ed. (1972), Statistics: A Guide to the Unknown, San Francisco: Holden-Day.
von Bortkiewicz, L. (1898), "Das Gesetz der Kleinen Zahlen," Leipzig, Germany: Teubner.

10 Estimating Distribution Functions

The harder you fight to hold on to specific assumptions, the more likely there's gold in letting go of them.
John Seely Brown, former Chief Scientist at Xerox Corporation

10.1 INTRODUCTION

Let $X_1, X_2, \dots, X_n$ be a sample from a population with continuous CDF $F$. In Chapter 3, we defined the empirical (cumulative) distribution function (EDF) based on a random sample as

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(X_i \le x).$$

Because $F_n(x)$, for a fixed $x$, has a sampling distribution directly related to the binomial distribution, its properties are readily apparent and it is easy to work with as an estimating function.

The EDF provides a sound estimator for the CDF, but not through any methodology that can be extended to general estimation problems in nonparametric statistics. For example, what if the sample is right truncated? Or censored? What if the sample observations are not independent or identically distributed? In standard statistical analysis, the method of maximum likelihood provides a general methodology for achieving inference procedures on
unknown parameters, but in the nonparametric case, the unknown parameter is the function $F(x)$ (or, equivalently, the survival function $S(x) = 1 - F(x)$). Essentially, there are an infinite number of parameters. In the next section we develop a general formula for estimating the distribution function for non-i.i.d. samples. Specifically, the Kaplan-Meier estimator is constructed to estimate $F(x)$ when censoring is observed in the data.

This theme continues in Chapter 11, where we introduce Density Estimation as a practical alternative to estimating the CDF. Unlike the cumulative distribution, the density function provides a better visual summary of how the random variable is distributed. Corresponding to the EDF, the empirical density function is a discrete uniform probability distribution on the observed data, and its graph doesn't explain much about the distribution of the data. The properties of the more refined density estimators in Chapter 11 are not so easily discerned, but they give the researcher a smoother and visually more interesting estimator to work with.

In medical research, survival analysis is the study of lifetime distributions along with associated factors that affect survival rates. The time event might be an organism's death, or perhaps the occurrence or recurrence of a disease or symptom.

10.2 NONPARAMETRIC MAXIMUM LIKELIHOOD

As a counterpart to the parametric likelihood, we define the nonparametric likelihood of the sample $X_1, \ldots, X_n$ as

$$L(F) = \prod_{i=1}^{n} \left( F(x_i) - F(x_i^-) \right), \qquad (10.1)$$

where $F(x_i^-)$ is defined as $P(X < x_i)$. [...] Let $p_i = F(x_i) - F(x_i^-)$ denote the probability mass that $F$ places on $x_i$. Then $p_i > 0$ is required, or else $L(F) = 0$. We also know that $p_1 + \cdots + p_n = 1$, because if the sum is less than one, there would be probability mass assigned outside the set $x_1, \ldots, x_n$. That would be impractical because if we reassigned that residual probability mass (say $q = 1 - p_1 - \cdots - p_n > 0$) to any one of the values $x_i$,
the likelihood $L(F)$ would increase in the term $F(x_i) - F(x_i^-) = p_i + q$. So the NPMLE not only assigns probability mass to every observation, but only to that set, hence the likelihood can be equivalently expressed as

$$L(p_1, \ldots, p_n) = \prod_{i=1}^{n} p_i,$$

which, under the constraint that $\sum p_i = 1$, is the multinomial likelihood. The NPMLE is easily computed as $\hat{p}_i = 1/n$, $i = 1, \ldots, n$. Note that this solution is quite intuitive: it places equal "importance" on all $n$ of the observations, and it satisfies the constraint given above that $\sum p_i = 1$. This essentially proves the following theorem.

Theorem 10.1 Let $X_1, \ldots, X_n$ be a random sample generated from $F$. For any distribution function $F_0$, the nonparametric likelihood $L(F_0) \le L(F_n)$, so that the empirical distribution function is the nonparametric maximum likelihood estimator.

10.3 KAPLAN-MEIER ESTIMATOR

The nonparametric likelihood can be generalized to all sorts of observed data sets beyond a simple i.i.d. sample. The most commonly observed phenomenon outside the i.i.d. case involves censoring. To describe censoring, we will consider $X > 0$, because most problems involving censoring consist of lifetime measurements (e.g., time until failure).

Fig. 10.1 (a) Edward Kaplan (1920-2006) and (b) Paul Meier (1924-).

Definition 10.1 Suppose $X$ is a lifetime measurement. $X$ is right censored at time $t$ if we know the failure time occurred after time $t$, but the actual time
is unknown. $X$ is left censored at time $t$ if we know the failure time occurred before time $t$, but the actual time is unknown.

Definition 10.2 Type-I censoring occurs when $n$ items on test are stopped at a fixed time $t_0$, at which time all surviving test items are taken off test and are right censored.

Definition 10.3 Type-II censoring occurs when $n$ items $(X_1, \ldots, X_n)$ on test are stopped after a prefixed number of them (say, $k \le n$) have failed, leaving the remaining items to be right censored at the random time $t = X_{k:n}$.

Type I censoring is a common problem in drug treatment experiments based on human trials; if a patient receiving an experimental drug is known to survive up to a time $t$ but leaves the study (and humans are known to leave such clinical trials much more frequently than lab mice), the lifetime is right censored.

Suppose we have a sample of possibly right-censored values. We will assume the random variables represent lifetimes (or "occurrence times"). The sample is summarized as $\{(X_i, \delta_i),\ i = 1, \ldots, n\}$, where $X_i$ is a time measurement, and $\delta_i$ equals 1 if $X_i$ represents the lifetime, and equals 0 if $X_i$ is a (right) censoring time. If $\delta_i = 1$, $X_i$ contributes $dF(x_i) = F(x_i) - F(x_i^-)$ to the likelihood (as it does in the i.i.d. case). If $\delta_i = 0$, we know only that the lifetime surpassed time $X_i$, so this event contributes $1 - F(x_i)$ to the likelihood. Then

$$L(F) = \prod_{i=1}^{n} \left(1 - F(x_i)\right)^{1-\delta_i} \left(dF(x_i)\right)^{\delta_i}. \qquad (10.2)$$

The argument about the NPMLE has changed from (10.1). In this case, no probability mass need be assigned to a value $X_i$ for which $\delta_i = 0$, because in that case, $dF(X_i)$ does not appear in the likelihood. Furthermore, the accumulated probability mass of the NPMLE on the observed data does not necessarily sum to one, because if the largest value of $X_i$ is a censored observation, the term $S(X_i) = 1 - F(X_i)$ will only be positive if probability mass is assigned to a point or interval to the right of $X_i$.

Let $p_i$ be the probability mass assigned to $X_{i:n}$. This new notation allows for positive probability mass (call it $p_{n+1}$) that can be assigned to some arbitrary point or interval after the last observation $X_{n:n}$. Let $\delta_i$ be the censoring indicator associated with $X_{i:n}$. Note that even though $X_{1:n} < \cdots < X_{n:n}$ are ordered, the set $(\delta_1, \ldots, \delta_n)$ is not necessarily so ($\delta_i$ is called a concomitant).

If $\delta_i = 1$, the likelihood is clearly maximized by setting probability mass (say $p_i$) on $X_{i:n}$. If $\delta_i = 0$, some mass will be assigned to the right of $X_{i:n}$, which has interval probability $p_{i+1} + \cdots + p_{n+1}$. The likelihood based on censored data is expressed

$$L(p_1, \ldots, p_{n+1}) = \prod_{i=1}^{n} p_i^{\delta_i} \left( \sum_{j=i+1}^{n+1} p_j \right)^{1-\delta_i}.$$

Instead of maximizing the likelihood in terms of $(p_1, \ldots, p_{n+1})$, it will prove to be much easier using the transformation

$$\lambda_i = \frac{p_i}{\sum_{j=i}^{n+1} p_j}.$$

This is a convenient one-to-one mapping where

$$\sum_{j=i}^{n+1} p_j = \prod_{j=1}^{i-1} (1 - \lambda_j), \qquad p_i = \lambda_i \prod_{j=1}^{i-1} (1 - \lambda_j).$$

The likelihood simplifies to

$$L(\lambda_1, \ldots, \lambda_{n+1}) = \prod_{i=1}^{n} \left( \lambda_i \prod_{j=1}^{i-1} (1-\lambda_j) \right)^{\delta_i} \left( \prod_{j=1}^{i} (1-\lambda_j) \right)^{1-\delta_i} = \prod_{i=1}^{n} \lambda_i^{\delta_i} (1-\lambda_i)^{n-i+1-\delta_i}.$$

As a function of $(\lambda_1, \ldots, \lambda_{n+1})$, $L$ is maximized at $\hat{\lambda}_i = \delta_i/(n-i+1)$, $i = 1, \ldots, n$ (with $\lambda_{n+1} = 1$ by definition). Equivalently,

$$\hat{p}_i = \frac{\delta_i}{n-i+1} \prod_{j=1}^{i-1} \left( 1 - \frac{\delta_j}{n-j+1} \right).$$

The NPMLE of the distribution function (denoted $F_{KM}(x)$) can be expressed as a sum in $p_i$. For example, at the observed order statistics, we see that

$$1 - F_{KM}(x_{i:n}) = \prod_{j=1}^{i} \left( 1 - \frac{\delta_j}{n-j+1} \right).$$

This is the Kaplan-Meier nonparametric estimator, developed by Kaplan and Meier (1958) for censored lifetime data analysis. It has been one of the most influential developments in the past century; their paper is the most cited paper in statistics (Stigler, 1994). E. L. Kaplan and Paul Meier never actually met during this time, but they both submitted their idea of the "product limit estimator" to the Journal of the American Statistical Association at approximately the same time, so their joint results were amalgamated through letter correspondence.

For non-censored observations, the Kaplan-Meier estimator is identical to the regular MLE. The difference occurs when there is a censored observation: then the Kaplan-Meier estimator takes the "weight" normally assigned to that observation and distributes it evenly among all observed values to the right of the observation. This is intuitive because we know that the true value of the censored observation must be somewhere to the right of the censored value, but we don't have any more information about what the exact value should be.

The estimator is easily extended to sets of data that have potential tied values. If we define $d_j$ = number of failures at $x_j$, $m_j$ = number of observations that had survived up to $x_j^-$, then

$$F_{KM}(t) = 1 - \prod_{x_j \le t} \left( 1 - \frac{d_j}{m_j} \right). \qquad (10.4)$$

Example 10.1 Muenchow (1986) tested whether male or female flowers (of Western White Clematis) were equally attractive to insects. The data in Table 10.15 represent waiting times (in minutes), which include censored data. In MATLAB, use the function KMcdfSM(x,y,j), where x is a vector of event times, y is a vector of zeros (indicating censor) and ones (indicating failure), and j = 1 indicates the vector values are ordered (j = 0 means the data will be sorted first).

Table 10.15 Waiting Times for Insects to Visit Flowers (* marks a censored time)

      Male Flowers            Female Flowers
    1     9    27           1    19    57
    1     9    27           2    23    59
    2     9    30           4    23    67
    2    11    31           4    26    71
    4    11    35           5    28    75
    4    14    36           6    29    75*
    5    14    40           7    29    78*
    5    14    43           7    29    81
    6    16    54           8    30    90*
    6    16    61           8    32    94*
    6    17    68           8    35    96
    7    17    69           9    35    96*
    7    18    70          14    37   100*
    8    19    83          15    39   102*
    8    19    95          18    43   105*
    8    19   102*         18    56
              104*

Fig. 10.2 Kaplan-Meier estimator for Waiting Times (solid line for male flowers, dashed line for female flowers).

Example 10.2 Data from Crowder et al. (1991) lists strength measurements (in coded units) for 48 pieces of weathered cord. Seven of the pieces of cord were damaged and yielded strength measurements that are considered right censored. That is, because the damaged cord was taken off test, we know only the lower limit of its strength. In the MATLAB code below, vector data represents the strength measurements, and the vector censor indicates (with a zero) if the corresponding observation in data is censored.

>> data = [36.3, 41.7, 43.9, 49.9, 50.1, 50.8, 51.9, 52.1, 52.3, 52.3, 52.4, 52.6, ...
     52.7, 53.1, 53.6, 53.6, 53.9, 53.9, 54.1, 54.6, 54.8, 54.8, 55.1, 55.4, 55.9, ...
     56.0, 56.1, 56.5, 56.9, 57.1, 57.1, 57.3, 57.7, 57.8, 58.1, 58.9, 59.0, 59.1, ...
     59.6, 60.4, 60.7, 26.8, 29.6, 33.4, 35.0, 40.0, 41.9, 42.5];
>> censor = [ones(1,41), zeros(1,7)];
>> [kmest, sortdat, sortcen] = kmcdfsm(data', censor', 0);
>> plot(sortdat, kmest, 'k');

Fig. 10.3 Kaplan-Meier estimator cord strength (in coded units).

The table below shows how the Kaplan-Meier estimator is calculated using the formula in (10.4) for the first 16 measurements, which include seven censored observations. Figure 10.3 shows the estimated survival function for the cord strength data.
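As a rough cross-check of formula (10.4), the product-limit estimate can be computed with base MATLAB alone. The sketch below is not the book's kmcdfsm routine; the variable names (delta, t, S) are chosen here for illustration, and it assumes the coding 1 = observed failure, 0 = right-censored value.

```matlab
% Product-limit (Kaplan-Meier) sketch for the cord strength data, base MATLAB only.
data = [36.3, 41.7, 43.9, 49.9, 50.1, 50.8, 51.9, 52.1, 52.3, 52.3, 52.4, 52.6, ...
        52.7, 53.1, 53.6, 53.6, 53.9, 53.9, 54.1, 54.6, 54.8, 54.8, 55.1, 55.4, 55.9, ...
        56.0, 56.1, 56.5, 56.9, 57.1, 57.1, 57.3, 57.7, 57.8, 58.1, 58.9, 59.0, 59.1, ...
        59.6, 60.4, 60.7, 26.8, 29.6, 33.4, 35.0, 40.0, 41.9, 42.5];
delta = [ones(1,41), zeros(1,7)];          % 1 = failure observed, 0 = right censored

t = unique(data(delta == 1));              % distinct failure times, sorted ascending
S = zeros(size(t));                        % S(j) estimates 1 - F_KM(t(j))
s = 1;
for j = 1:numel(t)
    m = sum(data >= t(j));                 % m_j: observations still at risk just before t(j)
    d = sum(data == t(j) & delta == 1);    % d_j: failures recorded at t(j)
    s = s * (1 - d/m);                     % one factor of the product in (10.4)
    S(j) = s;
end
```

With these data the first factor is 43/44, and the estimate at the tied value 52.3 agrees with the 0.758 reported in the hand-calculation table; plotting stairs(t, S) gives a step function of the kind shown in Figure 10.3.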
Uncensored     x_j     m_j    d_j    (m_j - d_j)/m_j    1 - F_KM(x_j)
               26.8    48     0      1.000              1.000
               29.6    47     0      1.000              1.000
               33.4    46     0      1.000              1.000
               35.0    45     0      1.000              1.000
    1          36.3    44     1      0.977              0.977
               40.0    43     0      1.000              0.977
    2          41.7    42     1      0.976              0.954
               41.9    41     0      1.000              0.954
               42.5    40     0      1.000              0.954
    3          43.9    39     1      0.974              0.930
    4          49.9    38     1      0.974              0.905
    5          50.1    37     1      0.973              0.881
    6          50.8    36     1      0.972              0.856
    7          51.9    35     1      0.971              0.832
    8          52.1    34     1      0.971              0.807
    9          52.3    33     2      0.939              0.758

Example 10.3 Consider observing the lifetime of a series system. Recall a series system is a system of $k \ge 1$ components that fails at the time the first component fails. Suppose we observe $n$ different systems that are each made of $k_i$ identical components ($i = 1, \ldots, n$) with lifetime distribution $F$. The lifetime data is denoted $(x_1, \ldots, x_n)$. Further suppose there is (random) right censoring, and $\delta_i = I(x_i \text{ represents a lifetime measurement})$. How do we estimate $F$?

If $F(x)$ is continuous with derivative $f(x)$, then the $i$th system's survival function is $S(x)^{k_i}$ and its corresponding likelihood is $\ell_i(F) = k_i (1 - F(x))^{k_i - 1} f(x)$. It's easier to express the full likelihood in terms of $S(x) = 1 - F(x)$:

$$L(S) = \prod_{i=1}^{n} \left( k_i\, S(x_i)^{k_i - 1} f(x_i) \right)^{\delta_i} \left( S(x_i)^{k_i} \right)^{1-\delta_i},$$

where $1 - \delta$ indicates censoring. To make the likelihood easier to solve, let's examine the ordered sample $y_i = x_{i:n}$ so we observe $y_1 <$ [...] $p$. For example, with the flower data in Table 10.15, the median waiting times are easily estimated as the smallest values ($x$) for which $F_{KM}(x) \ge 1/2$, which are 16 (for the male flowers) and 29 (for the female flowers).

If the data are not i.i.d., the NPMLE $\hat{F}$ can be plugged in for $F$ in $\theta(F)$. This is a key selling point of the plug-in principle: it can be used to formulate estimators where we might have no set rule to estimate them. Depending on the sample, $\hat{F}$ might be the EDF or the Kaplan-Meier estimator. The plug-in technique is simple, and it will form a basis for estimating uncertainty using re-sampling techniques in Chapter 15.

Example 10.5 To find the average cord strength from the censored data, for example, it would be imprudent to merely average the data, as the censored observations represent a lower bound on the data, hence the true mean will be
underestimated. By using the plug-in principle, we will get a more accurate estimate; the code below estimates the mean cord strength as 54.1946 (see also the MATLAB m-file pluginmu). The sample mean, ignoring the censoring indicator, is 51.4438.

>> [cdfy svdata svcensor] = kmcdfsm(vdata, vcensor, ipresorted);
>> if min(svdata) > 0;
     skm = 1 - cdfy;                    % survival function
     skm1 = [1, skm'];
     svdata2 = [0 svdata'];
     svdata3 = [svdata' svdata(end)];
     dx = svdata3 - svdata2;
     mu_hat = skm1 * dx';
   else;
     cdfy1 = [0, cdfy'];
     cdfy2 = [cdfy' 1];
     df = cdfy2 - cdfy1;
     svdata1 = [svdata', 0];
     mu_hat = svdata1 * df';
   end;
>> mu_hat
mu_hat =
   54.1946

10.6 SEMI-PARAMETRIC INFERENCE

The proportional hazards model for lifetime data relates two populations according to a common underlying hazard rate. Suppose $r_0(t)$ is a baseline hazard rate, where $r(t) = f(t)/(1 - F(t))$. In reliability theory, $r(t)$ is called the failure rate. For some covariate $x$ that is observed along with the lifetime, the positive function $\Theta(x)$ describes how the level of $x$ can change the failure rate (and thus the lifetime distribution):

$$r(t; x) = r_0(t)\, \Theta(x).$$

This is termed a semi-parametric model because $r_0(t)$ is usually left unspecified (and thus a candidate for nonparametric estimation) whereas $\Theta(x)$ is a known positive function, at least up to some possibly unknown parameters. Recall that the CDF is related to the failure rate as

$$\int_{-\infty}^{x} r(u)\, du = R(x) = -\ln S(x),$$

where $S(x) = 1 - F(x)$ is called the survivor function. $R(t)$ is called the cumulative failure rate in reliability and life testing. In this case, $S_0(t)$ is the
baseline survivor function, and relates to the lifetime affected by $\Theta(x)$ as

$$S(t; x) = S_0(t)^{\Theta(x)}.$$

The most commonly used proportional hazards model in survival analysis is called the Cox Model (named after Sir David Cox), which has the form

$$r(t; x) = r_0(t)\, e^{x'\beta}.$$

With this model, the (vector) parameter $\beta$ is left unspecified and must be estimated. Suppose the baseline hazard functions of two different populations are related by proportional hazards as $r_1(t) = r_0(t)\lambda$ and $r_2(t) = r_0(t)\theta$. Then if $T_1$ and $T_2$ represent lifetimes from these two populations,

$$P(T_1 < T_2) = \frac{\lambda}{\lambda + \theta}.$$

The probability does not depend at all on the underlying baseline hazard (or survivor) function. With this convenient set-up, nonparametric estimation of $S(t)$ is possible through maximizing the nonparametric likelihood. Suppose $n$ possibly right-censored observations $(x_1, \ldots, x_n)$ from $F = 1 - S$ are observed. Let $\xi_i$ represent the number of observations at risk just before time $x_i$. Then, if $\delta_i = 1$ indicates the lifetime was observed at $x_i$, [...]

In general, the likelihood must be solved numerically. For a thorough study of inference with a semi-parametric model, we suggest Statistical Models and Methods for Lifetime Data by Lawless. This area of research is paramount in survival analysis.

Related to the proportional hazards model is the accelerated lifetime model used in engineering. In this case, the baseline survivor function $S_0(t)$ can represent the lifetime of a test product under usage conditions. In an accelerated life test, an additional stress is put on the test unit, such as high or low temperature, high voltage, high humidity, etc. This stress is characterized through the function $\Theta(x)$ and the survivor function of the stressed test item is

$$S(t; x) = S_0(t\, \Theta(x)).$$

Accelerated life testing is an important tool in product development, especially for electronics manufacturers who produce gadgets that are expected to last several years on test. By increasing the voltage in a particular way, as one example, the lifetimes can be shortened to hours. The key is how much faith
the manufacturer has in the known acceleration function $\Theta(x)$.

In MATLAB, the Statistics Toolbox offers the routine coxphfit, which computes the Cox proportional hazards estimator for input data, much in the same way that kmcdfsm computes the Kaplan-Meier estimator.

10.7 EMPIRICAL PROCESSES

If we express the sample as $X_1(\omega), \ldots, X_n(\omega)$, we note that $F_n(x)$ is both a function of $x$ and $\omega \in \Omega$. From this, the EDF can be treated as a random process. The Glivenko-Cantelli Theorem from Chapter 3 states that the EDF $F_n(x)$ converges to $F(x)$ (i) almost surely (as a random variable, $x$ fixed), and (ii) uniformly in $x$ (as a function of $x$ with $\omega$ fixed). This can be expressed as:

$$P\left( \lim_{n \to \infty} \sup_x |F_n(x) - F(x)| = 0 \right) = 1.$$

Let $W(x)$ be a standard Brownian motion process. It is defined as a stochastic process for which $W(0) = 0$, $W(t) \sim N(0, t)$, $W(t)$ has independent increments, and the paths of $W(t)$ are continuous. A Brownian Bridge is defined as $B(t) = W(t) - tW(1)$, $0 \le t \le 1$. Both ends of a Brownian Bridge, $B(0)$ and $B(1)$, are tied to 0, and this property motivates the name. A Brownian motion $W(x)$ has covariance function $\gamma(t, s) = t \wedge s = \min(t, s)$. This is because $E(W(t)) = 0$, $\mathrm{Var}(W(s)) = s$, for $s <$ [...] $0.2585$. The confidence interval is computed as (28.78 ksi, 33.02 ksi).

>> x = [18.83 20.8 21.657 23.03 23.23 24.05 24.321 25.5 25.52 25.8 ...
        26.69 26.77 26.78 27.05 27.67 29.9 31.11 33.2 33.73 33.76 33.89 ...
        34.76 35.75 35.91 36.98 37.08 37.09 39.58 44.045 45.29 45.381];
>> n = size(x); i = 1;
>> for mu = min(x):0.1:max(x)
     R_mu = elm(x, mu, zeros(1,1), 100, 1e-7, 1e-9, 0);
     ELR_mu(i) = R_mu; Mu(i) = mu; i = i + 1;
   end

Fig. 10.4 Empirical likelihood ratio as a function of (a) the mean and (b) the median (for different samples).

Owen's extension of Wilks' theorem for parametric likelihood ratios is valid for other functions of $F$, including the variance, quantiles and more. To construct $R$ for the median, we need only change the structural constraint from $\sum p_i x_i = \mu$ to $\sum p_i\, \mathrm{sign}(x_i - x_{0.50}) = 0$.

Confidence Interval for the Median. In general, computing $R(x)$ is difficult. For the case of estimating a population quantile, however, the optimizing becomes rather easy. For example, suppose that $n_1$ observations out of $n$ are less than the population median $x_{0.50}$ and $n_2 = n - n_1$ observations are greater than $x_{0.50}$. Under the constraint $\hat{x}_{0.50} = x_{0.50}$, the nonparametric likelihood estimator assigns mass $(2n_1)^{-1}$ to each observation less than $x_{0.50}$ and assigns mass $(2n_2)^{-1}$ to each observation to the right of $x_{0.50}$, leaving us with

$$R(x_{0.50}) = \left( \frac{n}{2n_1} \right)^{n_1} \left( \frac{n}{2n_2} \right)^{n_2}.$$

Example 10.7 Figure 10.4(b), based on the MATLAB code below, shows the empirical likelihood for the median based on 30 randomly generated numbers from the exponential distribution (with $\mu = 1$ and $x_{0.50} = -\ln(0.5) = 0.6931$). A 90% confidence interval for $x_{0.50}$, again based on $r_0 > 0.2585$, is (0.3035, 0.9021).

For general problems, computing the empirical likelihood is no easy matter, and to really utilize the method fully, more advanced study is needed. This section provides a modest introduction to let you know what is possible using the empirical likelihood. Students interested in further pursuing this method are recommended to read Owen's book.

10.9 EXERCISES

10.1. With an i.i.d. sample of $n$ measurements, use the plug-in principle to derive an estimator for population variance.

10.2. Twelve people were interviewed and asked how many years they stayed at their first job. Three people are still employed at their first job and have been there for 1.5, 3.0 and 6.2 years. The others reported the following data for years at first job: 0.4, 0.9, 1.1, 1.9, 2.0, 3.3, 5.3, 5.8, 14.0. Using hand calculations, compute a nonparametric estimator for the distribution of $T$ = time spent (in years) at first job. Verify your hand calculations using MATLAB. According to your estimator, what is the estimated probability that a person stays at their job for less than four years? Construct a 95% confidence interval for this estimate.

10.3. Using the estimator in Exercise 10.2, use the plug-in principle to compute the underlying mean number of years a person stays at their first job. Compare it to the faulty estimators based on using (a) only the noncensored items and (b) using the censored times but ignoring the censoring mechanism.

10.4. Consider Example 10.3, where we observe the lifetimes of a series system. We observe $n$ different systems that are each made of $k_i$
identical components ($i = 1, \ldots, n$) with lifetime distribution $F$. The lifetime data is denoted $(x_1, \ldots, x_n)$ and are possibly right censored. Show that if we let $r_j = k_j + \cdots + k_n$, the likelihood can be expressed as (10.5) and solve for the nonparametric maximum likelihood estimator.

10.5. Suppose we observe $m$ different $k$-out-of-$n$ systems and each system contains i.i.d. components (with distribution $F$), and the $i$th system contains $n_i$ components. Set up the nonparametric likelihood function for $F$ based on the $m$ system lifetimes (but do not solve the likelihood).

10.6. Go to the link below to download survival times for 87 people with lupus nephritis. They were followed for 15 or more years after an initial renal biopsy. The duration variable indicates how long the patient had the disease before the biopsy; construct the Kaplan-Meier estimator for survival, ignoring the duration variable.

http://lib.stat.cmu.edu/datasets/lupus

10.7. Recall Exercise 6.3 based on 100 measurements of the speed of light in air. Use empirical likelihood to construct a 90% confidence interval for the mean and median.

http://www.itl.nist.gov/div898/strd/univ/data/Michelso.dat

10.8. Suppose the empirical likelihood ratio for the mean was equal to $R(\mu) = \mu\, \mathbf{1}(0 \le \mu \le 1) + (2 - \mu)\, \mathbf{1}(1 \le \mu \le 2)$. Find a 95% confidence interval for $\mu$.

10.9. The Receiver Operating Characteristic (ROC) curve is a statistical tool to compare diagnostic tests. Suppose we have a sample of measurements (scores) $X_1, \ldots, X_n$ from a diseased population $F(x)$, and a sample of $Y_1, \ldots, Y_m$ from a healthy population $G(y)$. The healthy population has lower scores, so an observation is categorized as being diseased if it exceeds a given threshold value, e.g., if $X > c$. Then the rate of false-positive results would be $P(Y > c)$. The ROC curve is defined as the plot of $R(p) = F(G^{-1}(p))$. The ROC estimator can be computed using the plug-in principle: $\hat{R}(p) = F_n(G_m^{-1}(p))$. A common test to see if the diagnostic test is effective is to see if $R(p)$ remains well above 0.5 for $0 \le p \le 1$. The Area Under the Curve (AUC) is defined as

$$\mathrm{AUC} = \int_0^1 R(p)\, dp.$$
Show that $\mathrm{AUC} = P(X \le Y)$ and show that by using the plug-in principle, the sample estimator of the AUC is equivalent to the Mann-Whitney two-sample test statistic.

REFERENCES

Brown, J. S. (1997), What It Means to Lead, Fast Company, 7. New York, Mansueto Ventures, LLC.

Cox, D. R. (1972), "Regression Models and Life Tables," Journal of the Royal Statistical Society (B), 34, 187-220.

Crowder, M. J., Kimber, A. C., Smith, R. L., and Sweeting, T. J. (1991), Statistical Analysis of Reliability Data, London: Chapman & Hall.

Fuller Jr., E. R., Frieman, S. W., Quinn, J. B., Quinn, G. D., and Carter, W. C. (1994), "Fracture Mechanics Approach to the Design of Glass Aircraft Windows: A Case Study," SPIE Proceedings, Vol. 2286, Society of Photo-Optical Instrumentation Engineers (SPIE), Bellingham, WA.

Greenwood, M. (1926), "The Natural Duration of Cancer," in Reports on Public Health and Medical Subjects, 33, London: H. M. Stationery Office.

Hall, W. J., and Wellner, J. A. (1980), "Confidence Bands for a Survival Curve," Biometrika, 67, 133-143.

Kaplan, E. L., and Meier, P. (1958), "Nonparametric Estimation from Incomplete Observations," Journal of the American Statistical Association, 53, 457-481.

Kiefer, J., and Wolfowitz, J. (1956), "Consistency of the Maximum Likelihood Estimator in the Presence of Infinitely Many Incidental Parameters," Annals of Mathematical Statistics, 27, 887-906.

Lawless, J. F. (1982), Statistical Models and Methods for Lifetime Data, New York: Wiley.

Muenchow, G. (1986), "Ecological Use of Failure Time Analysis," Ecology, 67, 246-250.

Nair, V. N. (1984), "Confidence Bands for Survival Functions with Censored Data: A Comparative Study," Technometrics, 26, 265-275.

Owen, A. B. (1988), "Empirical Likelihood Ratio Confidence Intervals for a Single Functional," Biometrika, 75, 237-249.

Owen, A. B. (1990), "Empirical Likelihood Confidence Regions," Annals of Statistics, 18, 90-120.

Owen, A. B. (2001), Empirical Likelihood, Boca Raton, FL: Chapman & Hall/CRC.

Stigler, S. M. (1994), "Citations Patterns in the Journals of Statistics and Probability," Statistical Science, 9, 94-108.
11 Density Estimation

George McFly: Lorraine, my density has brought me to you.
Lorraine Baines: What?
George McFly: Oh, what I meant to say was...
Lorraine Baines: Wait a minute, don't I know you from somewhere?
George McFly: Yes. Yes. I'm George, George McFly. I'm your density. I mean... your destiny.

From the movie Back to the Future, 1985

Probability density estimation goes hand in hand with nonparametric estimation of the cumulative distribution function discussed in Chapter 10. There, we noted that the density function provides a better visual summary of how the random variable is distributed across its support. Symmetry, skewness, disperseness and unimodality are just a few of the properties that are ascertained when we visually scrutinize a probability density plot.

Recall, for continuous i.i.d. data, the empirical density function places probability mass $1/n$ on each of the observations. While the plot of the empirical distribution function (EDF) emulates the underlying distribution function, for continuous distributions the empirical density function takes no shape beside the changing frequency of discrete jumps of $1/n$ across the domain of the underlying distribution (see Figure 11.2(a)).
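In symbols, the relationship between the two empirical functions can be sketched as follows. The notation is ours rather than the book's: $\delta(\cdot)$ stands for the Dirac point mass and $\hat{f}_n$ for the empirical density.

```latex
\hat{f}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - X_i),
\qquad
F_n(x) = \int_{-\infty}^{x} \hat{f}_n(u)\, du = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(X_i \le x).
```

Integrating the point masses recovers the step function $F_n$, which is why the EDF looks like the underlying CDF while the empirical density itself shows only a row of equal spikes.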
Fig. 11.1 Playfair's 1786 bar chart of wheat prices in England.

11.1 HISTOGRAM

The histogram provides a quick picture of the underlying density by weighting fixed intervals according to their relative frequency in the data. Pearson (1895) coined the term for this empirical plot of the data, but its history goes as far back as the 18th century. William Playfair (1786) is credited with the first appearance of a bar chart (see Figure 11.1) that plotted the price of wheat in England through the 17th and 18th centuries.

In MATLAB, the procedure hist(x) will create a histogram with ten bins using the input vector x. Figure 11.2 shows (a) the empirical density function where vertical bars represent Dirac's point masses at the observations, and (b) a 10-bin histogram for a set of 30 generated $N(0,1)$ random variables. Obviously, by aggregating observations within the disjoint intervals, we get a better, smoother visual construction of the frequency distribution of the sample.

>> x = rand_nor(0,1,30,1);
>> hist(x)
>> histfit(x,1000)

The histogram represents a rudimentary smoothing operation that provides the user a way of visualizing the true empirical density of the sample. Still, this simple plot is primitive, and depends on the subjective choices the user makes for bin widths and number of bins. With larger data sets, we can increase the number of bins while still keeping average bin frequency at a reasonable number, say 5 or more. If the underlying data are continuous, the histogram appears less discrete as the sample size (and number of bins) grow, but with smaller samples, the graph of binned frequency counts will not pick up the nuances of the underlying distribution.
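To make the dependence on bin choice concrete, the binning step can be written out by hand. The following is a minimal base-MATLAB sketch rather than what hist does internally; the seven data values and the choice of four equal-width bins are illustrative assumptions.

```matlab
% Histogram as a crude density estimate: count per bin, then rescale to unit area.
x     = [11 12 12.2 12.3 13 13.7 18];            % small illustrative sample
nbins = 4;                                        % subjective choice of bin count
edges = linspace(min(x), max(x), nbins + 1);      % equal-width bin edges
w     = edges(2) - edges(1);                      % common bin width

counts = zeros(1, nbins);
for k = 1:nbins
    if k < nbins
        counts(k) = sum(x >= edges(k) & x < edges(k+1));    % bins are [left, right)
    else
        counts(k) = sum(x >= edges(k) & x <= edges(k+1));   % last bin also includes its right edge
    end
end

fhat = counts / (numel(x) * w);                   % bar heights; total bar area equals 1
```

Dividing the counts by $n$ times the bin width is what turns frequencies into a density estimate; rerunning the sketch with a different nbins shows how strongly the picture depends on that choice. The bars can be drawn with bar(edges(1:end-1) + w/2, fhat, 1).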
Fig. 11.2 Empirical "density" (a) and histogram (b) for 30 normal $N(0,1)$ variables.

The MATLAB function histfit(x,n) plots a histogram with n bins along with the best fitting normal density curve. Figure 11.3 shows how the appearance of continuity changes as the histogram becomes more refined (with more bins of smaller bin width). Of course, we do not have such luxury with smaller or medium sized data sets, and are more likely left to ponder the question of underlying normality with a sample of size 30, as in Figure 11.2(b).

>> x = rand_nor(0,1,5000,1);
>> histfit(x,10)
>> histfit(x,1000)

If you have no scruples, the histogram provides for you many opportunities to mislead your audience, as you can make the distribution of the data appear differently by choosing your own bin widths centered at a set of points arbitrarily left to your own choosing. If you are completely untrustworthy, you might even consider making bins of unequal length. That is sure to support a conjectured but otherwise unsupportable thesis with your data, and might jump-start a promising career for you in politics.

11.2 KERNEL AND BANDWIDTH

The idea of the density estimator is to spread out the weight of a single observation in a plot of the empirical density function. The histogram, then, is the picture of a density estimator that spreads the probability mass of each sample item uniformly throughout the interval (i.e., bin) it is observed in.
Fig. 11.3 Histograms with normal fit of 5000 generated variables using (a) 10 bins and (b) 50 bins.

Note that the observations are in no way expected to be uniformly spread out within any particular interval, so the mass is not spread equally around the observation unless it happens to fall exactly in the center of the interval.

In this chapter, we focus on the kernel density estimator that more fairly spreads out the probability mass of each observation, not arbitrarily in a fixed interval, but smoothly around the observation, typically in a symmetric way. With a sample $X_1, \ldots, X_n$, we write the density estimator

$$\hat{f}(x) = \frac{1}{n h_n} \sum_{i=1}^{n} K\left( \frac{x - x_i}{h_n} \right), \qquad (11.1)$$

for $X_i = x_i$, $i = 1, \ldots, n$. The kernel function $K$ represents how the probability mass is assigned, so for the histogram, it is just a constant in any particular interval. The smoothing function $h_n$ is a positive sequence of bandwidths analogous to the bin width in a histogram.

The kernel function $K$ has five important properties:

1. $K(x) \ge 0$ for all $x$
2. $K(x) = K(-x)$ for $x > 0$
3. $\int K(u)\, du = 1$
4. $\int u K(u)\, du = 0$
5. $\int u^2 K(u)\, du = \sigma_K^2 <$ [...]

3. Epanechnikov kernel (described below).
4. Box kernel, $K(x) = \mathbf{1}(-c < x < c)/(2c)$, $c > 0$.

While $K$ controls the shape, $h_n$ controls the spread of the kernel. The accuracy of a density estimator can be evaluated using the mean integrated squared error, defined as

$$\mathrm{MISE} = E\left( \int (\hat{f}(x) - f(x))^2\, dx \right) = \int \mathrm{Bias}^2(\hat{f}(x))\, dx + \int \mathrm{Var}(\hat{f}(x))\, dx. \qquad (11.2)$$

To find a density estimator that minimizes the MISE under the five mentioned constraints, we also will assume that $f(x)$ is continuous (and twice differentiable), $h_n \to 0$ and $n h_n \to \infty$ as $n \to \infty$. Under these conditions it can be
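shown that

$$\mathrm{Bias}(\hat{f}(x)) = \frac{\sigma_K^2 h_n^2}{2} f''(x) + O(h_n^4) \quad \text{and} \quad \mathrm{Var}(\hat{f}(x)) = \frac{f(x) R(K)}{n h_n} + O(n^{-1}), \qquad (11.3)$$

where $R(g) = \int g(u)^2\, du$. We determine (and minimize) the MISE by our choice of $h_n$. From the equations in (11.3), we see that there is a tradeoff. Choosing $h_n$ to reduce bias will increase the variance, and vice versa. The choice of bandwidth is important in the construction of $\hat{f}(x)$. If $h$ is chosen to be small, the subtle nuances in the main part of the density will be highlighted, but the tail of the distribution will be unseemly bumpy. If $h$ is chosen large, the tails of the distribution are better handled, but we fail to see important characteristics in the middle quartiles of the data.

By substituting the bias and variance into the formula for (11.2), we minimize MISE with

$$h_n^* = \left( \frac{R(K)}{\sigma_K^4 R(f'')} \right)^{1/5} n^{-1/5}.$$

At this point, we can still choose $K(x)$ and insert a "representative" density for $f(x)$ to solve for the bandwidth. Epanechnikov (1969) showed that, upon substituting in $f(x) = \phi(x)$, the kernel that minimizes MISE is

$$K_E(x) = \begin{cases} \frac{3}{4}(1 - x^2) & |x| \le 1 \\ 0 & |x| > 1. \end{cases}$$

The resulting bandwidth becomes $h_n^* \approx 1.06\, \hat{\sigma}\, n^{-1/5}$, where $\hat{\sigma}$ is the sample standard deviation. This choice relies on the approximation of $\sigma$ for $f(x)$. Alternative approaches, including cross-validation, lead to slightly different answers.

Adaptive kernels were derived to alleviate this problem. If we use a more general smoothing function tied to the density at $x_j$, we could generalize the density estimator as

$$\hat{f}(x) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{h_{n,j}} K\left( \frac{x - x_j}{h_{n,j}} \right). \qquad (11.4)$$

This is an advanced topic in density estimation, and we will not pursue optimal estimators based on adaptive kernels any further here. We will also leave out details about estimator limit properties, and instead point out that if $h_n$ is a decreasing function of $n$, under some mild regularity conditions, $|\hat{f}(x) - f(x)| \xrightarrow{P} 0$. For details and more advanced topics in density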
estimation, see Silverman (1986) and Efromovich (1999).

Fig. 11.5 Density estimation for sample of size n = 7 using various kernels: (all) Normal, (a) Box, (b) Triangle, (c) Epanechnikov.

Fig. 11.6 Density estimation for sample of size n = 7 using various bandwidths.

The (univariate) density estimator from MATLAB, called ksdensity(data), is illustrated in Figure 11.5 using a sample of seven observations. The default estimate is based on a normal kernel; to use another kernel, just enter 'box', 'triangle', or 'epanechnikov' (see code below). Figure 11.5 shows how the normal kernel compares to the (a) box, (b) triangle and (c) epanechnikov kernels. Figure 11.6 shows the density estimator using the same data based on the normal kernel, but using three different bandwidths. Note the optimal bandwidth (0.7449) can be found by allowing a third argument in the ksdensity output.

>> data1 = [11, 12, 12.2, 12.3, 13, 13.7, 18];
>> data2 = [50, 21, 25.5, 40.0, 41, 47.6, 39];
>> [f1,x1] = ksdensity(data1, 'kernel', 'box');
>> plot(x1, f1, '-k')
>> hold on
>> [f2,x2,band] = ksdensity(data1);
>> plot(x2, f2, ':k')
>> band
band =
    0.7449
>> [f1,x1] = ksdensity(data1, 'width', 2);
>> plot(x1, f1, '--k')
>> hold on
>> [f1,x1] = ksdensity(data1, 'width', 1);
>> plot(x1, f1, '-k')
>> [f1,x1] = ksdensity(data1, 'width', .5);
>> plot(x1, f1, ':k')

Censoring. The MATLAB function ksdensity also handles right-censored data by adding an optional vector designating censoring. Although we will not study the details about the way density estimators handle this problem, censored observations are treated in a way similar to nonparametric maximum likelihood, with the weight assigned to the censored observation $x_c$ being distributed proportionally to non-censored observations $x_t \ge x_c$ (see the Kaplan-Meier estimator in Chapter 10). General weighting can also be included in the density estimation for ksdensity with an optional vector of weights.

Example 11.1 Radiation Measurements. In some situations, the experimenter might prefer to subjectively decide on a proper bandwidth instead of the objective choice of bandwidth that minimizes MISE. If outliers and subtle changes in the probability distribution are crucial in the model, a more jagged density estimator (with a smaller bandwidth) might be preferred to the optimal one. In Davies and Gather (1993), 2001 radiation measurements were taken from a balloon at a height of 100 feet. Outliers occur when the balloon rotates, causing the balloon's ropes to block direct radiation from the sun to the measuring device. Figure 11.7 shows two density estimates of the raw data, one based on a narrow bandwidth and the other, smoother density based on a bandwidth 10 times larger (0.01 to 0.1). Both densities are based upon a normal (Gaussian) kernel. While the more jagged estimator does not show the mode and skew of the density as clearly as the smoother estimator, outliers are more easily discerned.

>> T = load('balloondata.txt');
>> T1 = T(:,1); T2 = T(:,2);
>> [f1,x1] = ksdensity(T1, 'width', .01);
>> plot(x1, f1, '-k')
>> hold on
>> [f2,x2] = ksdensity(T1, 'width', .1);
>> plot(x2, f2, ':k')

Fig. 11.7 Density estimation for 2001 radiation measurements using bandwidths band = 0.5 and band = 0.05.

11.2.1 Bivariate Density Estimators

To plot density estimators for bivariate data, a three-dimensional plot can be constructed using MATLAB function kdfft2, noting that both x and y, the vectors designating plotting points for the density, must be of the same size.

In Figure 11.8, (univariate) density estimates are plotted for the seven observations [data1, data2]. In Figure 11.9, kdfft2 is used to produce a two-dimensional density plot for the seven bivariate observations (coupled together).

11.3 EXERCISES

11.1. Which of the following serve as kernel functions for a density estimator? Prove your assertion one way or the other.

a. $K(x) = \mathbf{1}(-1 <$ [...] $\beta_{10}$, $H_1: \beta_1$ [...]

>> n0 = 1000;
>> S = load('activelearning.txt');
>> trad1 = S(:,1); trad2 = S(:,2);
>> act1 = S(:,3); act2 = S(:,4);
>> trad = [trad1 trad2]; act = [act1 act2];
>> r = zeros(n0,1); p = zeros(n0,1); b = zeros(n0,1);
>> for i = 1:n0
     b(i) = (i - (n0/2))/(n0/2);
     [r0 z0 p0] = spear(act1, act2 - b(i)*act1);
     r(i) = r0; p(i) = p0;
   end
>> stairs(b, p, ':k')
>> hold on
>> stairs(b, r, '-k')

Fig. 12.1 (a) Plot of test #1 scores (during term) and test #2 scores (8 months after). (b) Plot of Spearman correlation coefficient (solid) and corresponding p-value (dotted) for nonparametric test of slope for $-1 \le \beta_0 \le 1$.
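The spear routine called in the loop is one of the book's m-files. For readers without it, the same kind of slope scan can be sketched in base MATLAB, using the fact that, when there are no ties, Spearman's coefficient equals $1 - 6\sum d_i^2/(n(n^2-1))$ with $d_i$ the rank differences. The data below are made up for illustration (roughly $y = 2x$ plus noise) and are not the active-learning scores; p-values are not computed.

```matlab
% Spearman coefficient between x and y - b*x over a grid of candidate slopes b.
x = [1 2 3 4 5 6 7 8];
y = [2.3 3.8 6.1 7.6 10.5 11.9 14.2 15.7];       % made-up responses, no ties
b = 0:1:4;                                        % candidate slopes
n = numel(x);

[~, ix] = sort(x);
rx = zeros(1, n);  rx(ix) = 1:n;                  % ranks of x (valid when there are no ties)

rho = zeros(size(b));
for k = 1:numel(b)
    e = y - b(k) * x;                             % what is left of y after removing slope b(k)
    [~, ie] = sort(e);
    re = zeros(1, n);  re(ie) = 1:n;              % ranks of e
    rho(k) = 1 - 6 * sum((rx - re).^2) / (n * (n^2 - 1));
end
```

For slopes well below the true trend the coefficient stays at 1, for slopes well above it drops to -1, and it passes near zero around the slope that best explains the data; this sign change is what the nonparametric test of slope exploits.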
12.2.1 Sen-Theil Estimator of Regression Slope

Among $n$ bivariate observations, there are $\binom{n}{2}$ different pairs $(X_i, Y_i)$ and $(X_j, Y_j)$, $i \ne j$. For each pair $(X_i, Y_i)$ and $(X_j, Y_j)$, $1 \le i <$ [...] $0$. This function uses squared-error loss for small errors, but the loss function flattens out for larger errors.

12.3.3 Least Trimmed Squares Regression

Least Trimmed Squares (LTS) is another robust regression technique, proposed by Rousseeuw (1985) as a robust alternative to ordinary least squares regression. Within the context of the linear model $y_i = \beta' x_i$, $i = 1, \ldots, n$, the LTS estimator is represented by the value of $\beta$ that minimizes $\sum_{i=1}^{h} r_{i:n}$. Here, $x_i$ is a $p \times 1$ vector, $r_{i:n}$ is the $i$th order statistic from the squared residuals $r_i = (y_i - \beta' x_i)^2$, and $h$ is a trimming constant ($n/2 \le h \le n$) chosen so that the largest $n - h$ residuals do not affect the model estimate. Rousseeuw and Leroy (1987) showed that the LTS estimator has its highest level of robustness when $h = \lfloor n/2 \rfloor + \lfloor (p+1)/2 \rfloor$. While choosing $h$ to be low leads to a more robust estimator, there is a tradeoff of robustness for efficiency.

12.3.4 Weighted Least Squares Regression

For some data, one can improve model fit by including a scale factor (weight) in the deviation function. Weighted least squares minimizes

$$\sum_{i=1}^{n} w_i (y_i - \beta' x_i)^2,$$

where $w_i$ are weights that determine how much influence each response will have on the final regression. With the weights in the model, we estimate $\beta$ in the linear model with

$$\hat{\beta} = (X'WX)^{-1} X'Wy,$$

where $X$ is the design matrix made up of the vectors $x_i$, $y$ is the response vector, and $W$ is a diagonal matrix of the weights $w_1, \ldots, w_n$. This can be especially helpful if the responses seem not to have constant variances. Weights that counter the effect of heteroskedasticity, such as
work well if your data contain a lot of replicates; here $m$ is the number of replicates at $y_i$. To compute this in MATLAB, the function lscov computes least-squares estimates with known covariance; for example, the output of lscov(A,B,W) returns the weighted least squares solution to the linear system AX = B with diagonal weight matrix W.

12.3.5 Least Median Squares Regression

The least median of squares (LMS) regression finds the line through the data that minimizes the median (rather than the mean) of the squares of the errors. While the LMS method is proven to be robust, it cannot be easily solved like a weighted least-squares problem. The solution must be found by searching in the space of possible estimates generated from the data, which is usually too large to do analytically. Instead, randomly chosen subsets of the data are used so that an approximate solution can be computed without too much trouble. The MATLAB function lmsreg(y,X) computes the LMS for small or medium sized data sets.

Example 12.3 Star Data. Data from Rousseeuw and Leroy (1987), p. 27, Table 3, are given in all panels of Figure 12.3 as a scatter plot of temperature versus light intensity for 47 stars. The first variable is the logarithm of the effective temperature at the surface of the star (Te) and the second one is the logarithm of its light intensity (L/L0). In sequence, the four panels in Figure 12.3 show plots of the bivariate data with fitted regressions based on (a) Least Squares, (b) Least Absolute Residuals, (c) Huber Loss and Least Trimmed Squares, and (d) Least Median Squares. Observations far away from most of the other observations are called leverage points; in this example, only the Least Median Squares approach works well because of the effect of the leverage points.

>> stars = load('stars.txt'); n = size(stars,1);
>> X = [ones(n,1) stars(:,2)]; y = stars(:,3);
>> bols = X\y; [ignore,idx] = sort(stars(:,2));
>> plot(stars(:,2),stars(:,3),'o',stars(idx,2),...
     X(idx,:)*bols,'-.')
   legend('Data','OLS')
>> %
>> % Least Absolute Deviation
>> blad = medianregress(stars(:,2),stars(:,3));
>> plot(stars(:,2),stars(:,3),'o',stars(idx,2),...
     X(idx,:)*bols,'-.',stars(idx,2),X(idx,:)*blad,'-.')
   legend('Data','OLS','LAD');
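The random-subset idea behind LMS regression described above can be sketched in a few lines. The block below is written in Python rather than the book's MATLAB and is not the lmsreg function: the name `lms_line`, the number of trials, and the toy data are all assumptions made for illustration. Each trial draws two points, takes the line through them as a candidate, and scores it by the median squared residual over all of the data.

```python
import random
from statistics import median

def lms_line(x, y, n_trials=2000, seed=0):
    """Approximate LMS line from random two-point subsets (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(x)
    best = None
    for _ in range(n_trials):
        i, j = rng.sample(range(n), 2)
        if x[i] == x[j]:
            continue  # a vertical candidate line has no slope; skip it
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        # Criterion: median (not mean) of the squared residuals.
        score = median((yk - intercept - slope * xk) ** 2 for xk, yk in zip(x, y))
        if best is None or score < best[0]:
            best = (score, intercept, slope)
    return best[1], best[2], best[0]

# Hypothetical data: ten points exactly on y = 1 + 2x plus three leverage points.
x = list(range(10)) + [20, 21, 22]
y = [1 + 2 * xi for xi in range(10)] + [0.0, 0.0, 0.0]
b0, b1, med = lms_line(x, y)
```

Because more than half of these toy observations lie exactly on one line, any candidate through two of them has median squared residual zero, so the three leverage points do not move the fit. With real data the search is only approximate, and more trials or larger subsets may be needed.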
Fig. 12.3 Star data with (a) OLS Regression, (b) Least Absolute Deviation, (c) Huber Estimation and Least Trimmed Squares, (d) Least Median Squares.

>> %
>> % Huber Estimation
>> k = 1.345;                    % tuning parameter in Huber's weight function
>> wgtfun = @(e) (k*(abs(e)>k)-abs(e).*(abs(e)>k))./abs(e)+1;
>> % Huber's weight function
>> wgt = rand(length(y),1);      % initial weights
>> b0 = lscov(X,y,wgt);
>> res = y - X*b0;               % raw residuals
>> res = res/(mad(res,1)/0.6745);  % standardized residuals (scale = MAD/0.6745)
>> m = 30;
>> for i = 1:m
     wgt = wgtfun(res);
     % compute the weighted estimate using these weights
     bhuber = lscov(X,y,wgt);
     if all(abs(bhuber-b0) < .01*abs(b0))   % stop with convergence
        break;
     else
        b0 = bhuber;             % keep the latest estimate for the next comparison
        res = y - X*bhuber;
        res = res/(mad(res,1)/0.6745);
     end
   end
>> plot(stars(:,2),stars(:,3),'o',stars(idx,2),X(idx,:)*bols,'-.',...
     stars(idx,2),X(idx,:)*blad,'-x',...
     stars(idx,2),X(idx,:)*bhuber,'-s')
   legend('Data','OLS','LAD','Huber');
>> %
>> % Least Trimmed Squares
>> blts = lts(stars(:,2),y);
>> plot(stars(:,2),stars(:,3),'o',stars(idx,2),X(idx,:)*bols,'-.',...
     stars(idx,2),X(idx,:)*blad,'-x',...
     stars(idx,2),X(idx,:)*bhuber,'-s',stars(idx,2),...
     X(idx,:)*blts,'-+')
   legend('Data','OLS','LAD','Huber','LTS');
>> %
>> % Least Median Squares
>> [LMSout,blms,Rsq] = LMSreg(y,stars(:,2));
>> plot(stars(:,2),stars(:,3),'o',stars(idx,2),X(idx,:)*bols,'-.',...
     stars(idx,2),X(idx,:)*blad,'-x',...
     stars(idx,2),X(idx,:)*bhuber,'-s',stars(idx,2),...
     X(idx,:)*blts,'-+',stars(idx,2),X(idx,:)*blms,'-d')
   legend('Data','OLS','LAD','Huber','LTS','LMS');

Example 12.4 Anscombe's Four Regressions. A celebrated example of the role of residual analysis and statistical graphics in statistical modeling was created by Anscombe (1973). He constructed four different data sets $(X_i, Y_i)$, $i = 1, \dots, 11$, that share the same descriptive statistics ($\bar X$, $\bar Y$, $\hat\beta_0$, $\hat\beta_1$, MSE, $R^2$, $F$) necessary to establish the linear regression fit $\hat Y = \hat\beta_0 + \hat\beta_1 X$. The following statistics are common for the four data sets:

Sample size N                 11
Mean of X ($\bar X$)          9
Mean of Y ($\bar Y$)          7.5
Intercept ($\hat\beta_0$)     3
Slope ($\hat\beta_1$)         0.5
Estimator of $\sigma$ ($s$)   1.2366
Correlation $r_{X,Y}$         0.816

From inspection, one can ascertain that a linear model is appropriate for Data Set 1, but the scatter plots and residual analysis suggest that Data Sets 2-4 are not amenable to linear modeling. Plotted with the data are the lines for least-squares fit (dotted) and rank regression (solid line). See Exercise 12.1 for further examination of the three regression archetypes.

Data Set 1
X: 10  8  13  9  11  14  6  4  12  7  5
Y: 8.04  6.95  7.58  8.81  8.33  9.96  7.24  4.26  10.84  4.82  5.68

Data Set 2
X: 10  8  13  9  11  14  6  4  12  7  5
Y: 9.14  8.14  8.74  8.77  9.26  8.10  6.13  3.10  9.13  7.26  4.74

Data Set 3
X: 10  8  13  9  11  14  6  4  12  7  5
Y: 7.46  6.77  12.74  7.11  7.81  8.84  6.08  5.39  8.15  6.42  5.73

Data Set 4
X: 8  8  8  8  8  8  8  19  8  8  8
Y: 6.58  5.76  7.71  8.84  8.47  7.04  5.25  12.50  5.56  7.91  6.89

12.4 ISOTONIC REGRESSION

In this section we consider bivariate data that satisfy an order or restriction in functional form. For example, if $Y$ is known to be a decreasing function of $X$, a simple linear regression need only consider values of the slope parameter $\beta_1 < 0$. If we have no linear model, however, there is nothing in the empirical bivariate model to ensure such a constraint is satisfied. Isotonic regression considers a restricted class of estimators without the use of an explicit regression model.

Consider the dental study data in Table 12.16, which was used to illustrate isotonic regression by Robertson, Wright, and Dykstra (1988). The data are originally from a study of dental growth measurements of the distance (mm) from the center of the pituitary gland to the pterygomaxillary fissure (referring to the bone in the lower jaw) for 11 girls between the ages of 8 and 14. It is assumed that PF increases with age, so the regression of PF on age is nondecreasing. But it is also assumed that the relationship between PF and age is not necessarily linear. The means (or medians, for that matter) are not strictly increasing in the PF data. Least squares regression does yield an increasing function for PF: $\hat Y = 0.065X + 21.89$, but the function is nearly flat and not altogether well-suited to the data.

For an isotonic regression, we impose some order of the response as a function of the regressors.

Definition 12.1 If the regressors have a simple order $x_1 \le \dots \le x_n$, a function $f$ is isotonic with respect to $x$ if $f(x_1) \le \dots \le f(x_n)$. For our purposes, isotonic will be synonymous with monotonic. For some function $g$ of $X$, we call the function $g^*$ an isotonic regression of $g$ with weights $w$ if and
only if $g^*$ is isotonic (i.e., retains the necessary order) and minimizes

$\sum_{i=1}^{n} w(x_i)\,(g(x_i) - f(x_i))^2$   (12.3)

in the class of all isotonic functions $f$.

Fig. 12.4 Anscombe's regressions: LS and Robust.

12.4.1 Graphical Solution to Regression

We can create a simple graph to show how the isotonic regression can be solved. Let $W_k = \sum_{i=1}^{k} w(x_i)$ and $G_k = \sum_{i=1}^{k} g(x_i) w(x_i)$. In the example, the means are ordered, so $f(x_i) = \mu_i$ and $w_i = n_i$, the number of observations at each age group. We let $g$ be the set of PF means, and the plot of $W_k$ versus $G_k$, called the cumulative sum diagram (CSD), shows that the empirical relationship between PF and age is not isotonic.

Table 12.16 Size of Pituitary Fissure for Subjects of Various Ages.

Age     8               10            12             14
PF      21, 23.5, 23    24, 21, 25    21.5, 22, 19   23.5, 25
Mean    22.50           23.33         20.83          24.25
PAVA    22.22           22.22         22.22          24.25

Define $G^*$ to be the greatest convex minorant (GCM), which represents the largest convex function that lies below the CSD. You can envision $G^*$ as a taut string tied to the leftmost observation $(W_1, G_1)$ and pulled up and under the CSD, ending at the last observation. The example in Figure 12.5(a) shows that the GCM for the nine observations touches only four of them in forming a tight convex bowl around the data.

Fig. 12.5 (a) Greatest convex minorant based on nine observations. (b) Greatest convex minorant for dental data.

The GCM represents the isotonic regression. The reasoning follows below (and in the theorem that follows). Because $G^*$ is convex, it is left differentiable at $W_i$. Let $g^*(x_i)$ be the left-derivative of $G^*$ at $W_i$. If the graph of the GCM is under the graph of the CSD at $W_i$, the slopes of the GCM to the left and right of $W_i$ remain the same, i.e., if $G^*(W_i)$ ...

... $g(x_i)$ for some $i$, then $g$ is not isotonic. To construct an isotonic $g^*$, take the first such pair and replace them with the weighted average

$\bar g_i = \bar g_{i-1} = \dfrac{w(x_i) g(x_i) + w(x_{i-1}) g(x_{i-1})}{w(x_i) + w(x_{i-1})}$.

Replace the weights $w(x_i)$ and $w(x_{i-1})$ with $w(x_i) + w(x_{i-1})$. If this correction (replacing $g$ with $\bar g$) makes the regression isotonic, we are finished. Otherwise, we repeat this process with $\bar g$ until an isotonic $\bar g$ is set. This is called the Pool Adjacent Violators Algorithm or PAVA.

Example 12.5 In Table 12.16, there is a decrease in PF between ages 10 and 12, which violates the assumption that pituitary fissure increases with age. Once we replace the PF averages by the average over both age groups (22.083), we still lack monotonicity because the PF average for girls of age 8 was 22.5. Consequently, these two categories, which now comprise three age groups, are averaged. The final averages are listed in the bottom row of Table 12.16.

12.5 GENERALIZED LINEAR MODELS

Assume that $n$ $(p+1)$-tuples $(y_i, x_{1i}, x_{2i}, \dots, x_{pi})$, $i = 1, \dots, n$, are observed. The values $y_i$ are responses and components of vectors $x_i = (x_{1i}, x_{2i}, \dots, x_{pi})'$ are predictors. As we discussed at the beginning of this chapter, the standard theory of linear regression considers the model

$Y = X\beta + \epsilon$,   (12.4)

where $Y = (Y_1, \dots, Y_n)$ is the response vector, $X = (1_n\ x_1\ x_2\ \dots\ x_p)$ is the design matrix ($1_n$ is a column vector of $n$ 1's), and $\epsilon$ is a vector of errors consisting of $n$ i.i.d. normal $N(0, \sigma^2)$ random variables. The variance $\sigma^2$ is common for all $Y_i$'s and independent of predictors or the order of observation. The parameter $\beta$ is a vector of $(p+1)$ parameters in the linear relationship

$E Y_i = x_i'\beta = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi}$.

Fig. 12.6 (a) Peter McCullagh and (b) John Nelder.

The term generalized linear model (GLM) refers to a large class of models, introduced by Nelder and Wedderburn (1972) and popularized by McCullagh and Nelder (1994), Figure 12.6(a-b). In a canonical GLM, the response variable $Y_i$ is assumed to follow an exponential family distribution with mean $\mu_i$, which is assumed to be a function of $x_i'\beta$. This dependence can be nonlinear, but the distribution of $Y_i$ depends on covariates only through their linear combination, $\eta_i = x_i'\beta$, called a linear predictor. As in linear regression, the epithet linear refers to being linear in parameters, not in the explanatory variables. Thus, for example, the linear combination

$\beta_0 + \beta_1 x_1 + \beta_2 x_2^2 + \beta_3 \log(x_1 + x_2) + \beta_4 x_1 \cdot x_2$

is a perfect linear predictor. What is generalized in the model given in (12.4) by a GLM? The three main generalizations concern the distributions of responses, the dependence of response on linear predictor, and the variance of the error.

1. Although the $Y_i$'s remain independent, their (common) distribution is generalized. Instead of normal, their distribution is selected from the exponential family of distributions (see Chapter 2). This family is quite versatile and includes normal, binomial, Poisson, negative binomial, and gamma as special cases.
2. In the linear model (12.4) the mean of $Y_i$, $\mu_i = E Y_i$, was equal to $x_i'\beta$. The mean $\mu_i$ in GLM depends on the predictor $\eta_i = x_i'\beta$ as

$g(\mu_i) = \eta_i \ (= x_i'\beta)$.   (12.5)

The function $g$ is called the link function. For the model (12.4), the link is the identity function.

3. The variance of $Y_i$ was constant in (12.4). In GLM it may not be constant and could depend on the mean $\mu_i$.

Models and inference for categorical data, traditionally a non-parametric topic, are unified by a larger class of models which are parametric in nature and that are special cases of GLM. For example, in contingency tables, the cell counts $N_{ij}$ could be modeled by a multinomial $Mn(n, \{p_{ij}\})$ distribution. The standard hypothesis in contingency tables concerns the independence of row/column factors. This is equivalent to testing $H_0: p_{ij} = \alpha_i \beta_j$ for some unknown $\alpha_i$ and $\beta_j$ such that $\sum_i \alpha_i = \sum_j \beta_j = 1$. The expected cell count is $E N_{ij} = n p_{ij}$, which under $H_0$ becomes $E N_{ij} = n \alpha_i \beta_j$. By taking the logarithm of both sides one obtains

$\log E N_{ij} = \log n + \log \alpha_i + \log \beta_j = \text{const} + a_i + b_j$,

for some parameters $a_i$ and $b_j$. Thus, the test of goodness of fit for this model, linear and additive in parameters $a$ and $b$, is equivalent to the test of the original independence hypothesis $H_0$ in the contingency table. More such examples will be discussed in Chapter 18.

12.5.1 GLM Algorithm

The algorithms for fitting generalized linear models are robust and well established (see Nelder and Wedderburn (1972) and McCullagh and Nelder (1994)). The maximum likelihood estimates of $\beta$ can be obtained using iterative weighted least-squares (IWLS).

(i) Given the vector $\hat\mu^{(k)}$, the initial value of the linear predictor $\hat\eta^{(k)}$ is formed using the link function, and components of the adjusted dependent variate (working response), $z_i^{(k)}$, can be formed as

$z_i^{(k)} = \hat\eta_i^{(k)} + \left(y_i - \hat\mu_i^{(k)}\right) \left(\dfrac{d\eta}{d\mu}\right)_i^{(k)}$,

where the derivative is evaluated at the available $k$th value.
(ii) The quadratic (working) weights, $W_i^{(k)}$, are defined so that

$\dfrac{1}{W_i^{(k)}} = \left(\dfrac{d\eta}{d\mu}\right)_i^{2} V_i^{(k)}$,

where $V$ is the variance function evaluated at the initial values.

(iii) The working response $z^{(k)}$ is then regressed onto the covariates $x_i$, with weights $W_i^{(k)}$, to produce new parameter estimates, $\hat\beta^{(k+1)}$. This vector is then used to form new estimates $\hat\eta^{(k+1)} = X\hat\beta^{(k+1)}$ and $\hat\mu^{(k+1)} = g^{-1}(\hat\eta^{(k+1)})$.

We repeat iterations until changes become sufficiently small. Starting values are obtained directly from the data, using $\hat\mu^{(0)} = y$, with occasional refinements in some cases (for example, to avoid evaluating $\log 0$ when fitting a log-linear model with zero counts). By default, the scale parameter should be estimated by the mean deviance, $n^{-1} \sum_{i=1}^{n} D(y_i, \hat\mu)$, from p. 44 in Chapter 3, in the case of the normal and gamma distributions.

12.5.2 Links

In the GLM the predictors for $Y_i$ are summarized as the linear predictor $\eta_i = x_i'\beta$. The link function is a monotone differentiable function $g$ such that $\eta_i = g(\mu_i)$, where $\mu_i = E Y_i$. We already mentioned that in the normal case $\mu = \eta$ and the link is identity, $g(\mu) = \mu$.

Example 12.6 For analyzing count data (e.g., contingency tables), the Poisson model is standardly assumed. As $\mu > 0$, the identity link is inappropriate because $\eta$ could be negative. However, if $\mu = e^{\eta}$, then the mean is always positive, and $\eta = \log(\mu)$ is an adequate link.

A link is called natural if it connects $\theta$ (the natural parameter in the exponential family of distributions) and $\mu$. In the Poisson case, $\mu = \lambda$ and $\theta = \log \mu$. Accordingly, the log is the natural link for the Poisson distribution.

Example 12.7 For the binomial distribution,

$f(y \mid \pi) = \binom{n}{y} \pi^{y} (1 - \pi)^{n-y}$
can be represented as

$f(y \mid \pi) = \exp\left\{ y \log\dfrac{\pi}{1-\pi} + n \log(1-\pi) + \log\binom{n}{y} \right\}$.

The natural link $\eta = \log(\pi/(1-\pi))$ is called the logit link. With the binomial distribution, several more links are commonly used. Examples are the probit link $\eta = \Phi^{-1}(\pi)$, where $\Phi$ is the standard normal CDF, and the complementary log-log link with $\eta = \log\{-\log(1-\pi)\}$. For these three links, the probability $\pi$ of interest is expressed as $\pi = e^{\eta}/(1+e^{\eta})$, $\pi = \Phi(\eta)$, and $\pi = 1 - \exp\{-e^{\eta}\}$, respectively.

When data $y_i$ from the exponential family are expressed in grouped form (from which an average is considered as the group response), then the distribution for $Y_i$ takes the form (12.6). The weights $w_i$ are equal to 1 if individual responses are considered, $w_i = n_i$ if response $y_i$ is an average of $n_i$ responses, and $w_i = 1/n_i$ if the sum of $n_i$ individual responses is considered. The variance of $Y_i$ then takes the form

12.5.3 Deviance Analysis in GLM

In GLM, the goodness of fit of a proposed model can be assessed in several ways. The customary measure is the deviance statistic. For a data set with $n$ observations, assume the dispersion $\phi$ is known and equal to 1, and consider the two extreme models, the single parameter model stating $E Y_i = \hat\mu$ and the $n$ parameter saturated model setting $E Y_i = \hat\mu_i = Y_i$. Most likely, the interesting model is between the two extremes. Suppose $M$ is the interesting model with 1 ...

>> infection = [1 11 0 0 28 23 8 0];
>> total = [18 98 2 0 58 26 40 9];
>> proportion = infection./total;
>> noplan = [0 1 0 1 0 1 0 1];
>> riskfac = [1 1 0 0 1 1 0 0];
>> antibio = [1 1 1 1 0 0 0 0];
>> [logitCoef2,dev] = glmfit([noplan' riskfac' antibio'],...
     [infection' total'],'binomial','logit');
>> logitFit = glmval(logitCoef2,[noplan' riskfac' antibio'],'logit');
>> plot(1:8,proportion,'ks',1:8,logitFit,'ko');

The scaled deviance of this model is distributed as $\chi^2_3$. The number of degrees of freedom is equal to 8 (the length $n$ of the vector infection) minus 5 for the five estimated parameters, $\beta_0, \beta_1, \beta_2, \beta_3, \phi$. The deviance dev = 11 is significant, yielding a p-value of 1 - chi2cdf(11,3) = 0.0117. The additive model (with no interactions) in MATLAB yields

$\log \dfrac{P(\text{infection})}{P(\text{no infection})} = \beta_0 + \beta_1\,\text{noplan} + \beta_2\,\text{riskfac} + \beta_3\,\text{antibio}$.

The estimators of $(\beta_0, \beta_1, \beta_2, \beta_3)$ are, respectively, $(-1.89, 1.07, 2.03, -3.25)$. The interpretation of the estimators is made more clear if we look at the odds ratio

$\dfrac{P(\text{infection})}{P(\text{no infection})} = e^{\beta_0} \cdot e^{\beta_1 \text{noplan}} \cdot e^{\beta_2 \text{riskfac}} \cdot e^{\beta_3 \text{antibio}}$.

At the
value antibio = 1, the odds of infection versus no infection are multiplied by the factor exp(-3.25) = 0.0388, which is a decrease of more than 25 times. Figure 12.7 shows the observed proportions of infections for the 8 combinations of covariates (noplan, riskfac, antibio) marked by squares and model-predicted probabilities for the same combinations marked by circles. We will revisit this example in Chapter 18; see Example 18.5.

12.6 EXERCISES

12.1. Using robust regression, find the intercept and slope $\tilde\beta_0$ and $\tilde\beta_1$ for each of the four data sets of Anscombe (1973) from p. 226. Plot the ordinary least squares regression along with the rank regression estimator of slope. Contrast these with one of the other robust regression techniques. For which set does $\tilde\beta_1$ differ the most from its LS counterpart $\hat\beta_1 = 0.5$? Note that in the fourth set, 10 out of 11 $X$s are equal, so one should use $S_{ij} = (Y_j - Y_i)/(X_j - X_i + \epsilon)$ to avoid dividing by 0. After finding $\tilde\beta_0$ and $\tilde\beta_1$, are they different from $\hat\beta_0$ and $\hat\beta_1$? Is the hypothesis $H_0: \beta_1 = 1/2$ rejected in a robust test against the alternative $H_1$: ...

... 0. Nearest neighbor estimators use the span produced by a fixed number of design points that are closest to $x$.

13.1 KERNEL ESTIMATORS

Let $K(x)$ be a real-valued function for assigning local weights to the linear estimator, that is,

If $K(u) \propto 1(|u| \le 1)$ then a fitted curve based on $K\left(\frac{x_i - x}{h}\right)$ will estimate $m(x)$ using only design points within $h$ units of $x$. Usually it is assumed that $\int_{\mathbb{R}} K(x)\,dx = 1$, so any bounded probability density could serve as a kernel. Unlike kernel functions used in density estimation, now $K(x)$ also can take negative values, and in fact such unrestricted kernels are needed to achieve optimal estimators in the asymptotic sense. An example is the beta kernel defined as

$K(x) = \dfrac{1}{B(1/2, \gamma+1)} (1 - x^2)^{\gamma}\, 1(|x| \le 1), \quad \gamma = 0, 1, 2, \dots$   (13.1)

With the added parameter $\gamma$, the beta kernel is remarkably flexible. For $\gamma = 0$, the beta kernel becomes uniform. If $\gamma = 1$ we get the Epanechnikov kernel, $\gamma = 2$ produces the biweight kernel, $\gamma = 3$ the triweight, and so on (see Figure 11.4 on p. 209). For $\gamma$ large enough, the beta kernel is close to the Gaussian kernel

$K(x) = \dfrac{1}{\sigma}\,\phi\!\left(\dfrac{x}{\sigma}\right)$,

with $\sigma^2 = 1/(2\gamma + 3)$, which is the variance of densities from (13.1). For example, if $\gamma = 10$, then $\int_{-1}^{1} \left(K(x) - \sigma^{-1}\phi(x/\sigma)\right)^2 dx = 0.00114$, where $\sigma = 1/\sqrt{23}$. Define a scaling coefficient $h$ so that

$K_h(x) = \dfrac{1}{h} K\!\left(\dfrac{x}{h}\right)$,   (13.2)

where $h$ is the associated bandwidth. By increasing $h$, the kernel
function spreads weight away from its center, thus giving less weight to those data points close to $x$ and sharing the weight more equally with a larger group of design points. A family of beta kernels and the Epanechnikov kernel are given in Figure 13.2.

Fig. 13.2 (a) A family of symmetric beta kernels; (b) $K(x) = \frac{1}{2}\exp\{-|x|/\sqrt{2}\}\,\sin(|x|/\sqrt{2} + \pi/4)$.

13.1.1 Nadaraya-Watson Estimator

Nadaraya (1964) and Watson (1964) independently published the earliest results on smoothing functions (but this is debatable), and the Nadaraya-Watson Estimator (NWE) of $m(x)$ is defined as

$\hat m(x) = \dfrac{\sum_{i=1}^{n} K_h(X_i - x)\, Y_i}{\sum_{i=1}^{n} K_h(X_i - x)}$.   (13.3)

For $x$ fixed, the value $\hat\theta$ that minimizes

$\sum_{i=1}^{n} (Y_i - \theta)^2 K_h(X_i - x)$   (13.4)

is of the form $\sum_{i=1}^{n} a_i Y_i$. The Nadaraya-Watson estimator is the minimizer of (13.4) with $a_i = K_h(X_i - x)/\sum_{i=1}^{n} K_h(X_i - x)$.

Although several competing kernel-based estimators have been derived since, the NWE provided the basic framework for kernel estimators, including local polynomial fitting, which is described later in this section. The MATLAB function

nada_wat(x0, X, Y, bw)

computes the Nadaraya-Watson kernel estimate at x = x0. Here, (X, Y) are input data, and bw is the bandwidth.

Example 13.1 Noisy pairs $(X_i, Y_i)$, $i = 1, \dots, 200$, are generated in the following way:

>> x = sort(rand(1,200));
>> y = sort(rand(1,200));
>> y = sin(4*pi*y) + 0.9*randn(1,200);

Three bandwidths are selected: h = 0.015, 0.030, and 0.060. The three Nadaraya-Watson Estimators are shown in Figure 13.3. As expected, the estimators constructed with the larger bandwidths appear smoother than those with smaller bandwidths.

Fig. 13.3 Nadaraya-Watson Estimators for different values of bandwidth.

13.1.2 Gasser-Müller Estimator

The Gasser-Müller estimator proposed in 1979 uses areas of the kernel for the weights. Suppose the $X_i$ are ordered, $X_1 \le X_2 \le \dots \le X_n$. Let $X_0 = -\infty$ and $X_{n+1} = \infty$ and define midpoints $s_i = (X_i + X_{i+1})/2$. Then

$\hat m(x) = \sum_{i=1}^{n} Y_i \int_{s_{i-1}}^{s_i} K_h(u - x)\,du$.   (13.5)

The Gasser-Müller estimator is the minimizer of (13.4) with the weights $a_i = \int_{s_{i-1}}^{s_i} K_h(u - x)\,du$.
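As a small illustration of the Nadaraya-Watson weights $a_i = K_h(X_i - x)/\sum_i K_h(X_i - x)$, the sketch below computes the estimate at a single point. It is written in Python rather than the book's MATLAB and is not the nada_wat function; the function names and the noise-free toy data are assumptions. It uses the Epanechnikov kernel, which is the beta kernel (13.1) with $\gamma = 1$, where $1/B(1/2, 2) = 3/4$.

```python
import math

def epanechnikov(u):
    """Beta kernel (13.1) with gamma = 1: 0.75 * (1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def nadaraya_watson(x0, xs, ys, h, kernel=epanechnikov):
    """Kernel-weighted average of ys at x0; the 1/h factor of K_h cancels in the ratio."""
    w = [kernel((xi - x0) / h) for xi in xs]
    total = sum(w)
    if total == 0.0:
        raise ValueError("no design points within the bandwidth of x0")
    return sum(wi * yi for wi, yi in zip(w, ys)) / total

# Hypothetical noise-free design: m(x) = sin(2*pi*x) on an even grid in [0, 1].
xs = [i / 50 for i in range(51)]
ys = [math.sin(2 * math.pi * xi) for xi in xs]
fit_narrow = nadaraya_watson(0.25, xs, ys, h=0.05)  # stays close to the peak value 1
fit_wide = nadaraya_watson(0.25, xs, ys, h=0.50)    # averages over much of the cycle
```

On this toy curve the narrow bandwidth stays close to the true value at the peak, while the wide bandwidth pulls in design points from the descending part of the curve and flattens the estimate.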
13.1.3 Local Polynomial Estimator

Both Nadaraya-Watson and Gasser-Müller estimators are local constant fit estimators, that is, they minimize the weighted squared error $\sum_{i=1}^{n} (Y_i - \theta)^2 w_i$ for different values of weights $w_i$. Assume that for $z$ in a small neighborhood of $x$ the function $m(z)$ can be well approximated by a polynomial of order $p$:

$m(z) \approx \sum_{j=0}^{p} \beta_j (z - x)^j$,

where $\beta_j = m^{(j)}(x)/j!$. Instead of minimizing (13.4), the local polynomial (LP) estimator minimizes

$\sum_{i=1}^{n} \left( Y_i - \sum_{j=0}^{p} \beta_j (X_i - x)^j \right)^2 K_h(X_i - x)$   (13.6)

over $\beta_0, \dots, \beta_p$. Assume, for a fixed $x$, that $\hat\beta_j$, $j = 0, \dots, p$, minimize (13.6). Then $\hat m(x) = \hat\beta_0$, and an estimator of the $j$th derivative of $m$ is

$\hat m^{(j)}(x) = j!\,\hat\beta_j, \quad j = 0, 1, \dots, p$.   (13.7)

If $p = 0$, that is, if the polynomials are constants, the local polynomial estimator is Nadaraya-Watson. It is not clear that the estimator $\hat m(x)$ for general $p$ is a locally weighted average of responses (of the form $\sum_{i=1}^{n} a_i Y_i$), as are the Nadaraya-Watson and Gasser-Müller estimators. The following representation of the LP estimator makes its calculation easy via the weighted least squares problem. Consider the $n \times (p+1)$ matrix depending on $x$ and $X_i - x$, $i = 1, \dots, n$:

$X = \begin{pmatrix} 1 & X_1 - x & (X_1 - x)^2 & \dots & (X_1 - x)^p \\ 1 & X_2 - x & (X_2 - x)^2 & \dots & (X_2 - x)^p \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_n - x & (X_n - x)^2 & \dots & (X_n - x)^p \end{pmatrix}$.

Define also the diagonal weight matrix $W$ and response vector $Y$:

$W = \mathrm{diag}\left(K_h(X_1 - x), \dots, K_h(X_n - x)\right), \quad Y = (Y_1, \dots, Y_n)'$.

Then the minimization problem can be written as $(Y - X\beta)' W (Y - X\beta)$. The solution is well known: $\hat\beta = (X'WX)^{-1} X'WY$. Thus, if $(a_1\ a_2\ \dots\ a_n)$ is the first row of the matrix $(X'WX)^{-1} X'W$, then $\hat m(x) = a \cdot Y = \sum_i a_i Y_i$. This representation (in matrix form) provides an efficient and elegant way to calculate the LP regression estimator. In MATLAB, use the function

lpfit(x, y, p, h),

where (x, y) is the input data, p is the order and h is the bandwidth.

For general $p$, the first row $(a_1\ a_2\ \dots\ a_n)$ of $(X'WX)^{-1} X'W$ is quite complicated. Yet, for $p = 1$ (the local linear estimator), the expression for $\hat m(x)$ simplifies to

$\hat m(x) = \dfrac{\sum_{i=1}^{n} \left(S_2 - S_1 (X_i - x)\right) K_h(X_i - x)\, Y_i}{S_2 S_0 - S_1^2}$,

where $S_j = \sum_{i=1}^{n} (X_i - x)^j K_h(X_i - x)$, $j = 0, 1$, and 2. This estimator is implemented in MATLAB by the function loc_lin.m.

13.2 NEAREST NEIGHBOR METHODS

As an alternative to kernel estimators, nearest neighbor estimators define points local to $X_i$ not through a kernel bandwidth, which is a fixed strip along the x-axis, but instead through a set of points closest to $X_i$. For example, a neighborhood for $x$ might be defined to be the closest $k$ design points on either side of $x$, where $k$ is a positive integer such that $k \le n/2$. Nearest neighbor methods make sense if we have spaces with clustered design points followed by intervals with sparse design points. The nearest neighbor estimator will increase its span if the design points are spread out. There is added complexity, however, if the data include repeated design points. For purposes of illustration, we will assume this is not the case in our examples.

Nearest neighbor and kernel estimators produce similar results, in general. In terms of bias and variance, the nearest neighbor estimator described in this section performs well if the variance decreases more than the squared bias increases (see Altman, 1992).

13.2.1 LOESS

William Cleveland (1979), Figure 13.4(a), introduced a curve fitting regression technique called LOWESS, which stands for locally weighted regression scatter plot smoothing. Its derivative, LOESS [1], stands more generally for a local regression, but many researchers consider LOWESS and LOESS as synonyms.

[1] Term actually defined by geologists as deposits of fine soil that are highly susceptible to wind erosion. We will stick with our less silty mathematical definition in this chapter.
Fig. 13.4 (a) William S. Cleveland, Purdue University; (b) Geological Loess.

Consider a multiple linear regression setup with a set of regressors $X_i = (X_{i1}, \dots, X_{ik})$ to predict $Y_i$, $i = 1, \dots, n$. Let $Y = f(X_1, \dots, X_k) + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$. Adjacency of the regressors is defined by a distance function $d(X, X^*)$. For $k = 2$, if we are fitting a curve at $(X_{r1}, X_{r2})$ with $1 \le r \le n$, then for $i = 1, \dots, n$,

$d_i = \sqrt{(X_{i1} - X_{r1})^2 + (X_{i2} - X_{r2})^2}$.

Each data point influences the regression at $(X_{r1}, X_{r2})$ according to its distance to that point. In the LOESS method, this is done with a tri-cube weight function

$w_i = \left(1 - \left(\dfrac{d_i}{d_q}\right)^3\right)^3 1(d_i \le d_q)$,

where only $q$ of the $n$ points closest to $X_r$ are considered to be "in the neighborhood" of $X_r$, and $d_q$ is the distance of the furthest $X_i$ that is in the neighborhood. Actually, many other weight functions can serve just as well as the tri-cube function; requirements for $w_i$ are discussed in Cleveland (1979). If $q$ is large, the LOESS curve will be smoother but less sensitive to nuances in the data. As $q$ decreases, the fit looks more like an interpolation of the data, and the curve is zig-zaggy. Usually, $q$ is chosen so that $0.10 \le q/n \le 0.25$.

Within the window of observations in the neighborhood of $X$, we construct the LOESS curve $Y(X)$ using either linear regression (called first order) or quadratic (second order). There are great advantages to this curve estimation scheme. LOESS does not require a specific function to fit the model to the data; only a smoothing parameter ($\alpha = q/n$) and local polynomial (first or second order) are required.
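To make the recipe concrete, here is a minimal sketch of a first-order LOESS fit at a single point with one regressor. It is written in Python rather than the book's MATLAB and is not the loess function; the names `tricube` and `loess_point` and the toy data are assumptions. The sketch takes the $q$ nearest design points, assigns tri-cube weights with $d_q$ equal to the distance of the $q$th nearest point, and solves the weighted least squares line in closed form. It assumes distinct design points and $q \ge 3$, so that at least two points receive positive weight.

```python
def tricube(d, dq):
    """Tri-cube weight (1 - (d/dq)^3)^3 for d < dq, and 0 otherwise."""
    if dq <= 0 or d >= dq:
        return 0.0
    return (1.0 - (d / dq) ** 3) ** 3

def loess_point(x0, xs, ys, q):
    """First-order (local linear) LOESS fit at x0 using the q nearest design points."""
    dists = sorted(abs(xi - x0) for xi in xs)
    dq = dists[q - 1]  # distance to the q-th nearest design point
    w = [tricube(abs(xi - x0), dq) for xi in xs]
    # Weighted least squares for y = b0 + b1 * (x - x0); the fitted value at x0 is b0.
    s0 = sum(w)
    s1 = sum(wi * (xi - x0) for wi, xi in zip(w, xs))
    s2 = sum(wi * (xi - x0) ** 2 for wi, xi in zip(w, xs))
    t0 = sum(wi * yi for wi, yi in zip(w, ys))
    t1 = sum(wi * (xi - x0) * yi for wi, xi, yi in zip(w, xs, ys))
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)

# Hypothetical data on an exact line; q/n = 5/20 = 0.25 is inside the usual range.
xs_lin = [float(i) for i in range(20)]
ys_lin = [3.0 + 0.5 * xi for xi in xs_lin]
fit = loess_point(7.3, xs_lin, ys_lin, q=5)
```

A local linear fit reproduces a straight line exactly, which gives a quick sanity check; on curved data the choice of $q$ controls how much of the curvature survives.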
Given that complex functions can be modeled with such a simple precept, the LOESS procedure is popular for constructing a regression equation with cloudy, multidimensional data. On the other hand, LOESS requires a large data set in order for the curve-fitting to work well. Unlike least-squares regression (and, for that matter, many non-linear regression techniques), the LOESS curve does not give the user a simple math formula to relate the regressors to the response. Because of this, one of the most valuable uses of LOESS is as an exploratory tool. It allows the practitioner to visually check the relationship between a regressor and response no matter how complex or convoluted the data appear to be.

In MATLAB, use the function

loess(x,y,newx,a,b)

where x and y represent the bivariate data (vectors), newx is the vector of fitted points, a is the smoothing parameter (usually 0.10 or 0.25), and b is the order of polynomial (1 or 2). loess produces an output equal to newx.

Example 13.2 Consider the motorcycle accident data found in Schmidt, Mattern, and Schuler (1981). The first column is time, measured in milliseconds, after a simulated impact of a motorcycle. The second column is the acceleration factor of the driver's head (accel), measured in g (9.8 m/s^2). Time versus accel is graphed in Figure 13.5. The MATLAB code below creates a LOESS curve to model acceleration as a function of time (also in the figure). Note how the smoothing parameter influences the fit of the curve.

>> load motorcycle.dat
>> time = motorcycle(:,1);
>> accel = motorcycle(:,2);
>> loess(time, accel, newx, 0.20, 1);
>> plot(time,accel,'o');
>> hold on
>> plot(time,newx,'-');

For regression with two regressors (x, y), use the MATLAB function

loess2(x,y,z,newx,newy,a,b)

that contains inputs (x, y, z) and creates a surface fit in (newx, newy).

13.3 VARIANCE ESTIMATION

In constructing confidence intervals for $m(x)$, the variance estimate based on the smooth linear regression (with pooled-variance estimate) will produce the
narrowest interval. But if the estimate is biased, the confidence interval will have poor coverage probability. An estimator of $m(x)$ based only on points near $x$ will produce a poor estimate of variance, and as a result is apt to generate wide, uninformative intervals.

Fig. 13.5 Loess curve-fitting for Motorcycle Data using (a) $\alpha = 0.05$, (b) $\alpha = 0.20$, (c) $\alpha = 0.50$, and (d) $\alpha = 0.80$.

One way to avoid the worst pitfalls of these two extremes is to detrend the data locally and use the estimated variance from the detrended data. Altman and Paulson (1993) use pseudo-residuals $\tilde\epsilon_i = y_i - (y_{i+1} + y_{i-1})/2$ to form a variance estimator

$\tilde\sigma^2 = \dfrac{2}{3(n-2)} \sum_{i=2}^{n-1} \tilde\epsilon_i^2$,

where $\tilde\sigma^2/\sigma^2$ is distributed $\chi^2$ with $(n-2)/2$ degrees of freedom. Because both the kernel and nearest neighbor estimators have linear form in $y_i$, a $100(1-\alpha)\%$ confidence interval for $m(t)$ can be approximated with

where $r = (n-2)/2$.

13.4 SPLINES

spline (splīn) n. 1. A flexible piece of wood, hard rubber, or metal used in drawing curves. 2. A wooden or metal strip; a slat.
The American Heritage Dictionary

Splines, in the mathematical sense, are concatenated piecewise polynomial functions that either interpolate or approximate the scatterplot generated by $n$ observed pairs, $(X_1, Y_1), \dots, (X_n, Y_n)$. Isaac J. Schoenberg, the "father of splines," was born in Galatz, Romania, on April 21, 1903, and died in Madison, Wisconsin, USA, on February 21, 1990. The more than 40 papers on splines written by Schoenberg after 1960 gave much impetus to the rapid development of the field. He wrote the first several in 1963, during a year's leave in Princeton at the Institute for Advanced Study; the others are part of his prolific output as a member of the Mathematics Research Center at the University of Wisconsin-Madison, which he joined in 1965.

Fig. 13.6 I. J. Schoenberg (1903-1990).
13.4.1 Interpolating Splines

There are many varieties of splines. Although piecewise constant, linear, and quadratic splines are easy to construct, cubic splines are most commonly used because they have a desirable extremal property.

Denote the cubic spline function by $m(x)$. Assume $X_1, X_2, \dots, X_n$ are ordered and belong to a finite interval $[a, b]$. We will call $X_1, X_2, \dots, X_n$ knots. On each interval $[X_{i-1}, X_i]$, $i = 1, 2, \dots, n+1$, with $X_0 = a$ and $X_{n+1} = b$, the spline $m(x)$ is a polynomial of degree less than or equal to 3. In addition, these polynomial pieces are connected in such a way that the second derivatives are continuous. That means that at the knot points $X_i$, $i = 1, \dots, n$, where the two polynomials from the neighboring intervals meet, the polynomials have common tangent and curvature. We say that such functions belong to $C^2[a, b]$, the space of all functions on $[a, b]$ with continuous second derivative.

The cubic spline is called natural if the polynomial pieces on the intervals $[a, X_1]$ and $[X_n, b]$ are of degree 1, that is, linear. The following two properties distinguish natural cubic splines from other functions in $C^2[a, b]$.

Unique Interpolation. Given the $n$ pairs, $(X_1, Y_1), \dots, (X_n, Y_n)$, with distinct knots $X_i$, there is a unique natural cubic spline $m$ that interpolates the points, that is, $m(X_i) = Y_i$.

Extremal Property. Given $n$ pairs, $(X_1, Y_1), \dots, (X_n, Y_n)$, with distinct and ordered knots $X_i$, the natural cubic spline $m(x)$ that interpolates the points also minimizes the curvature on the interval $[a, b]$, where $a$ ...

>> x = [10 40 40 20 60 50 25 16 30 60 80 75 65 100];
>> y = [85 90 65 55 100 70 35 10 10 36 60 65 55 50];
>> t = 1:length(x);
>> tt = linspace(t(1),t(end),250);
>> xx = spline(t,x,tt);
>> yy = spline(t,y,tt);
>> plot(xx,yy,'-','linewidth',2), hold on
>> plot(x,y,'o','markersize',6)
>> axis('equal'), axis('off')
Fig. 13.7 A cubic spline drawing of letter V.

Example 13.4 In MATLAB, the function csapi.m computes the cubic spline interpolant, and for the following x and y,

>> x = (4*pi)*[0 1 rand(1,20)]; y = sin(x);
>> cs = csapi(x,y);
>> fnplt(cs); hold on, plot(x,y,'o')
>> legend('cubic spline','data'), hold off

the interpolation is plotted in Figure 13.8(a), along with the data. A surface interpolation by 2-d splines is demonstrated by the following MATLAB code and Figure 13.8(b).

>> x = -1:.2:1; y = -1:.25:1; [xx,yy] = ndgrid(x,y);
>> z = sin(10*(xx.^2+yy.^2)); pp = csapi({x,y},z);
>> fnplt(pp)

There are important distinctions between spline regressions and regular polynomial regressions. The latter technique is applied to regression curves where the practitioner can see an interpolating quadratic or cubic equation that locally matches the relationship between the two variables being plotted. The Stone-Weierstrass theorem (Weierstrass, 1885) tells us that any continuous function in a closed interval can be approximated well by some polynomial. While a higher order polynomial will provide a closer fit at any particular point, the loss of parsimony is not the only potential problem of overfitting; unwanted oscillations can appear between data points. Spline functions avoid this pitfall.
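Fig. 13.8 (a) Interpolating sine function; (b) Interpolating a surface.

13.4.2 Smoothing Splines

Smoothing splines, unlike interpolating splines, may not contain the points of a scatterplot, but are rather a form of nonparametric regression. Suppose we are given bivariate observations $(X_i, Y_i)$, $i = 1, \dots, n$. The continuously differentiable function $\hat m$ on $[a, b]$ that minimizes the functional

$\sum_{i=1}^{n} (Y_i - m(X_i))^2 + \lambda \int_a^b (m''(t))^2\,dt$   (13.8)

is exactly a natural cubic spline. The cost functional in (13.8) has two parts: $\sum_{i=1}^{n} (Y_i - m(X_i))^2$ is minimized by an interpolating spline, and $\int_a^b (m''(t))^2\,dt$ is minimized by a straight line. The parameter $\lambda$ trades off the importance of these two competing costs in (13.8). For small $\lambda$, the minimizer is close to an interpolating spline. For $\lambda$ large, the minimizer is closer to a straight line.

Although natural cubic smoothing splines do not appear to be related to kernel-type estimators, they can be similar in certain cases. For a value of $x$ that is away from the boundary, if $n$ is large and $\lambda$ small, let

where $f$ is the density of the $X$'s, $h_i = [\lambda/(n f(X_i))]^{1/4}$ and the kernel $K$ is

$K(x) = \dfrac{1}{2} \exp\{-|x|/\sqrt{2}\}\,\sin(|x|/\sqrt{2} + \pi/4)$.   (13.9)

As an alternative to minimizing (13.8), the following version is often used:

$p \sum_{i=1}^{n} (Y_i - m(X_i))^2 + (1 - p) \int_a^b (m''(t))^2\,dt$.   (13.10)

In this case, $\lambda = (1-p)/p$. Assume that $h$ is an average spacing between the neighboring $X$'s. An automatic choice for $p$ is $6/(6 + h^3)$, or $\lambda = h^3/6$.

Smoothing Splines as Linear Estimators. The spline estimator is linear in the observations, $\hat m = S(\lambda) Y$, for a smoothing matrix $S(\lambda)$. The Reinsch algorithm (Reinsch, 1967) efficiently calculates $S$ as

$S(\lambda) = (I + \lambda Q R^{-1} Q')^{-1}$,   (13.11)

where $Q$ and $R$ are structured (banded) matrices of dimensions $n \times (n-2)$ and $(n-2) \times (n-2)$, respectively. With columns indexed by $j = 2, \dots, n-1$, the only nonzero entries of $Q$ are $q_{j-1,j}$, $q_{jj}$, and $q_{j+1,j}$, and $R$ is symmetric and tridiagonal with entries $r_{ij}$, $i, j = 2, \dots, n-1$. The entries are

$q_{j-1,j} = \dfrac{1}{h_{j-1}}, \quad q_{jj} = -\dfrac{1}{h_{j-1}} - \dfrac{1}{h_j}, \quad q_{j+1,j} = \dfrac{1}{h_j}$,

and

$r_{ij} = \dfrac{1}{3}(h_{j-1} + h_j)$ for $i = j$, and $r_{ij} = \dfrac{1}{6} h_j$ for $i = j + 1$.

The values $h_i$ are spacings between the $X_i$'s, i.e., $h_i = X_{i+1} - X_i$, $i = 1, \dots, n-1$. For details about the Reinsch Algorithm, see Green and Silverman (1994).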
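The banded structure of $Q$ and $R$ is easier to see in a small numerical sketch. The Python block below (not the book's MATLAB) builds both matrices as dense lists for an arbitrary set of knots, using the entry formulas as given in Green and Silverman (1994), which the text cites; the function name `reinsch_bands` and the example knots are assumptions. It does not invert anything or form $S(\lambda)$ itself.

```python
def reinsch_bands(knots):
    """Build Q (n x (n-2)) and R ((n-2) x (n-2)) for ordered, distinct knots."""
    n = len(knots)
    h = [knots[i + 1] - knots[i] for i in range(n - 1)]  # spacings h_i
    Q = [[0.0] * (n - 2) for _ in range(n)]
    R = [[0.0] * (n - 2) for _ in range(n - 2)]
    for c in range(n - 2):  # column c belongs to interior knot c + 1 (0-based)
        Q[c][c] = 1.0 / h[c]
        Q[c + 1][c] = -1.0 / h[c] - 1.0 / h[c + 1]
        Q[c + 2][c] = 1.0 / h[c + 1]
        R[c][c] = (h[c] + h[c + 1]) / 3.0
        if c + 1 < n - 2:
            R[c][c + 1] = R[c + 1][c] = h[c + 1] / 6.0
    return Q, R

# Hypothetical, unevenly spaced knots.
knots = [0.0, 0.5, 1.5, 2.0, 3.5]
Q, R = reinsch_bands(knots)

# Q' applied to values of a straight line gives zero: lines carry no roughness penalty.
g = [2.0 + 3.0 * t for t in knots]
Qt_g = [sum(Q[i][c] * g[i] for i in range(len(knots))) for c in range(len(knots) - 2)]
```

Each column of $Q$ takes a scaled second divided difference of the values at three neighboring knots, which is why a straight line is annihilated. This matches the statement that the penalty term in (13.8) is minimized by a straight line.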
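13.4.3 Selecting and Assessing the Regression Estimator

Let $\hat m_h(x)$ be the regression estimator of $m(x)$, obtained by using the set of $n$ observations $(X_1, Y_1), \dots, (X_n, Y_n)$, and parameter $h$. Note that for kernel-type estimators, $h$ is the bandwidth, but for splines, $h$ is $\lambda$ in (13.8). Define the average mean-square error of the estimator $\hat m_h$ as

$AMSE(h) = \dfrac{1}{n} \sum_{i=1}^{n} E\left[\hat m_h(X_i) - m(X_i)\right]^2$.

Let $\hat m_{(i)h}(x)$ be the estimator of $m(x)$, based on bandwidth parameter $h$, obtained by using all the observation pairs except the pair $(X_i, Y_i)$. Define the cross-validation score $CV(h)$ depending on the bandwidth/trade-off parameter $h$ as

$CV(h) = \dfrac{1}{n} \sum_{i=1}^{n} \left[Y_i - \hat m_{(i)h}(X_i)\right]^2$.   (13.12)

Because the expected $CV(h)$ score is proportional to the $AMSE(h)$ or, more precisely,

$E[CV(h)] \approx AMSE(h) + \sigma^2$,

where $\sigma^2$ is the constant variance of errors $\epsilon_i$, the value of $h$ that minimizes $CV(h)$ is likely, on average, to produce the best estimators.

For smoothing splines, and more generally, for linear smoothers $\hat m = S(h) y$, the computationally demanding procedure in (13.12) can be simplified by

$CV(h) = \dfrac{1}{n} \sum_{i=1}^{n} \left[\dfrac{Y_i - \hat m_h(X_i)}{1 - S_{ii}(h)}\right]^2$,   (13.13)

where $S_{ii}(h)$ is the diagonal element in the smoother (13.11). When $n$ is large, constructing the smoothing matrix $S(h)$ is computationally difficult. There are efficient algorithms (Hutchinson and de Hoog, 1985) that calculate only the needed diagonal elements $S_{ii}(h)$, for smoothing splines, with calculational cost of $O(n)$.

Another simplification in finding the best smoother is the generalized cross-validation criterion, GCV. The denominator in (13.13), $1 - S_{ii}(h)$, is replaced by the overall average $1 - n^{-1} \sum_{i=1}^{n} S_{ii}(h)$, or in terms of its trace, $1 - n^{-1}\,\mathrm{tr}\,S(h)$. Thus

$GCV(h) = \dfrac{1}{n} \sum_{i=1}^{n} \left[\dfrac{Y_i - \hat m_h(X_i)}{1 - \mathrm{tr}\,S(h)/n}\right]^2$.   (13.14)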
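The shortcut (13.13) can be checked numerically on a simple linear smoother. The Python sketch below (not the book's MATLAB) uses a Nadaraya-Watson smoother with a Gaussian kernel instead of a smoothing spline, since its smoother matrix is easy to write down and the identity holds exactly for it. All function names and the toy data are assumptions made for illustration.

```python
import math

def gauss(u):
    """Unnormalized Gaussian kernel; the constant cancels in the weights."""
    return math.exp(-0.5 * u * u)

def smoother_matrix(xs, h):
    """Nadaraya-Watson smoother: S[i][j] = K((X_j - X_i)/h) / sum_k K((X_k - X_i)/h)."""
    S = []
    for xi in xs:
        w = [gauss((xj - xi) / h) for xj in xs]
        tot = sum(w)
        S.append([wj / tot for wj in w])
    return S

def cv_shortcut(xs, ys, h):
    """CV(h) from a single fit, as in (13.13)."""
    S, n = smoother_matrix(xs, h), len(xs)
    fit = [sum(S[i][j] * ys[j] for j in range(n)) for i in range(n)]
    return sum(((ys[i] - fit[i]) / (1.0 - S[i][i])) ** 2 for i in range(n)) / n

def cv_brute(xs, ys, h):
    """CV(h) by refitting without the i-th pair, as in (13.12)."""
    n, total = len(xs), 0.0
    for i in range(n):
        w = [gauss((xs[j] - xs[i]) / h) if j != i else 0.0 for j in range(n)]
        pred = sum(wj * yj for wj, yj in zip(w, ys)) / sum(w)
        total += (ys[i] - pred) ** 2
    return total / n

def gcv(xs, ys, h):
    """GCV(h) as in (13.14): each S_ii is replaced by the average tr S(h) / n."""
    S, n = smoother_matrix(xs, h), len(xs)
    fit = [sum(S[i][j] * ys[j] for j in range(n)) for i in range(n)]
    rss = sum((ys[i] - fit[i]) ** 2 for i in range(n))
    trace = sum(S[i][i] for i in range(n))
    return (rss / n) / (1.0 - trace / n) ** 2

# Hypothetical data: a sine curve with a deterministic alternating perturbation.
xs = [i / 20 for i in range(21)]
ys = [math.sin(2 * math.pi * x) + 0.3 * (-1) ** i for i, x in enumerate(xs)]
scores = {h: (cv_shortcut(xs, ys, h), cv_brute(xs, ys, h), gcv(xs, ys, h))
          for h in (0.05, 0.1, 0.2)}
```

For each bandwidth the first two scores agree up to rounding, while the GCV value generally differs slightly because it averages the diagonal. In practice one would evaluate such a score over a grid of $h$ values and keep the minimizer.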
Example 13.5 Assume that $\hat{m}$ is a spline estimator and that $\lambda_1, \ldots, \lambda_n$ are eigenvalues of the matrix $QR^{-1}Q'$ from (13.11). Then $\mathrm{tr}\,S(h) = \sum_{i=1}^{n}(1 + h\lambda_i)^{-1}$. The GCV criterion becomes

$$GCV(h) = \frac{n\,RSS(h)}{\left[n - \sum_{i=1}^{n}\frac{1}{1 + h\lambda_i}\right]^2} .$$

13.4.4 Spline Inference

Suppose that the estimator $\hat{m}$ is a linear combination of the $Y_i$s,

$$\hat{m}(x) = \sum_{i=1}^{n} a_i(x)\,Y_i .$$

Then

$$E(\hat{m}(x)) = \sum_{i=1}^{n} a_i(x)\,m(X_i), \quad \text{and} \quad \mathrm{Var}(\hat{m}(x)) = \sum_{i=1}^{n} a_i(x)^2\,\sigma^2 .$$

Given $x = X_j$, we see that $\hat{m}$ is unbiased, that is, $E\,\hat{m}(X_j) = m(X_j)$, only if all $a_i = 0$, $i \ne j$. On the other hand, the variance is minimized if all $a_i$ are equal. This illustrates, once again, the tradeoff between the estimator's bias and variance.

The variance of the errors is supposed to be constant. In linear regression we estimated the variance as

$$\hat{\sigma}^2 = \frac{RSS}{n - p},$$

where $p$ is the number of free parameters in the model. Here we have an analogous estimator,

$$\hat{\sigma}^2 = \frac{RSS}{n - \mathrm{tr}(S)},$$

where $RSS = \sum_{i=1}^{n}\left[Y_i - \hat{m}(X_i)\right]^2$.

13.5 SUMMARY

This chapter has given a brief overview of both kernel estimators and local smoothers. An example from Gasser et al. (1984) shows that choosing a smoothing method over a parametric regression model can make a crucial difference in the conclusions of a data analysis. A parametric model by Preece and Baines (1978) was constructed for predicting the future height of a human based on measuring children's heights at different stages of development. The parametric regression model they derived was particularly complicated but provided a great improvement in estimating the human growth curve. Published six years later, the nonparametric regression by Gasser et al. (1984) brought out an important nuance of the growth data that could not be modeled with the Preece and Baines model (or any model that came before it): a subtle growth spurt that seems to occur in children around seven years of age. Altman (1992) notes that such a growth spurt was discussed in past medical papers, but had "disappeared from the literature following the development of the parametric models which did not allow for it."

13.6 EXERCISES

13.1. Describe how the LOESS curve can be equivalent to least-squares regression.

13.2. Data set oj287.dat is the light curve of the blazar OJ287. Blazars, also known as BL Lac Objects or BL Lacertaes, are bright, extragalactic, starlike objects that can vary rapidly in their luminosity. Rapid fluctuations of blazar brightness indicate that the energy producing region is small. Blazars emit polarized light that is featureless on a light plot. Blazars are interpreted to be active galaxy nuclei, not so different from quasars. From this interpretation it follows that blazars are in the center of an otherwise normal galaxy, and are probably powered by a supermassive black hole. Use a local-polynomial estimator to analyze the data in oj287.dat, where column 1 is the Julian time and column 2 is the brightness. How does the fit compare for the three values of $p$ in $\{0, 1, 2\}$?

13.3. Consider the function 1-2+22-230 …

REFERENCES

… "Smoothing Noisy Data with Spline Functions," Numerical Mathematics, 1,9’9-106.
Müller, H. G. (1987), "Weighted Local Regression and Kernel Methods for Nonparametric Curve Fitting," Journal of the American Statistical Association, 82, 231-238.
Nadaraya, E. A. (1964), "On Estimating Regression," Theory of Probability and Its Applications, 10, 186-190.
Preece, M. A., and Baines, M. J. (1978), "A New Family of Mathematical Models Describing the Human Growth Curve," Annals of Human Biology, 5, 1-24.
Priestley, M. B., and Chao, M. T. (1972), "Nonparametric Function Fitting," Journal of the Royal Statistical Society, Ser. B, 34, 385-392.
Reinsch, C. H. (1967), "Smoothing by Spline Functions," Numerical Mathematics, 10, 177-183.
Schmidt, G., Mattern, R., and Schuler, F. (1981), "Biomechanical Investigation to Determine Physical and Traumatological Differentiation Criteria for the Maximum Load Capacity of Head and Vertebral Column with and without Helmet under Effects of Impact," EEC Research Program on Biomechanics of Impacts, Final Report Phase III, 65, Heidelberg, Germany: Institut für Rechtsmedizin.
Silverman, B. W. (1985), "Some Aspects of the Spline Smoothing Approach to Non-parametric Curve Fitting," Journal of the Royal Statistical Society, Ser. B, 47, 1-52.
Tufte, E. R. (1983), The Visual Display of Quantitative Information, Cheshire, CT: Graphic Press.
Watson, G. S. (1964), "Smooth Regression Analysis," Sankhya, Series A, 26, 359-372.
Weierstrass, K. (1885), "Über die analytische Darstellbarkeit sogenannter willkürlicher Functionen einer reellen Veränderlichen," Sitzungsberichte der Königlich Preußischen Akademie der Wissenschaften zu Berlin, 1885 (II). Erste Mitteilung (part 1) 633-639; Zweite Mitteilung (part 2) 789-805.

Wavelets

It is error only, and not truth, that shrinks from inquiry.
Thomas Paine (1737-1809)

14.1 INTRODUCTION TO WAVELETS

Wavelet-based procedures are now indispensable in many areas of modern statistics, for example in regression, density and function estimation, factor analysis, modeling and forecasting of time series, functional data analysis, data mining and classification, with ranges of application areas in science and engineering. Wavelets owe their initial popularity in statistics to shrinkage, a simple and yet powerful procedure that is efficient for many nonparametric statistical models.

Wavelets are functions that satisfy certain requirements. The name wavelet comes from the requirement that they integrate to zero, "waving" above and below the x-axis. The diminutive in wavelet suggests its good localization. Other requirements are technical and needed mostly to ensure quick and easy calculation of the direct and inverse wavelet transform.

There are many kinds of wavelets. One can choose between smooth wavelets, compactly supported wavelets, wavelets with simple mathematical expressions, wavelets with short associated filters, etc. The simplest is the Haar wavelet, and we discuss it as an introductory example in the next section. Examples of some wavelets (from Daubechies' family) are given in Figure 14.1. Note that the scaling and wavelet functions in panels (a, b) of Figure 14.1 (Daubechies 4) are supported on a short interval (of length 3) but are not smooth; the other family member, Daubechies 16 (panels (e, f) of Figure 14.1), is smooth, but its support is much larger.

Like sines and cosines in Fourier analysis, wavelets are used as atoms in representing other functions. Once the wavelet (sometimes informally called the mother wavelet) $\psi(x)$ is fixed, one can generate a family by its translations and dilations, $\left\{\psi\left(\frac{x-b}{a}\right), (a, b) \in \mathbb{R}^+ \times \mathbb{R}\right\}$. It is convenient to take special values for $a$ and $b$ in defining the wavelet basis: $a = 2^{-j}$ and $b = k \cdot 2^{-j}$, where $k$ and $j$ are integers. This choice of $a$ and $b$ is called critical sampling and generates a sparse basis. In addition, this choice naturally connects multiresolution analysis in discrete signal processing with the mathematics of wavelets.

Wavelets, as building blocks in modeling, are localized well in both time and scale (frequency). Functions with rapid local changes (functions with discontinuities, cusps, sharp spikes, etc.) can be well represented with a minimal number of wavelet coefficients. This parsimony does not, in general, hold for other standard orthonormal bases, which may require many "compensating" coefficients to describe discontinuity artifacts or local bursts.

Heisenberg's principle states that time-frequency models cannot be precise in the time and frequency domains simultaneously. Wavelets, of course, are subject to Heisenberg's limitation, but can adaptively distribute the time-frequency precision depending on the nature of the function they are approximating. The economy of wavelet transforms can be attributed to this ability.

The above already hints at how wavelets can be used in statistics. Large and noisy data sets can be easily and quickly transformed by a discrete wavelet transform (the counterpart of the discrete Fourier transform). The data are coded by their wavelet coefficients. In addition, the descriptor "fast" in fast Fourier transforms can, in most cases, be replaced by "faster" for the wavelets. It is well known that the computational complexity of the fast Fourier transform is $O(n \cdot \log_2(n))$. For the fast wavelet transform the computational complexity goes down to $O(n)$. This means that the complexity of the algorithm (in terms of number of operations, time, or memory) is proportional to the input size, $n$.

Various data-processing procedures can now be done by processing the corresponding wavelet coefficients. For instance, one can do function smoothing by shrinking the corresponding wavelet coefficients and then back-transforming the shrunken coefficients to the original domain (Figure 14.2). A simple shrinkage method, thresholding, and some thresholding policies are discussed later.

An important feature of wavelet transforms is their whitening property. There is ample theoretical and empirical evidence that wavelet transforms reduce the dependence in the original signal. For example, it is possible, for any given stationary dependence in the input signal, to construct a biorthogonal wavelet basis such that the corresponding coefficients in the transform are uncorrelated (a wavelet counterpart of the so called Karhunen-Loève transform). For a discussion and examples, see Walter and Shen (2001).

Fig. 14.1 Wavelets from the Daubechies family. Depicted are scaling functions (left) and wavelets (right) corresponding to (a, b) 4, (c, d) 8, and (e, f) 16 tap filters.

Fig. 14.2 Wavelet-based data processing.

We conclude this incomplete inventory of wavelet transform features by pointing out their sensitivity to self-similarity in data. The scaling regularities are distinctive features of self-similar data. Such regularities are clearly visible in the wavelet domain in the wavelet spectra, a wavelet counterpart of the Fourier spectra.

More arguments can be provided: computational speed of the wavelet transform, easy incorporation of prior information about some features of the signal (smoothness, distribution of energy across scales), etc.

Basics on wavelets can be found in many texts, monographs, and papers at many different levels of exposition. Students interested in exposition beyond the coverage of this chapter should consult the monographs by Daubechies (1992), Ogden (1997), Vidakovic (1999), and Walter and Shen (2001), among others.

14.2 HOW DO THE WAVELETS WORK?

14.2.1 The Haar Wavelet

To explain how wavelets work, we start with an example. We choose the simplest and the oldest of all wavelets (we are tempted to say: grandmother of all wavelets!), the Haar wavelet, $\psi(x)$. It is a step function taking values 1 and $-1$ on the intervals $[0, \frac{1}{2})$ and $[\frac{1}{2}, 1)$, respectively. The graphs of the Haar wavelet and some of its dilations/translations are given in Figure 14.4.

The Haar wavelet has been known for almost 100 years and is used in various mathematical fields. Any continuous function can be approximated uniformly by Haar functions, even though the "decomposing atom" is discontinuous. Dilations and translations of the function $\psi$, …

Fig. 14.3 (a) Jean Baptiste Joseph Fourier 1768-1830, (b) Alfred Haar 1885-1933, and (c) Ingrid Daubechies, Professor at Princeton.

Fig. 14.4 (a) Haar wavelet $\psi(x) = \mathbf{1}(0 \le x$ … $L, k \in \mathbb{Z}\}$, where $\phi$ is called the scaling function associated with the wavelet basis $\psi_{jk}$, and $\phi_{jk}(x) = 2^{j/2}\phi(2^j x - k)$. The set of functions $\{\phi_{L,k}, k \in \mathbb{Z}\}$ spans the same subspace as $\{\psi_{jk}, j$ …

    >> W = WavMat([sqrt(2)/2 sqrt(2)/2], 2^3, 3, 2);
    >> W'
    ans =
        0.3536    0.3536    0.5000         0    0.7071         0         0         0
        0.3536    0.3536    0.5000         0   -0.7071         0         0         0
        0.3536    0.3536   -0.5000         0         0    0.7071         0         0
        0.3536    0.3536   -0.5000         0         0   -0.7071         0         0
        0.3536   -0.3536         0    0.5000         0         0    0.7071         0
        0.3536   -0.3536         0    0.5000         0         0   -0.7071         0
        0.3536   -0.3536         0   -0.5000         0         0         0    0.7071
        0.3536   -0.3536         0   -0.5000         0         0         0   -0.7071
    >> dat = [1 0 -3 2 1 0 1 2];
    >> wt = W*dat'; wt'
    ans =
        1.4142   -1.4142    1.0000   -1.0000    0.7071   -3.5355    0.7071   -0.7071
    >> data = W'*wt; data'
    ans =
        1.0000    0.0000   -3.0000    2.0000    1.0000    0.0000    1.0000    2.0000

Performing wavelet transformations via the product of the wavelet matrix $W$ and the input vector $y$ is conceptually straightforward, but of limited practical value. Storing and manipulating wavelet matrices for inputs exceeding tens of thousands in length is not feasible.

14.2.2 Wavelets in the Language of Signal Processing

Fast discrete wavelet transforms become feasible by implementing the so called cascade algorithm introduced by Mallat (1989). Let $\{h(k), k \in \mathbb{Z}\}$ and $\{g(k), k \in \mathbb{Z}\}$ be the quadrature mirror filters in the terminology of signal
processing. Two filters $h$ and $g$ form a quadrature mirror pair when

$$g(n) = (-1)^n h(1 - n).$$

The filter $h(k)$ is a low pass or smoothing filter, while $g(k)$ is the high pass or detail filter. The following properties of $h(n)$, $g(n)$ can be derived by using the so called scaling relationship, Fourier transforms and orthogonality: $\sum_k h(k) = \sqrt{2}$, $\sum_k g(k) = 0$, $\sum_k h(k)^2 = 1$, and $\sum_k h(k)h(k - 2m) = \mathbf{1}(m = 0)$.

The most compact way to describe the cascade algorithm, as well as to give an efficient recipe for determining discrete wavelet coefficients, is by using the operator representation of filters. For a sequence $a = \{a_n\}$ the operators $H$ and $G$ are defined by the following coordinate-wise relations:

$$(Ha)_k = \sum_n h(n - 2k)\,a_n, \qquad (Ga)_k = \sum_n g(n - 2k)\,a_n .$$

The operators $H$ and $G$ perform filtering and down-sampling (omitting every second entry in the output of filtering), and correspond to a single step in the wavelet decomposition. The wavelet decomposition thus consists of successive applications of the operators $H$ and $G$, in a particular order, to the input data.

Denote the original signal $y$ by $c^{(J)}$. If the signal is of length $n = 2^J$, then $c^{(J)}$ can be understood as the vector of coefficients in a series $f(x) = \sum_{k=0}^{2^J - 1} c_k^{(J)}\phi_{Jk}$, for some scaling function $\phi$. At each step of the wavelet transform we move to a coarser approximation $c^{(j-1)}$ with $c^{(j-1)} = Hc^{(j)}$ and $d^{(j-1)} = Gc^{(j)}$. Here, $d^{(j-1)}$ represents the "details" lost by degrading $c^{(j)}$ to $c^{(j-1)}$. The filters $H$ and $G$ are decimating, thus the length of $c^{(j-1)}$ or $d^{(j-1)}$ is half the length of $c^{(j)}$. The discrete wavelet transform of a sequence $y = c^{(J)}$ of length $n = 2^J$ can then be represented as another sequence of length $2^J$:

$$(c^{(0)}, d^{(0)}, d^{(1)}, \ldots, d^{(J-2)}, d^{(J-1)}). \tag{14.4}$$

In fact, this decomposition need not be carried out until the singletons $c^{(0)}$ and $d^{(0)}$ are obtained, but can be curtailed at the $(J - L)$th step,

$$(c^{(L)}, d^{(L)}, d^{(L+1)}, \ldots, d^{(J-2)}, d^{(J-1)}), \tag{14.5}$$

for any $0 \le L \le J - 1$. The resulting vector is still a valid wavelet transform. See Exercise 14.4 for a Haar wavelet transform "by hand."

    function dwtr = dwtr(data, L, filterh)
    % function dwtr = dwt(data, L, filterh);
    % Calculates the DWT of periodic data set
    % with scaling filter filterh and L detail levels.
    %
    % Example of Use:
    n = length(filterh);                   % Length of wavelet filter
    C = data(:)';                          % Data (row vector) live in V_j
    dwtr = [];                             % At the beginning dwtr empty
    H = fliplr(filterh);                   % Flip because of convolution
    G = filterh;                           % Make quadrature mirror
    G(1:2:n) = -G(1:2:n);                  %   counterpart
    for j = 1:L                            % Start cascade
        nn = length(C);                    % Length needed to
        C = [C(mod((-(n-1):-1),nn)+1) C];  %   make periodic
        D = conv(C,G);                     % Convolve,
        D = D([n:2:(n+nn-2)]+1);           %   keep periodic and decimate
        C = conv(C,H);                     % Convolve,
        C = C([n:2:(n+nn-2)]+1);           %   keep periodic and decimate
        dwtr = [D, dwtr];                  % Add detail level to dwtr
    end;                                   % Back to cascade or end
    dwtr = [C, dwtr];                      % Add the last "smooth" part

As a result, the discrete wavelet transformation can be summarized as:

$$y \longrightarrow (H^{J-L}y,\ GH^{J-1-L}y,\ GH^{J-2-L}y,\ \ldots,\ GHy,\ Gy), \quad 0 \le L \le J - 1.$$

The MATLAB program dwtr.m performs the discrete wavelet transform:

    > data = [1 0 -3 2 1 0 1 2]; filter = [sqrt(2)/2 sqrt(2)/2];
    > wt = dwtr(data, 3, filter)
    wt =
        1.4142   -1.4142    1.0000   -1.0000    0.7071   -3.5355    0.7071   -0.7071

The reconstruction formula is also simple in terms of $H$ and $G$; we first define the adjoint operators $H^*$ and $G^*$ as follows:

$$(H^*a)_k = \sum_n h(k - 2n)\,a_n, \qquad (G^*a)_k = \sum_n g(k - 2n)\,a_n .$$

Since each level is recovered from the next coarser one by $c^{(j)} = H^*c^{(j-1)} + G^*d^{(j-1)}$, recursive application leads to:

$$(c^{(L)}, d^{(L)}, d^{(L+1)}, \ldots, d^{(J-2)}, d^{(J-1)}) \longrightarrow y = (H^*)^{J-L}c^{(L)} + \sum_{j=L}^{J-1}(H^*)^{J-1-j}G^*d^{(j)},$$

for some $0 \le L \le J - 1$.

    function data = idwtr(wtr, L, filterh)
    % function data = idwt(wtr, L, filterh); Calculates the IDWT of wavelet
    % transformation wtr using wavelet filter "filterh" and L scales.
    % Use
    % >> max(abs(data - IDWTR(DWTR(data,3,filter), 3, filter)))
    %
    % ans = 4.4409e-016
    nn = length(wtr); n = length(filterh);     % Lengths
    if nargin == 2, L = round(log2(nn)); end;  % Depth of transformation
    H = filterh;                               % Wavelet H filter
    G = fliplr(H); G(2:2:n) = -G(2:2:n);       % Wavelet G filter
    LL = nn/(2^L);                             % Number of scaling coeffs
    C = wtr(1:LL);                             % Scaling coeffs
    for j = 1:L                                % Cascade algorithm
        w = mod(0:n/2-1, LL) + 1;              % Make periodic
        D = wtr(LL+1:2*LL);                    % Wavelet coeffs
        Cu(1:2:2*LL+n) = [C C(1,w)];           % Upsample & keep periodic
        Du(1:2:2*LL+n) = [D D(1,w)];           % Upsample & keep periodic
        C = conv(Cu,H) + conv(Du,G);           % Convolve & add
        C = C([n:n+2*LL-1]-1);                 % Periodic part
        LL = 2*LL;                             % Double the size of level
    end;
    data = C;                                  % The inverse DWT

Because wavelet filters uniquely correspond to the selection of the wavelet orthonormal basis, we give a table of a few common (and short) filters. See Table 14.19 for filters from the Daubechies, Coiflet and Symmlet families. (In the table, filters are indexed by the number of taps and rounded to seven decimal places.) See Exercise 14.5 for some common properties of wavelet filters.

The careful reader might have already noticed that when the length of the filter is larger than two, boundary problems occur (there are no boundary problems with the Haar wavelet). There are several ways to handle the boundaries; the two main ones are symmetric and periodic, that is, extending the original function or data set in a symmetric or periodic manner to accommodate filtering that goes outside of the domain of the function/data.

14.3 WAVELET SHRINKAGE

Wavelet shrinkage provides a simple tool for nonparametric function estimation. It is an active research area where the methodology is based on optimal shrinkage estimators for the location parameters. Some references are Donoho and Johnstone (1994, 1995), Vidakovic (1999), and Antoniadis, Bigot, and Sapatinas (2001). In this section we focus on the simplest, yet most important shrinkage strategy: wavelet thresholding.

In the discrete wavelet transform the filter $H$ is an "averaging" filter, while its mirror counterpart $G$ produces details. The wavelet coefficients correspond to details. When detail coefficients are small in magnitude, they may be
omitted without substantially affecting the general picture. Thus the idea of thresholding wavelet coefficients is a way of cleaning out unimportant details that correspond to noise.

Table 14.19 Some Common Wavelet Filters from the Daubechies, Coiflet and Symmlet Families.

    Name      Filter coefficients
    Haar      1/sqrt(2)   1/sqrt(2)
    Daub 4    0.4829629   0.8365163   0.2241439  -0.1294095
    Daub 6    0.3326706   0.8068915   0.4598775  -0.1350110  -0.0854413   0.0352263
    Coif 6    0.0385808  -0.1269691  -0.0771616   0.6074916   0.7456876   0.2265843
    Daub 8    0.2303778   0.7148466   0.6308808  -0.0279838  -0.1870348   0.0308414
              0.0328830  -0.0105974
    Symm 8   -0.0757657  -0.0296355   0.4976187   0.8037388   0.2978578  -0.0992195
             -0.0126034   0.0322231
    Daub 10   0.1601024   0.6038293   0.7243085   0.1384281  -0.2422949  -0.0322449
              0.0775715  -0.0062415  -0.0125808   0.0033357
    Symm 10   0.0273331   0.0295195  -0.0391342   0.1993975   0.7234077   0.6339789
              0.0166021  -0.1753281  -0.0211018   0.0195389
    Daub 12   0.1115407   0.4946239   0.7511339   0.3152504  -0.2262647  -0.1297669
              0.0975016   0.0275229  -0.0315820   0.0005538   0.0047773  -0.0010773
    Symm 12   0.0154041   0.0034907  -0.1179901  -0.0483117   0.4910559   0.7876411
              0.3379294  -0.0726375  -0.0210603   0.0447249   0.0017677  -0.0078007

An important feature of wavelets is that they provide unconditional bases (see the note below), and that functions that are more regular (smooth) have fast decay of their wavelet coefficients. As a consequence, wavelet shrinkage acts as a smoothing operator. The same cannot be said about Fourier methods. Shrinkage of Fourier coefficients in a Fourier expansion of a function affects the result globally, due to the non-local nature of sines and cosines. However, trigonometric bases can be localized by properly selected window functions, so that they provide local, wavelet-like decompositions.

Note: Informally, a family $\{\psi_i\}$ is an unconditional basis for a space of functions $S$ if one can determine whether the function $f = \sum_i a_i\psi_i$ belongs to $S$ by inspecting only the magnitudes of the coefficients, $|a_i|$.

Why does wavelet thresholding work? Wavelet transforms disbalance data. Informally, the "energy" in a data set (the sum of squares of the data) is preserved (equal to the sum of squares of the wavelet coefficients), but this energy is packed into a few wavelet coefficients. This disbalancing property ensures that the function of interest can be well described by a relatively small number of wavelet coefficients. Normal i.i.d. noise, on the other hand, is invariant with respect to orthogonal transforms (e.g., wavelet transforms) and passes to the wavelet domain structurally unaffected. Small wavelet coefficients likely correspond to noise because the signal part gets transformed to a few big-magnitude coefficients.

The process of thresholding wavelet coefficients can be divided into two steps. The first step is the policy choice, which is the choice of the threshold function $T$. Two standard choices are hard and soft thresholding, with corresponding transformations given by:

$$T^{hard}(d, \lambda) = d\,\mathbf{1}(|d| > \lambda), \qquad T^{soft}(d, \lambda) = (d - \mathrm{sign}(d)\lambda)\,\mathbf{1}(|d| > \lambda), \tag{14.6}$$

where $\lambda$ denotes the threshold, and $d$ generically denotes a wavelet coefficient. Figure 14.6 shows graphs of (a) hard- and (b) soft-thresholding rules when the input is the wavelet coefficient $d$.

Fig. 14.6 (a) Hard and (b) soft thresholding with $\lambda = 1$.

Another class of useful functions are general shrinkage functions. A function $S$ from that class exhibits the following properties:

$$S(d) \approx 0, \text{ for } d \text{ small}; \qquad S(d) \approx d, \text{ for } d \text{ large}.$$

Many state-of-the-art shrinkage strategies are in fact of type $S(d)$.

The second step is the choice of a threshold, if the shrinkage rule is thresholding, or of appropriate parameters, if the rule has an $S$-functional form. In the following subsection we briefly discuss some of the standard methods of selecting a threshold.
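The two rules in (14.6) are simple enough to write as one-line functions. The following is a minimal sketch, not part of the book's toolbox: the handle names `hardthr` and `softthr` and the coefficient vector `d` are hypothetical and chosen only for illustration.

```matlab
% Hard and soft thresholding of wavelet coefficients, following (14.6).
% hardthr, softthr and the vector d are illustrative names and values only.
hardthr = @(d, lambda) d .* (abs(d) > lambda);
softthr = @(d, lambda) (d - sign(d) .* lambda) .* (abs(d) > lambda);

d = [-2.5 -0.4 0 0.9 1.0 3.2];   % hypothetical detail coefficients
lambda = 1;                      % threshold, as in Fig. 14.6

dh = hardthr(d, lambda);   % survivors keep their value, the rest become 0
ds = softthr(d, lambda);   % survivors are also pulled toward 0 by lambda
```

Only the coefficients with magnitude strictly above the threshold survive, so the entry equal to 1.0 is set to zero by both rules. The soft rule is continuous in `d`, while the hard rule jumps at the threshold.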
14.3.1 Universal Threshold

In the early 1990s, Donoho and Johnstone proposed a threshold $\lambda$ (Donoho and Johnstone, 1993; 1994) based on a result in the theory of extrema of normal random variables.

Theorem 14.1 Let $Z_1, \ldots, Z_n$ be a sequence of i.i.d. standard normal random variables. Define

$$A_n = \left\{\max_{i=1,\ldots,n}|Z_i| \le \sqrt{2\log n}\right\}.$$

Then $P(A_n) \to 1$ as $n \to \infty$. In addition, if

$$B_n(t) = \left\{\max_{i=1,\ldots,n}|Z_i| > t + \sqrt{2\log n}\right\},$$

then $P(B_n(t))$ …

    … lambda); figure(3); plot([1:1024], swt, '-')
    % (vi) Back-transform the thresholded object to the time
    % domain. Of course, retain the same filter and value L.

Fig. 14.7 Demo output: (a) Original doppler signal, (b) Noisy doppler, (c) Wavelet coefficients that "survived" thresholding, (d) Inverse-transformed thresholded coefficients.

Example 14.3 A researcher was interested in predicting earthquakes by the level of water in nearby wells. She had a large ($8192 = 2^{13}$ measurements) data set of water levels taken every hour over a period of about one year in a California well. Here is the description of the problem:

The ability of water wells to act as strain meters has been observed for centuries. Lab studies indicate that a seismic slip occurs along a fault prior to rupture. Recent work has attempted to quantify this response, in an effort to use water wells as sensitive indicators of volumetric strain. If this is possible, water wells could aid in earthquake prediction by sensing precursory earthquake strain. We obtained water level records from a well in southern California, collected over a year time span. Several moderate size earthquakes (magnitude 4.0-6.0) occurred in close proximity to the well during this time interval. There is a significant amount of noise in the water level record which must first be filtered out. Environmental factors
such as earth tides and atmospheric pressure create noise with frequencies ranging from seasonal to semidiurnal. The amount of rainfall also affects the water level, as do surface loading, pumping, recharge (such as an increase in water level due to irrigation), and sonic booms, to name a few. Once the noise is subtracted from the signal, the record can be analyzed for changes in water level, either an increase or a decrease depending upon whether the aquifer is experiencing a tensile or compressional volume strain, just prior to an earthquake.

This data set is given in earthquake.dat. A plot of the raw data for hourly measurements over one year ($8192 = 2^{13}$ observations) is given in Figure 14.8(a). The detail showing the oscillation at the earthquake time is presented in Figure 14.8(b).

Fig. 14.8 Panel (a) shows $n = 8192$ hourly measurements of the water level for a well in an earthquake zone. Notice the wide range of water levels at the time of an earthquake around $t = 417$. Panel (b) focuses on the data around the earthquake time. Panel (c) shows the result of LOESS, and (d) gives a wavelet-based reconstruction.

Application of the LOESS smoother captured the trend, but the oscillation artifact is smoothed out, as is evident from Figure 14.8(c). After applying the Daubechies 8 wavelet transform and universal thresholding we got a fairly smooth baseline function with a preserved jump at the earthquake time. The processed data are presented in Figure 14.8(d). This feature of wavelet methods demonstrates data adaptivity and locality.

How can this be explained? The wavelet coefficients corresponding to the earthquake feature (the big oscillation) are large in magnitude and are located at all levels, even the finest detail level. These few coefficients "survived" the thresholding, and the oscillation feature shows in the inverse transformation. See Exercise 14.6 for the suggested follow-up.

Fig. 14.9 One step in the wavelet transformation of 2-D data, exemplified on the celebrated Lenna image.
Example 14.4 The most important application of 2-D wavelets is in image processing. Any gray-scale image can be represented by a matrix $A$ in which the entries $a_{ij}$ correspond to color intensities of the pixel at location $(i, j)$. We assume, as is standardly done, that $A$ is a square matrix of dimension $2^n \times 2^n$, $n$ integer.

The process of wavelet decomposition proceeds as follows. The filters $H$ and $G$ are applied on the rows of the matrix $A$. Two resulting matrices $H_rA$ and $G_rA$ are obtained, both of dimension $2^n \times 2^{n-1}$. (The subscript $r$ indicates that the filters are applied on rows of the matrix $A$; $2^{n-1}$ is obtained in the dimension of $H_rA$ and $G_rA$ because wavelet filtering decimates.) Now the filters $H$ and $G$ are applied on the columns of $H_rA$ and $G_rA$, and matrices $H_cH_rA$, $G_cH_rA$, $H_cG_rA$ and $G_cG_rA$ of dimension $2^{n-1} \times 2^{n-1}$ are obtained. The matrix $H_cH_rA$ is the average, while the matrices $G_cH_rA$, $H_cG_rA$ and $G_cG_rA$ are details (see Figure 14.9).

The process could be continued in the same fashion with the smoothed matrix $H_cH_rA$ as an input, and can be carried out until a single number is obtained as an overall "smooth," or can be stopped at any step. Notice that in the decomposition exemplified in Figure 14.9, the matrix is decomposed into one smooth and three detail submatrices.

A powerful generalization of wavelet bases is the concept of wavelet packets. Wavelet packets result from applications of the operators $H$ and $G$, discussed in Section 14.2.2, in any order. This corresponds to an overcomplete system of functions from which the best basis for a particular data set can be selected.

14.4 EXERCISES

14.1. Show that the matrix $W'$ in (14.2) is orthogonal.

14.2. In (14.1) we argued that $\psi_{jk}$ and $\psi_{j'k'}$ are orthogonal functions whenever $j = j'$ and $k = k'$ are not satisfied simultaneously. Argue that $\phi_{jk}$ and $\psi_{j'k'}$ are orthogonal whenever $j' \ge j$. Find an example in which $\phi_{jk}$ and $\psi_{j'k'}$ are not orthogonal if $j'$ …

    >> bsam = [];
    >> B = 50000;
    >> for b = 1:B
         bs = bootsample(pairs);
         ccbs = corrcoef(bs);
         bsam = [bsam ccbs(1,2)];
    >> end

where the function bootsample(X) is a simple m-file that resamples vecin, an $n \times p$ data matrix with $n$ equal to the number of observations and $p$ equal to the dimension of a single observation.

    function vecout = bootsample(vecin)
    [n, p] = size(vecin);
    selected_indices = floor(1 + n.*(rand(1,n)));
    vecout = vecin(selected_indices, :);
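The resampling loop above yields a vector of bootstrap correlations, which can then be summarized by a bootstrap standard error and a percentile interval. The sketch below is self-contained and inlines the index draw used by bootsample. The data matrix `pairs` here is hypothetical (it is not the data set the book's loop was run on), and `B` is reduced to keep the run short.

```matlab
% Bootstrap standard error and 95% percentile interval for a correlation.
% The 12-by-2 matrix below is HYPOTHETICAL data, used only for illustration.
pairs = [ 1  2.1;  2  2.9;  3  4.2;  4  4.8;  5  6.1;  6  6.7; ...
          7  8.3;  8  8.8;  9 10.2; 10 10.9; 11 12.4; 12 12.6];
n = size(pairs, 1);
B = 2000;                        % number of bootstrap replicates
bsam = zeros(1, B);              % preallocated instead of grown in the loop
for b = 1:B
    idx = floor(1 + n .* rand(1, n));   % row indices drawn with replacement
    cc = corrcoef(pairs(idx, :));       % correlation of the resampled pairs
    bsam(b) = cc(1, 2);
end
se_boot = std(bsam);                                   % bootstrap standard error
srt = sort(bsam);
ci = [srt(ceil(0.025 * B)), srt(floor(0.975 * B))];    % percentile interval
```

Preallocating `bsam` avoids repeatedly growing the vector, which matters when B is as large as 50000.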
NONPARAMETRIC BOOTSTRAP

Example 15.2 Trimmed Mean. For robust estimation of the population mean, outliers can be trimmed off the sample, ensuring the estimator will be less influenced by the tails of the distribution. If we trim off almost all of the data, we will end up using the sample median. Suppose we trim off 50% of the data by excluding the smallest and largest 25% of the sample. Obviously, the standard error of this estimator is not easily tractable, so no exact confidence interval can be constructed. This is where the bootstrap technique can help out. In this example, we will focus on constructing a two-sided 95% confidence interval for $\mu$, where $\mu$ is an alternative measure of central tendency, the same as the population mean if the distribution is symmetric.

If we compute the trimmed mean from the sample as $\hat{\mu}_n$, it is easy to generate bootstrap samples and do the same. In this case, limiting $B$ to 1000 or 2000 will make computing easier, because each repeated sample must be ranked and trimmed before $\hat{\mu}$ can be computed. Let $\hat{\mu}(.025)$ and $\hat{\mu}(.975)$ be the lower and upper quantiles from the bootstrap sample $\hat{\mu}_1, \ldots, \hat{\mu}_B$.

The MATLAB m-file trimmean(x, P) trims P% (so 0 …

    >> x = [11,13,14,32,55,58,61,67,69,73,73,89,90,93,94,94,95,96,99,99];
    >> m = trimmean(x,10)
    m =
       71.7895
    >> m2 = mean(x)
    m2 =
       68.7500
    >> ciboot(x,'trimmean',5,.90,1000,10)
    ans =
       57.6171   71.7895   82.9474

Estimating Standard Error. The most common application of a simple bootstrap is to estimate the standard error of the estimator $\hat{\theta}_n$. The algorithm is similar to the general nonparametric bootstrap:

• Generate $B$ bootstrap samples of size $n$.
• Evaluate the bootstrap estimators $\hat{\theta}_1, \ldots, \hat{\theta}_B$.
• Estimate the standard error of $\hat{\theta}_n$ as

$$\hat{\sigma}_{\hat{\theta}_n} = \sqrt{\frac{\sum_{i=1}^{B}\left(\hat{\theta}_i - \hat{\theta}^*\right)^2}{B - 1}},$$

where $\hat{\theta}^* = B^{-1}\sum_{i=1}^{B}\hat{\theta}_i$.

15.3 BIAS CORRECTION FOR NONPARAMETRIC INTERVALS

The percentile method described in the last section is simple, easy to use, and has good large sample properties. However, the coverage probability is not accurate for many small sample problems. The Acceleration and Bias-Correction (or $BC_a$) method improves on the percentile method by adjusting the percentiles (e.g., $\hat{\theta}(1 - \alpha/2)$, $\hat{\theta}(\alpha/2)$) chosen from the bootstrap sample. A detailed discussion is provided in Efron and Tibshirani (1993).

The $BC_a$ interval is determined by the proportion of the bootstrap estimates $\hat{\theta}_i$ less than $\hat{\theta}_n$, i.e., $p_0 = B^{-1}\sum_{i=1}^{B}\mathbf{1}(\hat{\theta}_i < \hat{\theta}_n)$; the bias factor $z_0 = \Phi^{-1}(p_0)$ expresses this bias
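, where $\Phi$ is the standard normal CDF, so that values of $z_0$ away from zero indicate a problem. Let

$$a_0 = \frac{\sum_{i=1}^{B}\left(\hat{\theta}^* - \hat{\theta}_i\right)^3}{6\left[\sum_{i=1}^{B}\left(\hat{\theta}^* - \hat{\theta}_i\right)^2\right]^{3/2}}$$

be the acceleration factor, where $\hat{\theta}^*$ is the average of the bootstrap estimates $\hat{\theta}_1, \ldots, \hat{\theta}_B$. It gets this name because it measures the rate of change in $\sigma_{\hat{\theta}_n}$ as a function of $\theta$.

Finally, the $100(1 - \alpha)\%$ $BC_a$ interval is computed as

$$\left[\hat{\theta}(q_1),\ \hat{\theta}(q_2)\right], \tag{15.1}$$

where, with $z_\gamma$ denoting the $\gamma$ quantile of the standard normal distribution,

$$q_1 = \Phi\left(z_0 + \frac{z_0 + z_{\alpha/2}}{1 - a_0(z_0 + z_{\alpha/2})}\right), \qquad q_2 = \Phi\left(z_0 + \frac{z_0 + z_{1-\alpha/2}}{1 - a_0(z_0 + z_{1-\alpha/2})}\right).$$

Note that if $z_0 = 0$ (no measured bias) and $a_0 = 0$, then (15.1) is the same as the percentile bootstrap interval. In the MATLAB m-file ciboot, the $BC_a$ is an option (6) for the nonparametric interval. For the trimmed mean example, the bias-corrected interval is shifted upward:

    >> ciboot(x,'trimmean',6,.90,1000,10)
    ans =
       60.0412   71.7895   84.4211

Example 15.3 Recall the data from Crowder et al. (1991), which were discussed in Example 10.2. The data contain strength measurements (in coded units) for 48 pieces of weathered cord. Seven of the pieces of cord were damaged and yielded strength measurements that are considered right censored. The following MATLAB code uses a bias-corrected bootstrap to calculate a 95% confidence interval for the probability that the strength measure is equal to or less than 50, that is, $F(50)$.

    >> data = [36.3,41.7,43.9,49.9,50.1,50.8,51.9,52.1,52.3,52.3,...
               52.4,52.6,52.7,53.1,53.6,53.6,53.9,53.9,54.1,54.6,...
               54.8,54.8,55.1,55.4,55.9,56.0,56.1,56.5,56.9,57.1,...
               57.1,57.3,57.7,57.8,58.1,58.9,59.0,59.1,59.6,60.4,...
               60.7,26.8,29.6,33.4,35.0,40.0,41.9,42.5];
    >> censor = [ones(1,41),zeros(1,7)];
    >> [kmest,sortdat,sortcen] = KMcdfSM(data',censor',0);
    >> prob = kmest(sum(50.0>=data),1)
    prob =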
        0.0949

    function fkmt = kme_at_50(dt)
    % this function performs Kaplan-Meier
    % estimation with given parameter
    % and produces estimated F(50.0)
    [kmest sortdat] = KMcdfSM(dt(:,1), dt(:,2), 0);
    fkmt = kmest(sum(50.0 >= sortdat), 1);

Using the kme_at_50.m and ciboot functions we obtain a confidence interval for $F(50)$ based on 1000 bootstrap replicates:

    >> ciboot([data' censor'], 'kme_at_50', 5, .95, 1000)
    ans =
        0.0227    0.0949    0.1918
    >> % a 95% CI for F(50) is (0.0227, 0.1918)

    function fkmt = kme_all_x(dt)
    % this function performs Kaplan-Meier estimation with given parameter
    % and gives estimated F() for all data points
    [kmest sortdat] = KMcdfSM(dt(:,1), dt(:,2), 0);
    data = [36.3, 41.7, //...deleted...//, 41.9, 42.5];
    temp_val = [];
    % calculate each CDF F() value for all data points
    for i = 1:length(data)
        if sum(data(i) >= sortdat) > 0
            temp_val = [temp_val kmest(sum(data(i) >= sortdat), 1)];
        else
            % when there is no observation, CDF is simply 0
            temp_val = [temp_val 0];
        end
    end
    fkmt = temp_val;

The MATLAB functions ciboot and kme_all_x are used to produce Figure 15.4:

    >> ci = ciboot([data' censor'], 'kme_all_x', 5, .95, 1000);
    >> figure;
    >> plot(data', ci(:,2)', '.');
    >> hold on;
    >> plot(data', ci(:,1)', '+');
    >> plot(data', ci(:,3)', '*');
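Fig. 15.4 95% confidence band for the CDF of Crowder's data using 1000 bootstrap samples. The lower boundary of the confidence band is plotted with marker '+', while the upper boundary is plotted with marker '*'.

15.4 THE JACKKNIFE

The jackknife procedure, introduced by Quenouille (1949), is a resampling method for estimating the bias and variance of $\hat{\theta}_n$. It predates the bootstrap and actually serves as a special case. The resample is based on the "leave one out" method, which was computationally easier when computing resources were limited.

The $i$th jackknife sample is $(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$. Let $\hat{\theta}_{(i)}$ be the estimator of $\theta$ based only on the $i$th jackknife sample. The jackknife estimate of the bias is defined as

$$b_J = (n - 1)\left(\hat{\theta}^* - \hat{\theta}_n\right),$$

where $\hat{\theta}^* = n^{-1}\sum_{i=1}^{n}\hat{\theta}_{(i)}$, so that the bias-corrected estimate is $\hat{\theta}_n - b_J$. The jackknife estimator for the variance of $\hat{\theta}_n$ is

$$\hat{\sigma}_J^2 = \frac{n - 1}{n}\sum_{i=1}^{n}\left(\hat{\theta}_{(i)} - \hat{\theta}^*\right)^2 .$$

The jackknife serves as a poor man's version of the bootstrap. That is, it estimates bias and variance in the same way, but with a limited resampling mechanism. In MATLAB, the m-file jackknife(x, function, p1, ...) produces the jackknife estimate for the input function. The function jackrsp(x, k) produces a matrix of jackknife samples (taking k elements out, with a default of k = 1).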
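The leave-one-out computation can also be written directly in a few lines. The following is a minimal sketch that does not use the jackknife or jackrsp m-files; the variable names are illustrative. It uses the sign convention $b_J = (n-1)(\hat{\theta}^* - \hat{\theta}_n)$, so the corrected estimate is $\hat{\theta}_n - b_J$. The statistic is the sample mean, chosen because the answer is known in closed form: the bias estimate is zero and the variance estimate equals $s^2/n$.

```matlab
% Jackknife bias and variance by direct leave-one-out computation.
x = [11 13 14 32 55 58 61 67 69 73 73 89 90 93 94 94 95 96 99 99];  % data of Example 15.2
stat = @mean;             % statistic of interest (any handle acting on a vector)
n = numel(x);
theta_hat = stat(x);
theta_i = zeros(1, n);
for i = 1:n
    theta_i(i) = stat(x([1:i-1, i+1:n]));   % estimate from the i-th jackknife sample
end
theta_star = mean(theta_i);
bias_J = (n - 1) * (theta_star - theta_hat);             % jackknife bias estimate
var_J  = (n - 1) / n * sum((theta_i - theta_star).^2);   % jackknife variance estimate
theta_corrected = theta_hat - bias_J;                    % bias-corrected estimate
```

Replacing `stat` with another function handle gives the jackknife estimates for that statistic, at the cost of n evaluations.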
    >> [b,v,f] = jackknife('trimmean', x', 10)   % note: row vector input
    b =
       -0.1074     % Jackknife estimate of bias
    v =
       65.3476     % Jackknife estimate of variance
    f =
       71.8968     % Jackknife corrected estimate

The jackknife performs well in most situations, but poorly in some. When $\hat{\theta}_n$ can change significantly with slight changes to the data, the jackknife can be temperamental. This is true with $\hat{\theta}$ = median, for example. In such cases, it is recommended to augment the resampling by using a delete-d jackknife, which leaves out $d$ observations for each jackknife sample. See Chapter 11 of Efron and Tibshirani (1993) for details.

15.5 BAYESIAN BOOTSTRAP

The Bayesian bootstrap (BB), a Bayesian analogue to the bootstrap, was introduced by Rubin (1981). In Efron's standard bootstrap, each observation $X_i$ from the sample $X_1, \ldots, X_n$ has a probability of $1/n$ of being selected, and after the selection process the relative frequency $f_i$ of $X_i$ in the bootstrap sample belongs to the set $\{0, 1/n, 2/n, \ldots, (n-1)/n, 1\}$. Of course, $\sum_i f_i = 1$. Then, for example, if the statistic to be evaluated is the sample mean, its bootstrap replicate is $\bar{X}^* = \sum_i f_iX_i$.

In Bayesian bootstrapping, at each replication a discrete probability distribution $g = \{g_1, \ldots, g_n\}$ on $\{1, 2, \ldots, n\}$ is generated and used to produce bootstrap statistics. Specifically, the distribution $g$ is generated by generating $n - 1$ uniform random variables $U_i \sim \mathcal{U}(0, 1)$, $i = 1, \ldots, n - 1$, and ordering them according to $\tilde{U}_j = U_{j:n-1}$, with $\tilde{U}_0 = 0$ and $\tilde{U}_n = 1$. Then the probability of $X_i$ is defined as

$$g_i = \tilde{U}_i - \tilde{U}_{i-1}, \quad i = 1, \ldots, n.$$

If the sample mean is the statistic of interest, its Bayesian bootstrap replicate is a weighted average of the sample, $\bar{X}^* = \sum_i g_iX_i$. The following example explains why this resampling technique is Bayesian.

Example 15.4 Suppose that $X_1, \ldots, X_n$ are i.i.d. $\mathcal{B}er(p)$, and we seek a BB estimator of $p$. Let $n_1$ be the number of ones in the sample and $n - n_1$ the number of zeros. If the BB distribution $g$ is generated, then let $P_1 = \sum_i g_i\mathbf{1}(X_i = 1)$ be the probability of 1 in the sample. The distribution of $P_1$ is simple, because the gaps in $U_1, \ldots, U_{n-1}$ follow the $(n-1)$-variate Dirichlet distribution, $\mathcal{D}ir(1, 1, \ldots, 1)$. Consequently, $P_1$ is the sum of $n_1$ gaps and is distributed $\mathcal{B}e(n_1, n - n_1)$. Note that $\mathcal{B}e(n_1, n - n_1)$ is, in fact, the posterior
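for $P_1$ if the prior is $\propto [P_1(1 - P_1)]^{-1}$. That is, if for $x \in \{0, 1\}$,

$$P(X = x \mid P_1) = P_1^x(1 - P_1)^{1-x}, \qquad P_1 \propto [P_1(1 - P_1)]^{-1},$$

then the posterior is

$$[P_1 \mid X_1, \ldots, X_n] \sim \mathcal{B}e(n_1, n - n_1).$$

For the general case, when the $X_i$ take $d \le n$ different values, the Bayesian interpretation is still valid; see Rubin's (1981) article.

Example 15.5 We revisit Hubble's data and give a BB estimate of the variability of the observed coefficient of correlation $r$. For each BB distribution $g$ calculate

$$r^* = \frac{\sum_i g_iX_iY_i - \left(\sum_i g_iX_i\right)\left(\sum_i g_iY_i\right)}{\sqrt{\left[\sum_i g_iX_i^2 - \left(\sum_i g_iX_i\right)^2\right]\left[\sum_i g_iY_i^2 - \left(\sum_i g_iY_i\right)^2\right]}},$$

where $(X_i, Y_i)$, $i = 1, \ldots, 24$, are the observed pairs of distances and velocities. The MATLAB program below performs the BB resampling.

    >> x = [0.032 0.034 0.214 0.263 0.275 0.275 0.45 0.5 0.5 0.63 0.8 0.9 ...
            0.9 0.9 0.9 1.0 1.1 1.1 1.4 1.7 2.0 2.0 2.0 2.0];      % Mpc
    >> y = [170 290 -130 -70 -185 -220 200 290 270 200 300 -30 ...
            650 150 500 920 450 500 500 960 500 850 800 1090];     % velocity
    >> n = 24; corr(x', y');
    >> B = 50000;        % number of BB replicates
    >> bbcorr = [];      % store BB correlation replicates
    >> for i = 1:B
         sampl = rand(1, n-1);
         osamp = sort(sampl);
         cuts = [0 osamp 1];
         gis = diff(cuts, 1);
         % gis is BB distribution, corrbb is correlation
         % with gis as weights
         ssx = sum(gis .* x); ssy = sum(gis .* y);
         ssx2 = sum(gis .* x.^2); ssy2 = sum(gis .* y.^2);
         ssxy = sum(gis .* x .* y);
         corrbb = (ssxy - ssx*ssy) / ...
                  sqrt((ssx2 - ssx^2)*(ssy2 - ssy^2));   % correlation replicate
         bbcorr = [bbcorr corrbb];   % add replicate to the storage sequence
    >> end
    >> figure(1)
    >> hist(bbcorr, 80)
    >> std(bbcorr)
    >> zs = 1/2 * log((1 + bbcorr)./(1 - bbcorr));   % Fisher's z
    >> figure(2)
    >> hist(zs, 80)
    >> std(zs)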
Fig. 15.5 (a) The histogram of 50,000 BB resamples for the correlation between the distance and velocity in the Hubble data; (b) Fisher z-transform of the BB correlations.

The histograms of the correlation bootstrap replicates and their z-transforms in Figure 15.5(a-b) look similar to those in Figure 15.3(c-d). Numerically, $B = 50{,}000$ replicates gave the standard deviation of the observed $r$ as 0.0635 and the standard deviation of $z = \frac{1}{2}\log((1 + r)/(1 - r))$ as 0.1704, slightly smaller than the theoretical $(24 - 3)^{-1/2} = 0.2182$.

15.6 PERMUTATION TESTS

Suppose that in a statistical experiment the sample or samples are taken and a statistic $S$ is constructed for testing a particular hypothesis $H_0$. The values of $S$ that seem extreme from the viewpoint of $H_0$ are critical for this hypothesis. The decision whether the observed value of the statistic $S$ is extreme is made by looking at the distribution of $S$ when $H_0$ is true. But what if such a distribution is unknown or too complex to find? What if the distribution of $S$ is known only under stringent assumptions that we are not willing to make?

Resampling methods consisting of permuting the original data can be used to approximate the null distribution of $S$. Given the sample, one forms the permutations that are consistent with the experimental design and $H_0$, and then calculates the value of $S$ for each. The values of $S$ are used to estimate its density (often as a histogram), and using this empirical density we find an approximate p-value, often called a permutation p-value.

What permutations are consistent with $H_0$? Suppose that in a two-sample
problem we want to compare the means of two populations based on two independent samples $X_1,\dots,X_m$ and $Y_1,\dots,Y_n$. The null hypothesis $H_0$ is $\mu_X = \mu_Y$. The permutations consistent with $H_0$ would be all permutations of the combined (concatenated) sample $X_1,\dots,X_m,Y_1,\dots,Y_n$. Or suppose we have a repeated measures design in which observations are triplets corresponding to three treatments, i.e., $(X_{11},X_{12},X_{13}),\dots,(X_{n1},X_{n2},X_{n3})$, and that $H_0$ states that the three treatment means are the same, $\mu_1 = \mu_2 = \mu_3$. Then permutations consistent with this experimental design are random permutations within the triplets $(X_{i1},X_{i2},X_{i3})$, $i=1,\dots,n$, and a possible permutation might be, for example,

$$(X_{12},X_{11},X_{13}),\ (X_{21},X_{23},X_{22}),\ \dots,\ (X_{n3},X_{n1},X_{n2}).$$

Thus, depending on the design and $H_0$, consistent permutations can be quite different.

Example 15.6 Byzantine Coins. To illustrate the spirit of permutation tests we use data from a paper by Hendy and Charles (1970) (see also Hand et al., 1994) that represent the silver content (%Ag) of a number of Byzantine coins discovered in Cyprus. The coins (Figure 15.6) are from the first and fourth coinage in the reign of King Manuel I, Comnenus (1143-1180).

1st coinage   5.9  6.8  6.4  7.0  6.6  7.7  7.2  6.9  6.2
4th coinage   5.3  5.6  5.5  5.1  6.2  5.8  5.8

The question of interest is whether or not there is statistical evidence to suggest that the silver content of the coins was significantly different in the later coinage.

Fig. 15.6 A coin of Manuel I Comnenus (1143-1180).

Of course, the two-sample t-test or one of its nonparametric counterparts could be applied here, but we will use the permutation test for purposes of illustration. The following MATLAB commands perform the test:
>> coins = [5.9 6.8 6.4 7.0 6.6 7.7 7.2 6.9 6.2 ...
            5.3 5.6 5.5 5.1 6.2 5.8 5.8];
>> coins1 = coins(1:9); coins2 = coins(10:16);
>> S = (mean(coins1)-mean(coins2))/sqrt(var(coins1)+var(coins2))
>> Sps = []; asl = 0;   % Sps is permutation S,
>>                      % asl is achieved significance level
>> N = 10000;
>> for i = 1:N
     coinsp = coins(randperm(16));
     coinsp1 = coinsp(1:9); coinsp2 = coinsp(10:16);
     Sp = (mean(coinsp1)-mean(coinsp2))/...
        sqrt(var(coinsp1)+var(coinsp2));
     Sps = [Sps Sp];
     asl = asl + (abs(Sp) > S);
   end
>> asl = asl/N

The value for $S$ is 1.7301, and the permutation p-value, or the achieved significance level, is asl = 0.0004. Panel (a) in Figure 15.7 shows the permutation null distribution of the statistic $S$, and the observed value of $S$ is indicated by the dotted vertical line. Note that there is nothing special about selecting

$$S = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^2 + s_2^2}},$$

and that any other statistic that sensibly measures deviation from $H_0: \mu_1 = \mu_2$ could be used. For example, one could use $S = \mathrm{median}(X_1)/s_1 - \mathrm{median}(X_2)/s_2$, or simply $S = \bar{X}_1 - \bar{X}_2$.

To demonstrate how the choice of what to permute depends on the statistical design, we consider again the two-sample problem but with paired observations. In this case, the permutations are done within the pairs, independently from pair to pair.

Example 15.7 Left-handed Grippers. Measurements of the left- and right-hand gripping strengths of 10 left-handed writers are recorded.

Person           1    2    3    4    5    6    7    8    9    10
Left hand (X)    140  90   125  130  95   121  85   97   131  110
Right hand (Y)   138  87   110  132  96   120  86   90   129  100

Do the data provide strong evidence that people who write with their left hand have greater gripping strength in the left hand than they do in the right hand?
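One way to read the within-pair permutations described above: swapping the two members of a pair only flips the sign of that pair's difference. The following is a minimal sketch under that observation, not the book's own listing. The names `d`, `signs` and `Sperm`, the seed, and the one-sided comparison are our choices; the one-sided form is used because the question asks about greater strength in the left hand.

```matlab
% Sketch: within-pair permutation as random sign flips of paired differences.
rng(2);                                          % arbitrary seed
dataL = [140 90 125 130 95 121 85 97 131 110];   % left hand (from the text)
dataR = [138 87 110 132 96 120 86 90 129 100];   % right hand (from the text)
d = dataL - dataR;                               % paired differences
S = mean(d);                                     % observed statistic
N = 10000;                                       % number of random permutations
signs = 2*(rand(N, numel(d)) > 0.5) - 1;         % each entry is +1 or -1
Sperm = (signs * d') / numel(d);                 % N permuted mean differences
asl = mean(Sperm >= S)                           % one-sided achieved significance level
```

Each row of `signs` plays the role of one independent choice of (1,2) or (2,1) for every pair, so the vector `Sperm` approximates the same permutation null distribution without an inner loop over pairs.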
In the MATLAB solution provided below, dataL and dataR are the paired measurements, and pdataL and pdataR are random permutations, either {1,2} or {2,1}, of the 10 original pairs. The statistic $S$ is the difference of the sample means. The permutation null distribution is shown as a non-normalized histogram in Figure 15.7(b). The position of $S$ with respect to the histogram is marked by a dotted line.

>> dataL = [140, 90, 125, 130, 95, 121, 85, 97, 131, 110];
>> dataR = [138, 87, 110, 132, 96, 120, 86, 90, 129, 100];
>> S = mean(dataL - dataR)
>> data = [dataL; dataR];
>> means = []; asl = 0; N = 10000;
>> for i = 1:N
     pdata = [];
     for j = 1:10
        pairs = data(randperm(2), j);   % permute within the jth pair
        pdata = [pdata pairs];
     end
     pdataL = pdata(1,:); pdataR = pdata(2,:);
     pmean = mean(pdataL - pdataR);
     means = [means pmean];
     asl = asl + (abs(pmean) > S);
   end
>> asl = asl/N

Fig. 15.7 Panels (a) and (b) show the permutation null distribution of the statistic S and the observed value of S (marked by a dotted line) for the cases of (a) Byzantine coins, and (b) Left-handed grippers.

15.7 MORE ON THE BOOTSTRAP

There are several excellent resources for learning more about bootstrap techniques, and there are many different kinds of bootstraps that work on various problems. Besides Efron and Tibshirani (1993), books by Chernick (1999) and Davison and Hinkley (1997) provide excellent overviews with numerous helpful examples. For dependent data, various bootstrapping strategies have been proposed, such as the block bootstrap, the stationary bootstrap, the wavelet-based bootstrap (wavestrap), and so on. A monograph by Good (2000) gives comprehensive coverage of permutation tests.

Bootstrapping is not infallible. Data sets that might lead to poor performance include those with missing values and excessive censoring. The choice of statistic is also critical; see Exercise 15.6. If there are few observations in the tail of the distribution, bootstrap statistics based on the EDF perform poorly because they are deduced using only a few of those extreme observations.

15.8 EXERCISES

15.1. Generate a sample of 20 from the gamma distribution with $\lambda = 0.1$ and $r = 3$. Compute a 90% confidence interval for the mean using (a) the standard normal approximation, (b) the percentile method and (c) the bias-corrected method. Repeat this 1000 times and report the actual coverage probability of the three intervals you constructed.

15.2. For the case of estimating the mean with $\bar{X}$, derive the expected value of the jackknife estimate of bias and variance.

15.3. Refer to insect waiting times for the female Western White Clematis in Table 10.15. Use the percentile method to find a 90% confidence interval for $F(30)$, the probability that the waiting time is less than or equal to 30 minutes.

15.4. In a data set of size $n$ generated from a continuous $F$, how many distinct bootstrap samples are possible?

15.5. Refer to the dominance-submissiveness data in Exercise 7.3. Construct a 95% confidence interval for the correlation using the percentile bootstrap and the jackknife. Compare your results with the normal approximation described in Section 2 of Chapter 7.

15.6. Suppose we have three observations from $\mathcal{U}(0,\theta)$. If we are interested in estimating $\theta$, the MLE for it is $\hat{\theta} = X_{3:3}$, the largest observation. If we use a bootstrap sampling procedure to estimate the variance of the MLE, what is the distribution of the bootstrap estimator for $\theta$?

15.7. Seven patients each underwent three different methods of kidney dialysis. The following values were obtained for weight change in kilograms between dialysis sessions:

Patient   Treatment 1   Treatment 2   Treatment 3
1         2.90          2.97          2.67
2         2.56          2.45          2.62
3         2.88          2.76          1.84
4         2.73          2.20          2.33
5         2.50          2.16          1.27
6         3.18          2.89          2.39
7         2.83          2.87          2.39

Test the null hypothesis that there is no difference in mean weight change among treatments. Use a properly designed permutation test.

15.8. In a controlled clinical trial, Physician's Health Study I, which began in 1982 and ended in 1987, more than 22,000 physicians participated. The participants were randomly assigned to two groups: (i) Aspirin and (ii) Placebo, where the aspirin group took 325 mg of aspirin every second day. At the end of the trial, the number of participants who suffered from Myocardial Infarction was assessed. The counts are given in the following table:

           MyoInf   No MyoInf   Total
Aspirin    104      10933       11037
Placebo    189      10845       11034

The popular measure in assessing results in clinical trials is the Risk Ratio (RR), which is the ratio of proportions of cases (risks) in the two groups/treatments. From the table,

$$RR = R_a/R_p = \frac{104/11037}{189/11034} = 0.55.$$

The interpretation of RR is that the risk of Myocardial Infarction for the Placebo group is approximately 1/0.55 = 1.82 times higher than that for the Aspirin group.

(i) With MATLAB, construct a bootstrap estimate for the variability of RR. Hint:

aspi = [zeros(10933,1); ones(104,1)];
plac = [zeros(10845,1); ones(189,1)];
RR = (sum(aspi)/length(aspi))/(sum(plac)/length(plac));
BRR = []; B = 10000;
for b = 1:B
   baspi = bootsample(aspi);
   bplac = bootsample(plac);
   BRR = [BRR (sum(baspi)/length(baspi))/(sum(bplac)/length(bplac))];
end

(ii) Find the variability of the difference of the risks $R_a - R_p$, and of the logarithm of the odds ratio, $\log(R_a/(1-R_a)) - \log(R_p/(1-R_p))$.

(iii) Using the Bayesian bootstrap, estimate the variability of RR, $R_a - R_p$, and $\log(R_a/(1-R_a)) - \log(R_p/(1-R_p))$.

15.9. Let $f_i$ and $g_i$ be the frequency/probability of the observation $X_i$ in an ordinary/Bayesian bootstrap resample from $X_1,\dots,X_n$. Prove that $\mathbb{E} f_i = \mathbb{E} g_i = 1/n$, i.e., the expected probability distribution is discrete uniform, that $\mathrm{Var}\, f_i = \frac{n+1}{n}\,\mathrm{Var}\, g_i = (n-1)/n^3$, and that for $i \ne j$, $\mathrm{Corr}(f_i,f_j) = \mathrm{Corr}(g_i,g_j) = -1/(n-1)$.

REFERENCES

Davison, A. C., and Hinkley, D. V. (1997), Bootstrap Methods and Their Applications, Boston: Cambridge University Press.
Chernick, M. R. (1999), Bootstrap Methods - A Practitioner's Guide, New York: Wiley.
Efron, B., and Tibshirani, R. J. (1993), An Introduction to the Bootstrap, Boca Raton, FL: CRC Press.
Efron, B. (1979), "Bootstrap Methods: Another Look at the Jackknife," Annals of Statistics, 7, 1-26.
Fisher, R. A. (1935), The Design of Experiments, New York: Hafner.
Good, P. I. (2000), Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses, 2nd ed., New York: Springer Verlag.
Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J., and Ostrowski, E. (1994), A Handbook of Small Datasets, New York: Chapman & Hall.
Hendy, M. F., and Charles, J. A. (1970), "The Production Techniques, Silver Content, and Circulation History of the Twelfth-Century Byzantine Trachy," Archaeometry, 12, 13-21.
Mahalanobis, P. C. (1946), "On Large-Scale Sample Surveys," Philosophical Transactions of the Royal Society of London, Ser. B, 231, 329-451.
Pitman, E. J. G. (1937), "Significance Tests Which May Be Applied to Samples from Any Population," Royal Statistical Society Supplement, 4, 119-130 and 225-232 (parts I and II).
Quenouille, M. H. (1949), "Approximate Tests of Correlation in Time Series," Journal of the Royal Statistical Society, Ser. B, 11, 68-84.
Raspe, R. E. (1785), The Travels and Surprising Adventures of Baron Munchausen, London: Trubner, 1859 [1st Ed. 1785].
Rubin, D. (1981), "The Bayesian Bootstrap," Annals of Statistics, 9, 130-134.

16 EM Algorithm

Insanity is doing the same thing over and over again and expecting different results.
Albert Einstein

The Expectation-Maximization (EM) algorithm is a broadly applicable statistical technique for maximizing complex likelihoods while handling problems with incomplete data. Within each iteration of the algorithm, two steps are performed: (i) the E-Step, consisting of projecting an appropriate functional containing the augmented data on the space of the original, incomplete data, and (ii) the M-Step, consisting of maximizing the functional.

The name EM algorithm was coined by Dempster, Laird, and Rubin (1977) in their fundamental paper, referred to here as the DLR paper. But as is usually the case, if one comes to a smart idea, one may be sure that other smart guys in history had already thought about it. Long before, McKendrick (1926) and Healy and Westmacott (1956) proposed iterative methods that are examples of the EM algorithm. In fact, before the DLR paper appeared in 1977, dozens of papers proposing various iterative solvers were essentially applying the EM Algorithm in some form.

However, the DLR paper was the first to formally recognize these separate algorithms as having the same fundamental underpinnings, so perhaps their 1977 paper prevented further reinventions of the same basic math tool. While the algorithm is not guaranteed to converge in every type of problem (as mistakenly claimed by DLR), Wu (1983) showed that convergence is guaranteed if the densities making up the full data belong to the exponential family.
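This does not prevent the EM method from being helpful in nonparametric problems; Tsai and Crowley (1985) first applied it to a general nonparametric setting, and numerous applications have appeared since.

16.0.1 Definition

Let $Y$ be a random vector corresponding to the observed data $y$ and having a postulated PDF $f(y,\psi)$, where $\psi = (\psi_1,\dots,\psi_d)$ is a vector of unknown parameters. Let $x$ be a vector of augmented (so called complete) data, and let $z$ be the missing data that completes $x$, so that $x = [y, z]$.

Denote by $g_c(x,\psi)$ the PDF of the random vector corresponding to the complete data set $x$. The log-likelihood for $\psi$, if $x$ were fully observed, would be

$$\log L_c(\psi) = \log g_c(x,\psi).$$

The incomplete data vector $y$ comes from the "incomplete" sample space $\mathcal{Y}$. There is a many-to-one mapping from the complete sample space $\mathcal{X}$ to the incomplete sample space $\mathcal{Y}$. Thus, for $x \in \mathcal{X}$, one can uniquely find the "incomplete" $y = y(x) \in \mathcal{Y}$. Also, the incomplete pdf can be found by properly integrating out the complete pdf,

$$g(y,\psi) = \int_{\mathcal{X}(y)} g_c(x,\psi)\,dx,$$

where $\mathcal{X}(y)$ is the subset of $\mathcal{X}$ constrained by the relation $y = y(x)$.

Let $\psi^{(0)}$ be some initial value for $\psi$. At the $k$th step, the EM algorithm performs the following two steps:

E-Step. Calculate

$$Q(\psi,\psi^{(k)}) = \mathbb{E}_{\psi^{(k)}}\{\log L_c(\psi) \mid y\}.$$

M-Step. Choose any value $\psi^{(k+1)}$ that maximizes $Q(\psi,\psi^{(k)})$, that is,

$$Q(\psi^{(k+1)},\psi^{(k)}) \ge Q(\psi,\psi^{(k)}) \quad \text{for all } \psi.$$

The E and M steps are alternated until the difference $L(\psi^{(k+1)}) - L(\psi^{(k)})$ becomes small in absolute value.

Next we illustrate the EM algorithm with a famous example first considered by Fisher and Balmukand (1928). It is also discussed in Rao (1973), and later by McLachlan and Krishnan (1997) and Slatkin and Excoffier (1996).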
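To make the alternation of the two steps and the stopping rule concrete, here is a minimal sketch of the loop. It is not the book's emexample.m. It borrows the counts and the E- and M-step formulas derived for Fisher's example in Section 16.1; the starting value, the tolerance `tol`, the iteration cap `maxit`, and all variable names are assumptions made for illustration.

```matlab
% Sketch of the generic EM loop, using the linkage counts from Section 16.1.
y = [125 18 20 34];                 % observed counts (y1, y2, y3, y4)
psi = 0.5;                          % starting value psi^(0), chosen arbitrarily
% observed-data log-likelihood, up to an additive constant
loglik = @(p) y(1)*log(2+p) + (y(2)+y(3))*log(1-p) + y(4)*log(p);
tol = 1e-10; maxit = 100;           % stopping rule (assumed values)
for k = 1:maxit
    % E-Step: expected part of y1 that falls in the cell with probability psi/4
    y12 = y(1) * (psi/4) / (1/2 + psi/4);
    % M-Step: maximizer of the expected complete log-likelihood
    psinew = (y12 + y(4)) / (y12 + y(2) + y(3) + y(4));
    done = abs(loglik(psinew) - loglik(psi)) < tol;
    psi = psinew;
    if done, break; end             % stop when the log-likelihood stabilizes
end
psi
```

The loop stops on the change in the observed-data log-likelihood, as in the definition above; stopping on the change in `psi` itself would be an equally reasonable choice for this one-parameter problem.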
16.1 FISHER'S EXAMPLE

The following genetics example was recognized as an application of the EM algorithm by Dempster et al. (1977). The description provided here essentially follows a lecture by Terry Speed of UC at Berkeley. In basic genetics terminology, suppose there are two linked bi-allelic loci, A and B, with alleles A and a, and B and b, respectively, where A is dominant over a and B is dominant over b. A double heterozygote AaBb will produce gametes of four types: AB, Ab, aB and ab. As the loci are linked, the types AB and ab will appear with a frequency different from that of Ab and aB, say $1-r$ and $r$, respectively, in males, and $1-r'$ and $r'$, respectively, in females.

Here we suppose that the parental origin of these heterozygotes is from the mating AABB × aabb, so that $r$ and $r'$ are the male and female recombination rates between the two loci. The problem is to estimate $r$ and $r'$, if possible, from the offspring of selfed double heterozygotes. Because gametes AB, Ab, aB and ab are produced in proportions $(1-r)/2$, $r/2$, $r/2$ and $(1-r)/2$, respectively, by the male parent, and $(1-r')/2$, $r'/2$, $r'/2$ and $(1-r')/2$, respectively, by the female parent, zygotes with genotypes AABB, AaBB, ... etc., are produced with frequencies $(1-r)(1-r')/4$, $(1-r)r'/4$, etc.

The problem here is this: although there are 16 distinct offspring genotypes, taking parental origin into account, the dominance relations imply that we only observe 4 distinct phenotypes, which we denote by A*B*, A*b*, a*B* and a*b*. Here A* (respectively B*) denotes the dominant, while a* (respectively b*) denotes the recessive phenotype determined by the alleles at A (respectively B).

Thus individuals with genotypes AABB, AaBB, AABb or AaBb (which account for 9/16 of the gametic combinations) exhibit the phenotype A*B*, i.e., the dominant alternative in both characters, while those with genotypes AAbb or Aabb (3/16) exhibit the phenotype A*b*, those with genotypes aaBB and aaBb (3/16) exhibit the phenotype a*B*, and finally the double recessive aabb (1/16) exhibits the phenotype a*b*. It is a slightly surprising fact that the probabilities of the four phenotypic classes are definable in terms of the parameter $\psi = (1-r)(1-r')$, as follows: a*b* has probability $\psi/4$ (easy to see), a*B* and A*b* both have probabilities $(1-\psi)/4$, while A*B* has the rest of the probability, which is $(2+\psi)/4$.

Now suppose we have a random sample of $n$ offspring from the selfing of our double heterozygote. The 4 phenotypic classes will be represented roughly in proportion to their theoretical probabilities, their joint distribution being multinomial

$$\mathcal{M}n\left(n;\ \frac{2+\psi}{4},\ \frac{1-\psi}{4},\ \frac{1-\psi}{4},\ \frac{\psi}{4}\right). \qquad (16.1)$$

Note that here neither $r$ nor $r'$ will be separately estimable from these data, but only the product $(1-r)(1-r')$. Because we know that $r \le 1/2$ and $r' \le 1/2$, it follows that $\psi \ge 1/4$.

How do we estimate $\psi$? Fisher and Balmukand listed a variety of methods that were in the literature at the time, and compared them with maximum likelihood, which is the method of choice in problems like this. We describe a variant on their approach to illustrate the EM algorithm.

Let $y = (125, 18, 20, 34)$ be a realization of the vector $y = (y_1,y_2,y_3,y_4)$ believed to be coming from the multinomial distribution given in (16.1). The probability mass function, given the data, is

$$g(y,\psi) = \frac{n!}{y_1!\,y_2!\,y_3!\,y_4!}\,(1/2+\psi/4)^{y_1}(1/4-\psi/4)^{y_2+y_3}(\psi/4)^{y_4}.$$

The log likelihood, after omitting an additive term not containing $\psi$, is

$$\log L(\psi) = y_1\log(2+\psi) + (y_2+y_3)\log(1-\psi) + y_4\log(\psi).$$

By differentiating with respect to $\psi$, one gets

$$\partial \log L(\psi)/\partial\psi = \frac{y_1}{2+\psi} - \frac{y_2+y_3}{1-\psi} + \frac{y_4}{\psi}.$$

The equation $\partial \log L(\psi)/\partial\psi = 0$ reduces to the quadratic $197\psi^2 - 15\psi - 68 = 0$, and its admissible solution is $\hat{\psi} = (15+\sqrt{53809})/394 \approx 0.626821$.

Now assume that instead of the original value $y_1$ the counts $y_{11}$ and $y_{12}$, such that $y_{11}+y_{12} = y_1$, could be observed, and that their probabilities are $1/2$ and $\psi/4$, respectively. The complete data can be defined as $x = (y_{11},y_{12},y_2,y_3,y_4)$. The probability mass function of the incomplete data $y$ is $g(y,\psi) = \sum g_c(x,\psi)$, where

$$g_c(x,\psi) = c(x)\,(1/2)^{y_{11}}(\psi/4)^{y_{12}}(1/4-\psi/4)^{y_2+y_3}(\psi/4)^{y_4},$$

$c(x)$ is free of $\psi$, and the summation is taken over all values of $x$ for which $y_{11}+y_{12} = y_1$.

The complete log likelihood is

$$\log L_c(\psi) = (y_{12}+y_4)\log(\psi) + (y_2+y_3)\log(1-\psi). \qquad (16.2)$$

Our goal is to find the conditional expectation of $\log L_c(\psi)$ given $y$, using the starting point $\psi^{(0)}$,

$$Q(\psi,\psi^{(0)}) = \mathbb{E}_{\psi^{(0)}}\{\log L_c(\psi) \mid y\}.$$

As $\log L_c$ is a linear function in $y_{11}$ and $y_{12}$, the E-Step is done simply by replacing $y_{11}$ and $y_{12}$ by their conditional expectations, given $y$. If $Y_{11}$ is the random variable corresponding to $y_{11}$, it is easy to see that

$$Y_{11} \sim \mathcal{B}in\left(y_1,\ \frac{1/2}{1/2+\psi^{(0)}/4}\right),$$
so that the conditional expectation of $Y_{11}$ given $y_1$ is

$$y_{11}^{(0)} = \frac{\frac{1}{2}\,y_1}{1/2+\psi^{(0)}/4}.$$

Of course, $y_{12}^{(0)} = y_1 - y_{11}^{(0)}$. This completes the E-Step part.

In the M-Step one chooses $\psi^{(1)}$ so that $Q(\psi,\psi^{(0)})$ is maximized. After replacing $y_{11}$ and $y_{12}$ by their conditional expectations $y_{11}^{(0)}$ and $y_{12}^{(0)}$ in the Q-function, the maximum is obtained at

$$\psi^{(1)} = \frac{y_{12}^{(0)}+y_4}{y_{12}^{(0)}+y_2+y_3+y_4} = \frac{y_{12}^{(0)}+y_4}{n-y_{11}^{(0)}}.$$

The EM Algorithm is composed of alternating these two steps. At iteration $k$ we have

$$\psi^{(k+1)} = \frac{y_{12}^{(k)}+y_4}{n-y_{11}^{(k)}},$$

where $y_{11}^{(k)} = \frac{1}{2}\,y_1/(1/2+\psi^{(k)}/4)$ and $y_{12}^{(k)} = y_1 - y_{11}^{(k)}$. To see how the EM algorithm computes the MLE for this problem, see the MATLAB function emexample.m.

16.2 MIXTURES

Recall from Chapter 2 that mixtures are compound distributions of the form $F(x) = \int F(x \mid t)\,dG(t)$. The CDF $G(t)$ serves as a mixing distribution on the kernel distribution $F(x \mid t)$. Recognizing and estimating mixtures of distributions is an important task in data analysis. Pattern recognition, data mining and other modern statistical tasks often call for mixture estimation.

For example, suppose an industrial process produces machine parts with lifetime distribution $F_1$, but a small proportion of the parts (say, $\omega$) are defective and have CDF $F_2 \gg F_1$. If we cannot sort out the good ones from the defective ones, the lifetime of a randomly chosen part is

$$F(x) = (1-\omega)F_1(x) + \omega F_2(x).$$

This is a simple two-point mixture where the mixing distribution has two discrete points of positive mass. With (finite) discrete mixtures like this, the probability points of $G$ serve as weights for the kernel distribution. In the nonparametric likelihood, we see immediately how difficult it is to solve for the MLE in the presence of the weight $\omega$, especially if $\omega$ is unknown.

Suppose we want to estimate the weights of a fixed number $k$ of fully known
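distributions. We illustrate the EM approach, which introduces unobserved indicators with the goal of simplifying the likelihood. The weights are estimated by maximum likelihood. Assume that a sample $X_1, X_2, \dots, X_n$ comes from the mixture

$$f(x,\omega) = \sum_{j=1}^{k}\omega_j f_j(x),$$

where $f_1,\dots,f_k$ are continuous and the weights $0 \le \omega_j \le 1$ are unknown and constitute a $(k-1)$-dimensional vector $\omega = (\omega_1,\dots,\omega_{k-1})$, with $\omega_k = 1-\omega_1-\dots-\omega_{k-1}$. The class-densities $f_j(x)$ are fully specified.

Even in this simplest case, when $f_1,\dots,f_k$ are given and the only parameters are the weights $\omega$, the log-likelihood assumes a complicated form,

$$\sum_{i=1}^{n}\log f(x_i,\omega) = \sum_{i=1}^{n}\log\left(\sum_{j=1}^{k}\omega_j f_j(x_i)\right).$$

The derivatives with respect to $\omega_j$ lead to a system of equations that is not solvable in closed form.

Here is a situation where the EM Algorithm can be applied with a little creative foresight. Augment the data $x = (x_1,\dots,x_n)$ by an unobservable matrix $z = (z_{ij},\ i=1,\dots,n;\ j=1,\dots,k)$. The values $z_{ij}$ are indicators, defined as

$$z_{ij} = \begin{cases} 1, & x_i \text{ from } f_j \\ 0, & \text{otherwise.} \end{cases}$$

The unobservable matrix $z$ (our "missing value") tells us (in an oracular fashion) where the $i$th observation $x_i$ comes from. Note that each row of $z$ contains a single 1 and $k-1$ 0's. With the augmented data $(x, z)$, the (complete) likelihood takes quite a simple form,

$$\prod_{i=1}^{n}\prod_{j=1}^{k}\left(\omega_j f_j(x_i)\right)^{z_{ij}}.$$

The complete log-likelihood is simply

$$\log L_c(\omega) = \sum_{i=1}^{n}\sum_{j=1}^{k} z_{ij}\log\omega_j + C,$$

where $C = \sum_i\sum_j z_{ij}\log f_j(x_i)$ is free of $\omega$. This is easily solved.

Assume that the $m$th iteration of the weight estimate, $\omega^{(m)}$, is already obtained. The $m$th E-Step is

$$\mathbb{E}_{\omega^{(m)}}(z_{ij} \mid x) = P_{\omega^{(m)}}(z_{ij}=1 \mid x) = z_{ij}^{(m)},$$

where $z_{ij}^{(m)}$ is the posterior probability of the $i$th observation coming from the $j$th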
mixture-component, $f_j$, in the iterative step $m$,

$$z_{ij}^{(m)} = \frac{\omega_j^{(m)} f_j(x_i)}{f(x_i,\omega^{(m)})}.$$

Because $\log L_c(\omega)$ is linear in $z_{ij}$, $Q(\omega,\omega^{(m)})$ is simply $\sum_{i=1}^{n}\sum_{j=1}^{k} z_{ij}^{(m)}\log\omega_j + C$. The subsequent M-Step is simple: $Q(\omega,\omega^{(m)})$ is maximized by

$$\omega_j^{(m+1)} = \frac{\sum_{i=1}^{n} z_{ij}^{(m)}}{n}.$$

The MATLAB script (mixture_cla.m) illustrates the algorithm above. A sample of size 150 is generated from the mixture $f(x) = 0.5\,\mathcal{N}(-5, 2^2) + 0.3\,\mathcal{N}(0, 0.5^2) + 0.2\,\mathcal{N}(2, 1)$. The mixing weights are estimated by the EM algorithm. $M = 20$ iterations of the EM algorithm yielded $\hat{\omega} = (0.4977, 0.2732, 0.2290)$. Figure 16.1 gives the histogram of the data, the theoretical mixture and the EM estimate.

Fig. 16.1 Observations from the $0.5\,\mathcal{N}(-5, 2^2) + 0.3\,\mathcal{N}(0, 0.5^2) + 0.2\,\mathcal{N}(2, 1)$ mixture (histogram), the mixture (dotted line) and the EM estimated mixture (solid line).

Example 16.1 As an example of a specific mixture of distributions we consider an application of the EM algorithm in the so called Zero Inflated Poisson (ZIP) model. In ZIP models the observations come from two populations, one in which all values are identically equal to 0 and the other Poisson $\mathcal{P}(\lambda)$. The "zero" population is selected with probability $\xi$, and the Poisson population with complementary probability $1-\xi$. Given the data, both $\lambda$ and $\xi$ are to be estimated. To illustrate the EM algorithm in fitting ZIP models, we consider a data set (Thisted, 1988) on the distribution of the number of children in a sample of $n = 4075$ widows, given in Table 16.20.

Table 16.20 Frequency Distribution of the Number of Children Among 4075 Widows

Number of Children (number)   0     1    2    3    4   5   6
Number of Widows (freq)       3062  587  284  103  33  4   2

At first glance the Poisson model for this data seems to be appropriate; however, the sample mean and variance are quite different (theoretically, in Poisson models they are the same).

>> number = 0:6;   % number of children
>> freqs = [3062 587 284 103 33 4 2];
>> n = sum(freqs)
>> sum(freqs .* number)/n   % sample mean
ans = 0.3995
>> sum(freqs .* (number-0.3995).^2)/(n-1)   % sample variance
ans = 0.6626

This indicates the presence of over-dispersion, and the ZIP model can account for the apparent excess of zeros. The ZIP model can be formalized as

$$P(X=0) = \xi + (1-\xi)e^{-\lambda}, \qquad P(X=i) = (1-\xi)\frac{\lambda^i}{i!}e^{-\lambda},\quad i=1,2,\dots,$$

and the estimation of $\xi$ and $\lambda$ is done by the EM algorithm. In the MATLAB code below, newn00 is the current estimate of the number of zeros that come from the "zero" population (E-Step), from which newxi and newlambda are then updated (M-Step).

>> newxi = 3/4; newlambda = 1/4;   % initial values
>> newn00s = []; newxis = []; newlambdas = [];
>> for i = 1:20
     newn00 = freqs(1) * newxi/(newxi + ...
        (1-newxi)*exp(-newlambda));
     newxi = newn00/n;
     newlambda = sum((1:6) .* freqs(2:7))/(n - newn00);
     % collect the values in three sequences
     newn00s = [newn00s newn00]; newxis = [newxis newxi];
     newlambdas = [newlambdas newlambda];
   end

Table 16.21 gives the partial output of the MATLAB program. The values for newxi, newlambda, and newn00 will stabilize after several iteration steps.

Table 16.21 Some of the Twenty Steps in the EM Implementation of ZIP Modeling on Widow Data

Step   newxi    newlambda   newn00
0      3/4      1/4         2430.9
1      0.5965   0.9902      2447.2
2      0.6005   1.0001      2460.1
3      0.6037   1.0081      2470.2
18     0.6149   1.0372      2505.6
19     0.6149   1.0373      2505.8
20     0.6149   1.0374      2505.9

16.3 EM AND ORDER STATISTICS

When applying nonparametric maximum likelihood to data that contain (independent) order statistics, the EM Algorithm can be applied by assuming that with the observed order statistic $X_{i:k}$ (the $i$th smallest observation from an i.i.d. sample of $k$), there are associated $k-1$ missing values: $i-1$ values smaller than $X_{i:k}$ and $k-i$ values that are larger. Kvam and Samaniego (1994) exploited this opportunity to use the EM for finding the nonparametric MLE for i.i.d. component lifetimes based on observing only k-out-of-n system lifetimes. Recall that a k-out-of-n system needs $k$ or more working components to operate, and fails after $n-k+1$ components fail; hence the system lifetime is equivalent to $X_{n-k+1:n}$.

Suppose we observe independent order statistics $X_{r_i:k_i}$, $i=1,\dots,n$, where the unordered values are independently generated from $F$. When $F$ is absolutely continuous, the density for $X_{r_i:k_i}$ is expressed as

$$f_{r_i:k_i}(x) = r_i\binom{k_i}{r_i}F(x)^{r_i-1}(1-F(x))^{k_i-r_i}f(x).$$

For simplicity, let $k_i = k$. In this application, we assign the complete data to be $X_i = \{X_{i1},\dots,X_{ik},Z_i\}$, $i=1,\dots,n$, where $Z_i$ is defined as the rank of the value observed from $X_i$. The observed data can be written as $Y_i = \{W_i, Z_i\}$, where $W_i$ is the $Z_i$th smallest observation from $X_i$.

With the complete data, the MLE for $F(x)$ is the EDF, which we will write as $N(x)/(nk)$, where $N(x) = \sum_i\sum_j \mathbf{1}(X_{ij} \le x)$. This makes the M-Step simple, but for the E-Step, $N$ is estimated through the log-likelihood. For example, if $Z_i = z$, we observe $W_i$ distributed as $X_{z:k}$. If $W_i \le x$, out of the subgroup of size $k$ from which $W_i$ was measured,

$$z + (k-z)\frac{F(x)-F(W_i)}{1-F(W_i)}$$

are expected to be less than or equal to $x$. On the other hand, if $W_i > x$, we know $k-z+1$ elements from $X_i$ are larger than $x$, and
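$$(z-1)\frac{F(x)}{F(W_i)}$$

are expected in $(-\infty, x]$.

The E-Step is completed by summing all of these expected counts out of the complete sample of $nk$, based on the most recent estimator of $F$ from the M-Step. Then, if $F^{(j)}$ represents our estimate of $F$ after $j$ iterations of the EM Algorithm, it is updated as

$$F^{(j+1)}(x) = \frac{1}{nk}\sum_{i=1}^{n}\left\{\left[Z_i + (k-Z_i)\frac{F^{(j)}(x)-F^{(j)}(W_i)}{1-F^{(j)}(W_i)}\right]\mathbf{1}(W_i \le x) + (Z_i-1)\frac{F^{(j)}(x)}{F^{(j)}(W_i)}\,\mathbf{1}(W_i > x)\right\}. \qquad (16.3)$$

Equation (16.3) essentially joins the two steps of the EM Algorithm together. All that is needed is an initial estimate $F^{(0)}$ to start it off. The observed sample EDF suffices. Because the full likelihood is essentially a multinomial distribution, convergence of $F^{(j)}$ is guaranteed. In general, the speed of convergence is dependent upon the amount of information. Compared to the mixtures application, there is a great amount of missing data here, and convergence is expected to be relatively slow.

16.4 MAP VIA EM

The EM algorithm can be readily adapted to the Bayesian context to maximize the posterior distribution. A maximum of the posterior distribution is the so called MAP (maximum a posteriori) estimator, used widely in Bayesian inference. The benefit of MAP estimators over some other posterior parameters was pointed out on p. 53 of Chapter 4 in the context of Bayesian estimators. The maximum of the posterior $\pi(\psi \mid y)$, if it exists, coincides with the maximum of the product of the likelihood and prior, $f(y \mid \psi)\pi(\psi)$. In terms of logarithms, finding the MAP estimator amounts to maximizing

$$\log\pi(\psi \mid y) = \log L(\psi) + \log\pi(\psi).$$

The EM algorithm can be readily implemented as follows:

E-Step. At the $(k+1)$st iteration calculate

$$\mathbb{E}_{\psi^{(k)}}\{\log\pi(\psi \mid x) \mid y\} = Q(\psi,\psi^{(k)}) + \log\pi(\psi).$$

The E-Step coincides with the traditional EM algorithm, that is, $Q(\psi,\psi^{(k)})$ has to be calculated.

M-Step. Choose $\psi^{(k+1)}$ to maximize $Q(\psi,\psi^{(k)}) + \log\pi(\psi)$. The M-Step here differs from that in the EM, because the objective function to be maximized with respect to $\psi$ contains an additional term, the logarithm of the prior. However, the presence of this additional term contributes to the concavity of the objective function, thus improving the speed of convergence.

Example 16.2 MAP Solution to Fisher's Genomic Example. Assume that we elicit a $\mathcal{B}e(\nu_1,\nu_2)$ prior on $\psi$,

$$\pi(\psi) = \frac{1}{B(\nu_1,\nu_2)}\,\psi^{\nu_1-1}(1-\psi)^{\nu_2-1}.$$

The beta distribution is a natural conjugate for the missing data distribution, because $y_{12} \sim \mathcal{B}in(y_1, (\psi/4)/(1/2+\psi/4))$. Thus the log-posterior (additive constants ignored) is

$$\log\pi(\psi \mid x) = (y_{12}+y_4+\nu_1-1)\log\psi + (y_2+y_3+\nu_2-1)\log(1-\psi).$$

The E-Step is completed by replacing $y_{12}$ by its conditional expectation $y_1 \times (\psi^{(k)}/4)/(1/2+\psi^{(k)}/4)$. This step is the same as in the standard EM algorithm.

The M-Step, at the $(k+1)$st iteration, is

$$\psi^{(k+1)} = \frac{y_{12}^{(k)}+y_4+\nu_1-1}{y_{12}^{(k)}+y_2+y_3+y_4+\nu_1+\nu_2-2}.$$

When the beta prior coincides with the uniform distribution (that is, when $\nu_1 = \nu_2 = 1$), the MAP and MLE solutions coincide.

16.5 INFECTION PATTERN ESTIMATION

Reilly and Lawlor (1999) applied the EM Algorithm to identify contaminated lots in blood samples. Here the observed data contain the disease exposure history of a person over $k$ points in time. For the $i$th individual, let

$$X_i = \mathbf{1}(i\text{th person infected by end of trial}),$$

where $P_i = P(X_i = 1)$ is the probability that the $i$th person was infected at least once during $k$ exposures to the disease. The exposure history is defined as a vector $y_i = \{y_{i1},\dots,y_{ik}\}$, where

$$y_{ij} = \mathbf{1}(i\text{th person exposed to disease at the } j\text{th time point}).$$

Let $\lambda_j$ be the rate of infection at time point $j$. The probability of not being infected in time point $j$ is $1-y_{ij}\lambda_j$, so we have $P_i = 1-\prod_j(1-y_{ij}\lambda_j)$. The corresponding likelihood for $\lambda = \{\lambda_1,\dots,\lambda_k\}$ from observing $n$ patients is a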
bit daunting:

$$L(\lambda) = \prod_{i=1}^{n} P_i^{x_i}(1-P_i)^{1-x_i} = \prod_{i=1}^{n}\left[1-\prod_{j=1}^{k}(1-y_{ij}\lambda_j)\right]^{x_i}\left[\prod_{j=1}^{k}(1-y_{ij}\lambda_j)\right]^{1-x_i}.$$

The EM Algorithm helps if we assign the unobservable

$$Z_{ij} = \mathbf{1}(\text{person } i \text{ infected at time point } j),$$

where $P(Z_{ij}=1) = \lambda_j$ if $y_{ij} = 1$ and $P(Z_{ij}=1) = 0$ if $y_{ij} = 0$. Averaging over $y_{ij}$, $P(Z_{ij}=1) = y_{ij}\lambda_j$. With $z_{ij}$ in the complete likelihood ($1 \le i \le n$, $1 \le j \le k$), we have the observed data changing to $x_i = \max\{z_{i1},\dots,z_{ik}\}$, and

$$L(\lambda \mid Z) = \prod_{i=1}^{n}\prod_{j=1}^{k}(y_{ij}\lambda_j)^{z_{ij}}(1-y_{ij}\lambda_j)^{1-z_{ij}},$$

which has the simple binomial form. For the E-Step, we find $\mathbb{E}(Z_{ij} \mid x_i, \lambda^{(m)})$, where $\lambda^{(m)}$ is the current estimate for $(\lambda_1,\dots,\lambda_k)$ after $m$ iterations of the algorithm. We need only concern ourselves with the case $x_i = 1$, so that

$$\tilde{z}_{ij}^{(m)} = \mathbb{E}(Z_{ij} \mid x_i = 1, \lambda^{(m)}) = \frac{y_{ij}\lambda_j^{(m)}}{1-\prod_{l=1}^{k}(1-y_{il}\lambda_l^{(m)})}.$$

In the M-Step, MLEs for $(\lambda_1,\dots,\lambda_k)$ are updated in iteration $m+1$ from $\lambda_1^{(m)},\dots,\lambda_k^{(m)}$ to

$$\lambda_j^{(m+1)} = \frac{\sum_{i=1}^{n} y_{ij}\,\tilde{z}_{ij}^{(m)}}{\sum_{i=1}^{n} y_{ij}}.$$

16.6 EXERCISES

16.1. Suppose we have data generated from a mixture of two normal distributions with a common known variance. Write a MATLAB script to determine the MLE of the unknown means from an i.i.d. sample from the mixture by using the EM algorithm. Test your program using a sample of ten observations generated from an equal mixture of the two kernels $\mathcal{N}(0,1)$ and $\mathcal{N}(1,1)$.

16.2. The data in the following table come from the mixture of two Poisson random variables: $\mathcal{P}(\lambda_1)$ with probability $\varepsilon$ and $\mathcal{P}(\lambda_2)$ with probability $1-\varepsilon$.

Value   0    1    2    3    4    5    6    7   8   9   10
Freq.   708  947  832  635  427  246  121  51  19  6   1

(i) Develop an EM algorithm for estimating $\varepsilon$, $\lambda_1$, and $\lambda_2$.

(ii) Write a MATLAB program that uses (i) in estimating $\varepsilon$, $\lambda_1$, and $\lambda_2$ for the data from the table.

16.3. The following data give the numbers of occupants in 1768 cars observed on a road junction in Jakarta, Indonesia, during a certain time period on a weekday morning.

Number of occupants   1    2    3    4   5   6  7
Number of cars        897  540  223  85  17  5  1

The proposed model for the number of occupants $X$ is truncated Poisson (TP), defined as

$$P(X=i) = \frac{\lambda^i\exp\{-\lambda\}}{(1-\exp\{-\lambda\})\,i!}, \quad i=1,2,\dots$$

(i) Write down the likelihood (or the log-likelihood) function. Is it straightforward to find the MLE of $\lambda$ by maximizing the likelihood or log-likelihood directly?

(ii) Develop an EM algorithm for approximating the MLE of $\lambda$. Hint: Assume that the missing data is $i_0$, the number of cases when $X = 0$, so with the complete data the model is Poisson, $\mathcal{P}(\lambda)$. Estimate $\lambda$ from the complete data. Update $i_0$ given the estimator of $\lambda$.

(iii) Write a MATLAB program that will estimate the MLE of $\lambda$ for the Jakarta cars data using the EM procedure from (ii).

16.4. Consider the problem of right censoring in lifetime measurements in Chapter 10. Set up the EM algorithm for solving the nonparametric MLE for a sample of possibly right-censored values $X_1,\dots,X_n$.

16.5. Write a MATLAB program that will approximate the MAP estimator in Fisher's problem (Example 16.2), if the prior on $\psi$ is $\mathcal{B}e(2,2)$. Compare the MAP and MLE solutions.

REFERENCES

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm" (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1-38.
Fisher, R. A., and Balmukand, B. (1928), "The Estimation of Linkage from the Offspring of Selfed Heterozygotes," Journal of Genetics, 20, 79-92.
Healy, M. J. R., and Westmacott, M. H. (1956), "Missing Values in Experiments Analysed on Automatic Computers," Applied Statistics, 5, 203-206.
Kvam, P. H., and Samaniego, F. J. (1994), "Nonparametric Maximum Likelihood Estimation Based on Ranked Set Samples," Journal of the American Statistical Association, 89, 526-537.
McKendrick, A. G. (1926), "Applications of Mathematics to Medical Problems," Proceedings of the Edinburgh Mathematical Society, 44, 98-130.
McLachlan, G. J., and Krishnan, T. (1997), The EM Algorithm and Extensions, New York: Wiley.
Rao, C. R. (1973), Linear Statistical Inference and its Applications, 2nd ed., New York: Wiley.
Reilly, M., and Lawlor, E. (1999), "A Likelihood Method of Identifying Contaminated Lots of Blood Product," International Journal of Epidemiology, 28, 787-792.
Slatkin, M., and Excoffier, L. (1996), "Testing for Linkage Disequilibrium in Genotypic Data Using the Expectation-Maximization Algorithm," Heredity, 76, 377-383.
Tsai, W. Y., and Crowley, J. (1985), "A Large Sample Study of Generalized Maximum Likelihood Estimators from Incomplete Data via Self-Consistency," Annals of Statistics, 13, 1317-1334.
Thisted, R. A. (1988), Elements of Statistical Computing: Numerical Computation, New York: Chapman & Hall.
Wu, C. F. J. (1983), "On the Convergence Properties of the EM Algorithm," Annals of Statistics, 11, 95-103.
17 Statistical Learning

Learning is not compulsory... neither is survival.
W. Edwards Deming (1900-1993)

A general type of artificial intelligence, called machine learning, refers to techniques that sift through data and find patterns that lead to optimal decision rules, such as classification rules. In a way, these techniques allow computers to "learn" from the data, adapting as trends in the data become more clearly understood with the computer algorithms. Statistical learning pertains to the data analysis in this treatment, but the field of machine learning goes well beyond statistics and into algorithmic complexity of computational methods.

In business and finance, machine learning is used to search through huge amounts of data to find structure and pattern, and this is called data mining. In engineering, these methods are developed for pattern recognition, a term for classifying images into predetermined groups based on the study of statistical classification rules that statisticians refer to as discriminant analysis. In electrical engineering, specifically, the study of signal processing uses statistical learning techniques to analyze signals from sounds, ...

... (17.1)

for some value $b$, which is a function of cost. This is called Fisher's Linear Discrimination Function (LDF) because, with the equal variance assumption, the rule is linear in $x$. The LDF was developed using normal distributions, but this linear rule can also be derived using a minimal squared-error approach. This is true, you can recall, for estimating parameters in multiple linear regression as well.

If the variances are not the same, the optimization procedure is repeated with extra MLEs for the covariance matrices, and the rule is quadratic in the inputs and hence called a Quadratic Discriminant Function (QDF). Because the linear rule is overly simplistic for some examples, quadratic classification rules are used to extend the linear rule by including squared values of the predictors. With $k$ predictors in the model, this begets $\binom{k+1}{2}$ additional parameters to estimate. So many parameters in the model can cause obvious problems, even in large data sets.

There have been several studies that have looked into the quality of linear and quadratic classifiers. While these rules work well if the normality assumptions are valid, the performance can be pretty lousy if they are not. There are numerous studies on the LDF and QDF robustness; for example, see Moore (1973), Marks and Dunn (1974), Randles, Bramberg, and Hogg (1978).

17.2.1 Logistic Regression as Classifier

The simple zero-one loss function makes sense in this categorical classification problem. If we relied on the squared error loss (and outputs labeled with zeroes and ones), the estimate for $g$ is not necessarily in $[0,1]$, and even if the large sample properties of the procedure are satisfactory, it will be hard to take such results seriously.

One of the simplest models in the regression framework is the logistic regression model, which serves as a bridge between simple linear regression and statistical classification. Logistic regression, discussed in Chapter 12 in the context of Generalized Linear Models (GLM), applies the linear model to binary response variables, relying on a link function that will allow the linear model to adequately describe probabilities for binary outcomes. Below we will use a simple illustration of how it can be used as a classifier. For a more comprehensive instruction on logistic regression and other models for ordinal data, Agresti's book Categorical Data Analysis serves as an excellent basis.

If we start with the simplest case where $k = 2$ groups, we can arbitrarily assign $g_i = 0$ or $g_i = 1$ for categories $G_0$ and $G_1$. This means we are modeling a binary response function based on the measurements on $x$. If we restrict our attention to a linear model $P(g=1 \mid x) = x'\beta$, we will be saddled with an unrefined model that can estimate probability with a value outside $[0,1]$. To avoid this problem, consider transformations of the linear model such as

(i) logit: $p(x) = P(g=1 \mid x) = \exp(x'\beta)/[1+\exp(x'\beta)]$, so $x'\beta$ is estimating $\ln[p(x)/(1-p(x))]$, which has its range on $\mathbb{R}$.

(ii) probit: $P(g=1 \mid x) = \Phi(x'\beta)$, where $\Phi$ is the standard normal CDF. In this case $x'\beta$ is estimating $\Phi^{-1}(p(x))$.

(iii) log-log: $p(x) = 1-\exp(-\exp(x'\beta))$, so that $x'\beta$ is estimating $\ln[-\ln(1-p(x))]$ on $\mathbb{R}$.

Because the logit transformation is symmetric and has relation to the natural parameter in the GLM context, it is generally the default transformation in this group of three. We focus on the logit link and seek to maximize the likelihood

$$L(\beta) = \prod_{i=1}^{n} p_i(x)^{g_i}(1-p_i(x))^{1-g_i},$$

in terms of $p(x) = \exp(x'\beta)/[1+\exp(x'\beta)]$, to estimate $\beta$ and therefore obtain MLEs for $p(x) = P(g=1 \mid x)$. This likelihood is rather well behaved and can be maximized in a straightforward manner. We use the MATLAB function logist to perform a logistic regression in the example below.

Example 17.1 (Kutner, Nachtsheim, and Neter, 1996) A study of 25 computer programmers aims to predict task success based on the programmers' months of work experience. The MATLAB m-file logist computes simple ordinal logistic regressions:

>> x = [14 29 6 25 18 4 18 12 22 6 30 11 30 5 20 13 9 32 24 13 19 4 28 22 8];
>> y = [0 0 0 1 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 1 0 0 1 1 1];
>> logist(y, x, 1)

Number of iterations
     3
Deviance
   25.4246
   Theta       SE
   3.0585    1.2590
   Beta        SE
   0.1614    0.0650
ans =
   0.1614

Here $\beta = (\beta_0,\beta_1)$ and $\hat{\beta} = (-3.0585, 0.1614)$. The estimated logistic regression function is

$$\hat{p} = \frac{e^{-3.0585+0.1614x}}{1+e^{-3.0585+0.1614x}}.$$

For example, in the case $x_1 = 14$, we have $\hat{p}_1 = 0.31$; i.e., we estimate that there is a 31% chance a programmer with 14 months of experience will successfully complete the project.

In the logistic regression model, if we use $\hat{p}$ as a criterion for classifying observations, the regression serves as a simple linear classification model. If misclassification penalties are the same for each category, $\hat{p} = 1/2$ will be the classifier boundary. For asymmetric loss, the relative costs of the misclassification errors will determine an optimal threshold.

Example 17.2 (Fisher's Iris Data) To illustrate this technique, we use Fisher's Iris data, which is commonly used to show off classification methods. The iris data set contains physical measurements of 150 flowers, 50 for each of three
types of iris (Virginica, Versicolor and Setosa). Iris flowers have three petals and three outer petal-like sepals. Figure 17.1(a) shows a plot of petal length vs. width for Versicolor (circles) and Virginica (plus signs) along with the line that best linearly categorizes them. How is this line determined?

From the logistic function $x'\beta = \ln(p/(1-p))$, $p = 1/2$ represents an observation that is half-way between the Virginica iris and the Versicolor iris. Observations with values of $p < 0.5$ are classified to be Versicolor while those with $p > 0.5$ are classified as Virginica. At $p = 1/2$, $x'\beta = \ln(p/(1-p)) = 0$, and the line is defined by $\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$, which in this case equates to $x_2 = (45.2723 - 5.7545\,x_1)/10.4467$. This line is drawn in Figure 17.1(a).

>> load iris
>> x = [PetalLength, PetalWidth];
>> plot(PetalLength(51:100), PetalWidth(51:100), 'o')
>> hold on
>> x2 = [PetalLength(51:150), PetalWidth(51:150)];
>> fplot('(45.27-5.7*x)/10.4', [3,7])
>> v2 = Variety(51:150);
>> L2 = logist(v2,x2,1);
Number of iterations
8
Deviance
20.5635
Theta      SE
45.2723    13.6117
Beta       SE
5.7545     2.3059
10.4467    3.7556

While this example provides a spiffy illustration of linear classification, most populations are not so easily differentiated, and a linear rule can seem overly simplified and crude. Figure 17.1(b) shows a similar plot of sepal width vs. length. The iris types are not so easily distinguished, and the linear classification does not help us in this example.

In the next parts of this chapter, we will look at "nonparametric" classifying methods that can be used to construct a more flexible, nonlinear classifier.

17.3 NEAREST NEIGHBOR CLASSIFICATION

Recall from Chapter 13 that nearest neighbor methods can be used to create nonparametric regressions by determining the regression curve at $x$ based on explanatory variables $x_i$ that are considered closest to $x$. We will call this a k-nearest neighbor classifier if it considers the $k$ closest points to $x$ (using a majority vote) when constructing the rule at that point.

If we allow $k$ to increase, the estimator eventually uses all of the data to
fit each local response, so the rule is a global one. This leads to a simpler model with low variance. But if the assumptions of the simple model are wrong, high bias will cause the expected mean squared error to explode. On the other hand, if we let $k$ go down to one, the classifier will create minute neighborhoods around each observed $x_i$, revealing nothing from the data that a plot of the data has not already shown us. This is highly suspect as well.

Fig. 17.1 Two types of iris classified according to (a) petal length vs. petal width, and (b) sepal length vs. sepal width. Versicolor = o, Virginica = +.

The best model is likely to be somewhere in between these two extremes. As we allow $k$ to increase, we will receive more smoothness in the classification boundary and more stability in the estimator. With small $k$, we will have a more jagged classification rule, but the rule will be able to identify more interesting nuances of the data. If we use a loss function on the training data to judge which is best, the 1-nearest neighbor model will fit best, because there is no penalty for over-fitting. Once we identify each estimated category (conditional on $X$) as the observed category in the data, there will be no error to report. In this case, it will help to split the data into a training sample and a test sample. Even with the loss function, the idea of local fitting works well with large samples. In fact, as the input sample size $n$ gets larger, the k-nearest neighbor estimator will be consistent as long as $k \to \infty$ and $k/n \to 0$. That is, it will achieve the goals we wanted without the strong model assumptions that come with parametric classification. There is an extra problem using the nonparametric technique, however. If the dimension of $X$ is somewhat large, the amount of data needed to achieve a satisfactory answer from the nearest neighbor grows exponentially.
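The sparsity that comes with higher dimension can be seen in a small simulation. The following is only an illustrative sketch: the sample size, the list of dimensions, and all variable names are assumptions made here, and the average distance to the nearest neighbor is used as a rough stand-in for how "local" a neighborhood can be.

```matlab
% Average nearest-neighbor distance for n uniform points in the unit cube,
% for several dimensions (n and dims are assumed values for illustration).
n    = 100;
dims = [1 2 5 10];
meanNN = zeros(size(dims));
for t = 1:numel(dims)
    X  = rand(n, dims(t));                        % n points in [0,1]^d
    s  = sum(X.^2, 2);                            % squared norms
    D2 = repmat(s, 1, n) + repmat(s', n, 1) - 2*(X*X');   % squared distances
    D2(1:n+1:end) = Inf;                          % ignore distance to itself
    meanNN(t) = mean(sqrt(max(min(D2, [], 2), 0)));  % guard against rounding below 0
end
fprintf('d = %2d: mean nearest-neighbor distance = %.3f\n', [dims; meanNN]);
```

With the sample size held fixed, the nearest neighbor of a point drifts farther away as the dimension grows, so a "neighborhood" of k points stops being local.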
17.3.1 The Curse of Dimensionality

The curse of dimensionality, termed by Bellman (1961), describes the property of data to become sparse if the dimension of the sample space increases. For example, imagine the denseness of a data set with 100 observations distributed uniformly on the unit square. To achieve the same denseness in a 10-dimensional unit hypercube, we would require $10^{20}$ observations.

This is a significant problem for nonparametric classification problems including nearest neighbor classifiers and neural networks. As the dimension of inputs increases, the observations in the training set become relatively sparse. These procedures based on a large number of parameters help to handle complex problems, but must be considered inappropriate for most small or medium sized data sets. In those cases, the linear methods may seem overly simplistic or even crude, but are still preferable to nearest neighbor methods.

17.3.2 Constructing the Nearest Neighbor Classifier

The classification rule is based on the ratio of the nearest-neighbor density estimator. That is, if $x$ is from population $G$, then $P(x \mid G) \approx$ (proportion of observations in the neighborhood around $x$)/(volume of the neighborhood). To classify $x$, select the population corresponding to the largest value of

$$\frac{P(G_i)\,P(x \mid G_i)}{\sum_j P(G_j)\,P(x \mid G_j)}, \quad i = 1, \ldots, k.$$

This simplifies to the nearest neighbor rule; if the neighborhood around $x$ is defined to be the closest $r$ observations, $x$ is classified into the population that is most frequently represented in that neighborhood.

Figure 17.2 shows the output derived from the MATLAB example below. Fifty randomly generated points are classified into one of two groups in v in a partially random way. The nearest neighbor plots reflect three different smoothing conditions of $k$ = 11, 5 and 1. As $k$ gets smaller, the classifier acts more locally, and the rule appears more jagged.

>> y = rand(50,2)
>> v = round(0.3*rand(50,1)+0.3*y(:,1)+0.4*y(:,2));
>> n = 100;
>> x = nby2(n);
>> m = n^2;
>> for i = 1:m
      w(i,1) = nearneighbor(x(i,1:2),y,4,v);
   end
>> rr = find(w==1);
>> x2 = x(rr,:);
>> plot(x2(:,1),x2(:,2),'.')

Fig. 17.2 Nearest neighbor classification of 50 observations plotted in (a) using neighborhood sizes of (b) 11, (c) 5, (d) 1.
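The majority-vote rule of Section 17.3.2 can be sketched in a few lines of base MATLAB. This is only an illustration and not the nearneighbor m-file used above: the toy data, the variable names (train, labels, query, k), and the choice of Euclidean distance as the measure of closeness are all assumptions.

```matlab
% Toy training set (hypothetical data): two groups labeled 0 and 1.
train  = [0.10 0.20; 0.20 0.10; 0.30 0.30; 0.15 0.35; ...
          0.80 0.90; 0.90 0.80; 0.70 0.70; 0.85 0.65];
labels = [0; 0; 0; 0; 1; 1; 1; 1];
query  = [0.25 0.25; 0.75 0.80; 0.45 0.50];   % points to classify
k      = 3;                                   % neighborhood size (odd avoids ties)

pred = zeros(size(query,1), 1);
for i = 1:size(query,1)
    % Squared Euclidean distance from the i-th query point to each training point.
    d = sum((train - repmat(query(i,:), size(train,1), 1)).^2, 2);
    [~, idx] = sort(d);                 % closest training points first
    votes    = labels(idx(1:k));        % labels of the k nearest neighbors
    pred(i)  = mean(votes) > 0.5;       % majority vote between groups 0 and 1
end
disp(pred')
```

Setting k = 1 makes every training point its own nearest neighbor, which is why the resubstitution error of the 1-nearest neighbor rule is zero and a separate test sample is needed to compare choices of k.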
17.4 NEURAL NETWORKS

Despite what your detractors say, you have a remarkable brain. Even with the increasing speed of computer processing, the much slower human brain has a surprising ability to sort through gobs of information, disseminate some of its peculiarities and make a correct classification, often several times faster than a computer. When a familiar face appears to you around a street corner, your brain has several processes working in parallel to identify this person you see, using past experience to gauge your expectation (you might not believe your eyes, for example, if you saw Elvis appear around the corner) along with all the sensory data from what you see, hear, or even smell.

The computer is at a disadvantage in this contest because despite all of the speed and memory available, the static processes it uses cannot parse through the same amount of information in an efficient manner. It cannot adapt and learn as the human brain does. Instead, the digital processor goes through sequential algorithms, almost all of them being a waste of CPU time, rather than traversing a relatively few complex neural pathways set up by our past experiences.

Rosenblatt (1962) developed a simple learning algorithm he named the perceptron, which consists of an input layer of several nodes that is completely connected to nodes of an output layer. The perceptron is overly simplistic and has numerous shortcomings, but it also represents the first neural network. By extending this to a two-step network which includes a hidden layer of nodes between the inputs and outputs, the network overcomes most of the disadvantages of the simpler map. Figure 17.3 shows a simple feed-forward neural network, that is, the information travels in the direction from input to output.

Fig. 17.3 Basic structure of feed-forward neural network.

The square nodes in Figure 17.3 represent neurons, and the connections
(or edges) between them represent the synapses of the brain. Each connection is weighted, and this weight can be interpreted as the relative strength in the connection between the nodes. Even though the figure shows three layers, this is considered a two-layer network because the input layer, which does not process data or perform calculations, is not counted.

Each node in the hidden layers is characterized by an activation function which can be as simple as an indicator function (the binary output is similar to a computer) or have more complex nonlinear forms. A simple activation function would represent a node that would react when the weighted input surpassed some fixed threshold.

The neural network essentially looks at repeated examples (or input observations) and recalls patterns appearing in the inputs along with each subsequent response. We want to train the network to find this relationship between inputs and outputs using supervised learning. A key in training the network is to find the weights to go along with the activation functions that lead to supervised learning. To determine weights, we use a back-propagation algorithm.

17.4.1 Back-propagation

Before the neural network experiences any input data, the weights for the nodes are essentially random (noninformative). So at this point, the network functions like the scattered brain of a college freshman who has celebrated his first weekend on campus by drinking way too much beer.

The feed-forward neural network is represented by

$$n_I \text{ input nodes} \;\Longrightarrow\; n_H \text{ hidden nodes} \;\Longrightarrow\; n_O \text{ output nodes}.$$

With an input vector $X = (x_1, \ldots, x_{n_I})$, each of the $n_I$ input nodes codes the data and "fires" a signal across the edges to the hidden nodes. At each of the $n_H$ hidden nodes, this message takes the form of a weighted linear combination from each attribute,

$$\mathcal{H}_j = A(\alpha_{0j} + \alpha_{1j} x_1 + \cdots + \alpha_{n_I j} x_{n_I}), \quad j = 1, \ldots, n_H, \tag{17.2}$$

where $A$ is the activation function, which is usually chosen to be the sigmoid function

$$A(x) = \frac{1}{1 + e^{-x}}.$$

We will discuss why $A$ is chosen to be a sigmoid later. In the next step, the $n_H$ hidden nodes fire this nonlinear outcome of the activation function to the
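output nodes, each translating the signals as a linear combination

$$O_k = \beta_{0k} + \beta_{1k}\mathcal{H}_1 + \cdots + \beta_{n_H k}\mathcal{H}_{n_H}, \quad k = 1, \ldots, n_O. \tag{17.3}$$

Each output node is a function of the inputs, and through the steps of the neural network, each node is also a function of the weights $\alpha$ and $\beta$. If we observe $X_l = (x_{1l}, \ldots, x_{n_I l})$ with output $g_l(k)$ for $k = 1, \ldots, n_O$, we use the same kind of transformation used in logistic regression to turn the signal at output node $k$ into the estimated response $\hat{g}_l(k)$.

For the training data $\{(X_1, g_1), \ldots, (X_n, g_n)\}$, the classification is compared to the observation's known group, which is then back-propagated across the network, and the network responds (learns) by adjusting weights in the cases where an error in classification occurs. The loss function associated with misclassification can be squared errors, such as

$$SSQ(\alpha, \beta) = \sum_{l=1}^{n} \sum_{k=1}^{n_O} \left(g_l(k) - \hat{g}_l(k)\right)^2, \tag{17.4}$$

where $g_l(k)$ is the actual response of the input $X_l$ for output node $k$ and $\hat{g}_l(k)$ is the estimated response.

Now we look at how those weights are changed in this back-propagation. To minimize the squared error SSQ in (17.4) with respect to the weights $\alpha$ and $\beta$ from both layers of the neural net, we can take partial derivatives (with respect to each weight) to find the direction the weights should go in order to decrease the error. But there are a lot of parameters to estimate: $\alpha_{ij}$, with $1 \le i \le n_I$, $1 \le j \le n_H$, and $\beta_{jk}$, $1 \le j \le n_H$, $1 \le k \le n_O$. It's not helpful to think of this as a parameter set, as if they have their own intrinsic value. If you do, the network looks terribly over-parameterized and unnecessarily complicated. Remember that $\alpha$ and $\beta$ are artificial, and our focus is on the $n$ predicted outcomes instead of estimated parameters. We will do this iteratively using batch learning by updating the network after the entire data set is entered.

Actually, finding the global minimum of SSQ with respect to $\alpha$ and $\beta$ will lead to over-fitting the model, that is, the answer will not represent the true underlying process because it is blindly mimicking every idiosyncrasy of the data. The gradient is expressed here with a constant $\gamma$ called the learning rate:

$$\alpha_{ij}^{(r+1)} = \alpha_{ij}^{(r)} - \gamma \, \frac{\partial\, SSQ}{\partial \alpha_{ij}^{(r)}}, \tag{17.5}$$

$$\beta_{jk}^{(r+1)} = \beta_{jk}^{(r)} - \gamma \, \frac{\partial\, SSQ}{\partial \beta_{jk}^{(r)}}, \tag{17.6}$$

and is solved iteratively with the following back-propagation equations (see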
Chapter 11 of Hastie et al. (2001)) via error variables $a$ and $b$:

$$a_{jl} = A'\!\left(\alpha_{0j} + \sum_{i=1}^{n_I} \alpha_{ij} x_{il}\right) \sum_{k=1}^{n_O} \beta_{jk}\, b_{kl}. \tag{17.7}$$

Obviously, the activation function $A$ must be differentiable. Note that if $A(x)$ is chosen as a binary function such as $I(x \ge 0)$, we end up with a regular linear model from (17.2). The sigmoid function, when scaled as $A_c(x) = A(cx)$, will look like $I(x \ge 0)$ as $c \to \infty$, but the function also has a well-behaved derivative.

In the first step, we use current values of $\alpha$ and $\beta$ to predict outputs from (17.2) and (17.3). In the next step we compute errors $b$ from the output layer, and use (17.7) to compute $a$ from the hidden layer. Instead of batch processing, updates to the gradient can be made sequentially after each observation. In this case, $\gamma$ is not constant, and should decrease to zero as the iterations are repeated (this is why it is called the learning rate).

The hidden layer of the network, along with the nonlinear activation function, gives it the flexibility to learn by creating convex regions for classification that need not be linearly separable, as the simpler linear rules require. One can introduce another hidden layer that in effect can allow nonconvex regions (by combining convex regions together). Applications exist with even more hidden layers, but two hidden layers should be ample for almost every nonlinear classification problem that fits into the neural network framework.

17.4.2 Implementing the Neural Network

Implementing the steps above in a computer algorithm is not simple, nor is it free from potential errors. One popular method for processing through the back-propagation algorithm uses six steps:

1. Assign random values to the weights.
2. Input the first pattern to get outputs to the hidden layer $(\mathcal{H}_1, \ldots, \mathcal{H}_{n_H})$ and output layer $(\hat{g}(1), \ldots, \hat{g}(k))$.
3. Compute the output errors $b$.
4. Compute the hidden layer errors $a$ as a function of $b$.
5. Update the weights using (17.5) and (17.6).
6. Repeat the steps for the next observation.

Computing a neural network from scratch would be challenging for many of us, even if we have a good programming background. In MATLAB,
there are a few modest programs that can be used for classification, such as softmax(X,K,Prior), which implements a feed-forward neural network using a training set X, a vector K for class indexing, and an optional prior argument. Instead of minimizing SSQ in (17.4), softmax assumes that "the outputs are a Poisson process conditional on their sum and calculates the error as the residual deviance." MATLAB also has a Neural Networks Toolbox (see http://www.mathworks.com), which features a graphical user interface (GUI) for creating, training, and running neural networks.

17.4.3 Projection Pursuit

The technique of Projection Pursuit is similar to that of neural networks, as both employ a nonlinear function that is applied only to linear combinations of the input. While the neural network is relatively fixed with a set number of hidden layer nodes (and hence a fixed number of parameters), projection pursuit seems more nonparametric because it uses unspecified functions in its transformations. We will start with a basic model

$$g(X) = \sum_{i=1}^{n_P} \psi_i(\theta_i' X), \tag{17.8}$$

where $n_P$ represents the number of unknown parameter vectors $(\theta_1, \ldots, \theta_{n_P})$. Note that $\theta_i' X$ is the projection of $X$ onto the vector $\theta_i$. If we pursue a value of $\theta_i$ that makes this projection effective, it seems logical enough to call this projection pursuit. The idea of using a linear combination of inputs to uncover structure in the data was first suggested by Kruskal (1969). Friedman and Stuetzle (1981) derived a more formal projection pursuit regression using a multi-step algorithm:

1. Define $\tau_i^{(0)} = g_i$.

2. Maximize the standardized squared errors

$$SSQ^{(j)} = 1 - \frac{\sum_{i=1}^{n} \left(\tau_i^{(j-1)} - \hat{g}^{(j-1)}(w^{(j)\prime} x_i)\right)^2}{\sum_{i=1}^{n} \left(\tau_i^{(j-1)}\right)^2} \tag{17.9}$$

over weights $w^{(j)}$ (under the constraint that $w^{(j)\prime} w^{(j)} = 1$) and $\hat{g}^{(j-1)}$.

3. Update $\tau$ with $\tau_i^{(j)} = \tau_i^{(j-1)} - \hat{g}^{(j-1)}(w^{(j)\prime} x_i)$.

4. Repeat steps 2 and 3 $k$ times, until $SSQ^{(k)} \le \delta$ for some fixed $\delta > 0$.
Once the algorithm finishes, it essentially has given up trying to find other projections, and we complete the projection pursuit estimator as

$$\hat{g}(x) = \sum_{j=1}^{k} \hat{g}^{(j-1)}(w^{(j)\prime} x). \tag{17.10}$$

17.5 BINARY CLASSIFICATION TREES

Binary trees offer a graphical and logical basis for empirical classification. Decisions are made sequentially through a route of branches on a tree; every time a choice is made, the route is split into two directions. Observations that are collected at the same endpoint (node) are classified into the same population. The junctures on the route where a split is made are nonterminal nodes, and terminal nodes denote all the different endpoints where the tree makes a classification. These endpoints are also called the leaves of the tree, and the starting node is called the root.

With the training set $(x_1, g_1), \ldots, (x_n, g_n)$, where $x$ is a vector of $m$ components, splits are based on a single variable of $x$, or possibly a linear combination. This leads to decision rules that are fairly easy to interpret and explain, so binary trees are popular for disseminating information to a broad audience. The phases of tree construction include

• Deciding whether to make the node a terminal node.
• Selection of splits in a nonterminal node.
• Assigning a classification rule at terminal nodes.

This is the essential approach of CART (Classification and Regression Trees). The goal is to produce a simple and effective classification tree without an excess number of nodes.

If we have $k$ populations $G_1, \ldots, G_k$, we will use the frequencies found in the training data to estimate population frequency in the same way we constructed nearest-neighbor classification rules: the proportion of observations in the training set from the $i$th population is $P(G_i) = n_i/n$. Suppose there are $n_i(r)$ observations from $G_i$ that reach node $r$. The probability of such an observation reaching node $r$ is estimated as

$$P_i(r) = P(G_i)\, P(\text{reach node } r \mid G_i) = \frac{n_i}{n} \cdot \frac{n_i(r)}{n_i} = \frac{n_i(r)}{n}.$$

We want to construct a perfectly pure split where we can isolate one or some of the populations into a single node that can be a terminal node (or at least split more easily into one during a later split). Figure 17.4 illustrates a
perfect split of node $r$. This, of course, is not always possible. The quality of a split is measured by an impurity index function

$$\mathcal{I}(r) = \psi(P_1(r), \ldots, P_k(r)),$$

where $\psi$ is nonnegative, symmetric in its arguments, maximized at $(1/k, \ldots, 1/k)$, and minimized at any $k$-vector that has a one and $k-1$ zeroes.

Fig. 17.4 Purifying a tree by splitting.

Several different measures of impurity have been defined for constructing trees. The three most popular impurity measures are cross-entropy, Gini impurity and misclassification impurity:

1. Cross-entropy: $\mathcal{I}(r) = -\sum_{i:\, P_i(r) > 0} P_i(r) \ln[P_i(r)]$.

2. Gini: $\mathcal{I}(r) = \sum_{i \ne j} P_i(r) P_j(r)$.

3. Misclassification: $\mathcal{I}(r) = 1 - \max_i P_i(r)$.

The misclassification impurity represents the minimum probability that the training set observations would be (empirically) misclassified at node $r$. The Gini measure and cross-entropy measure have an analytical advantage over the discrete impurity measure by being differentiable. We will focus on the most popular index of the three, which is the cross-entropy impurity.

By splitting a node, we will reduce the impurity to

$$q(L)\,\mathcal{I}(r_L) + q(R)\,\mathcal{I}(r_R),$$

where $q(R)$ is the proportion of observations that go to node $r_R$, and $q(L)$ is the proportion of observations that go to node $r_L$. Constructed this way, the binary tree is a recursive classifier.

Let $Q$ be a potential split for the input vector $x$. If $x = (x_1, \ldots, x_m)$, $Q = \{x_i \ge x_0\}$ would be a valid split if $x_i$ is ordinal, or $Q = \{x_i \in S\}$ if $x_i$ is categorical and $S$ is a subset of possible categorical outcomes for $x_i$. In either case, the split creates two additional nodes for the binary response of the data to $Q$. For the first split, we find the split $Q_1$ that will reduce the impurity measure the most. The second split will be chosen to be the $Q_2$ that most reduces the impurity from one of the two nodes created by $Q_1$.
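To make the three impurity measures concrete, the following sketch evaluates them for a few vectors of class proportions at a node. The anonymous function names (crossEntropy, gini, misclass) and the example proportion vectors are illustrative assumptions, and the proportions are assumed to sum to one at the node.

```matlab
% Impurity measures as functions of a vector p of class proportions at a node.
% Terms with p_i = 0 are dropped from the cross-entropy sum.
crossEntropy = @(p) -sum(p(p > 0) .* log(p(p > 0)));
gini         = @(p) sum(p)^2 - sum(p.^2);   % equals the sum over i ~= j of p_i*p_j
misclass     = @(p) 1 - max(p);

pPure    = [1 0 0];          % a single population: every measure is at its minimum
pUniform = [1 1 1] / 3;      % equal mix: every measure is at its maximum
pMixed   = [0.7 0.2 0.1];    % something in between

nodes = {pPure, pUniform, pMixed};
for i = 1:numel(nodes)
    p = nodes{i};
    fprintf('%-18s entropy = %.4f  gini = %.4f  misclass = %.4f\n', ...
            mat2str(p, 3), crossEntropy(p), gini(p), misclass(p));
end
```

All three measures vanish for the pure node and peak for the uniform node, which is the behavior required of the function ψ above.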
Suppose we are in the middle of constructing a binary classification tree $T$ that has a set of terminal nodes $R$. With $P(\text{reach node } r) = P(r) = \sum_i P_i(r)$, suppose the current impurity function is

$$\mathcal{I}_T = \sum_{r \in R} \mathcal{I}(r)\, P(r).$$

At the next stage, then, we split the node that will most greatly decrease $\mathcal{I}_T$.

Example 17.3 The following made-up example was used in Elsner, Lehmiller, and Kimberlain (1996) to illustrate a case for which linear classification models fail and binary classification trees perform well. Hurricanes are categorized according to season as "tropical only" or "baroclinically influenced". Hurricanes are classified according to location (longitude, latitude), and Figure 17.5(a) shows that no linear rule can separate the two categories without a great amount of misclassification. The average latitude of origin for tropical-only hurricanes is 18.8°N, compared to 29.1°N for baroclinically influenced storms. The baroclinically influenced hurricane season extends from mid-May to December, while the tropical-only season is largely confined to the months of August through October.

For this problem, simple splits are considered and the ones that minimize impurity are $Q_1$: Longitude $\ge$ 67.75, and $Q_2$: Longitude $\le$ 62.5 (see homework). In this case, the tree perfectly separates the two types of storms with two splits and three terminal nodes in Figure 17.5(b).

Fig. 17.5 (a) Location of 37 tropical (circles) and other (plus-signs) hurricanes from Elsner et al. (1996). (b) Corresponding separating tree.
>> long = [59.00 59.50 60.00 60.50 61.00 61.00 61.50 61.50 62.00 63.00 ...
           63.50 64.00 64.50 65.00 65.00 65.00 65.50 66.50 65.50 66.00 66.00 ...
           66.00 66.50 66.50 66.50 67.00 67.50 68.00 68.50 69.00 69.00 69.50 ...
           69.50 70.00 70.50 71.00 71.50];
>> lat = [17.00 21.00 12.00 16.00 13.00 15.00 17.00 19.00 14.00 15.00 ...
          19.00 12.00 16.00 12.00 15.00 17.00 16.00 19.00 21.00 13.00 14.00 ...
          17.00 17.00 18.00 21.00 14.00 18.00 14.00 18.00 13.00 15.00 17.00 ...
          19.00 12.00 16.00 17.00 21.00];
>> trop = [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 ...
           0 0 0 0 0 0 0 0];
>> plot(long(find(long'.*trop'>0.5))', lat(find(long'.*trop'>0.5))', 'o')
>> hold on
>> plot(long(find(long'.*trop'<0.5))', lat(find(long'.*trop'<0.5))', '+')

17.5.1 Growing the Tree

So far we have not decided how many splits will be used in the final tree; we have only determined which splits should take place first. In constructing a binary classification tree, it is standard to grow a tree that is initially too large, and to then prune it back, forming a sequence of sub-trees. This approach works well; if one of the splits made in the tree appears to have no value, it might be worth saving if there exists below it an effective split.

In this case we define a branch to be a split direction that begins at a node and includes all the subsequent nodes in the direction of that split (called a subtree or descendants). For example, suppose we consider splitting tree $T$ at node $r$ and $T_r$ represents the classification tree after the split is made. The new nodes made under $r$ will be denoted $r_R$ and $r_L$. The impurity is now

$$\mathcal{I}_{T_r} = \sum_{s \in R,\, s \ne r} \mathcal{I}(s) P(s) + \mathcal{I}(r_R) P(r_R) + \mathcal{I}(r_L) P(r_L).$$

The change in impurity caused by the split is

$$\Delta \mathcal{I}_{T_r}(r) = \mathcal{I}_T - \mathcal{I}_{T_r} = \mathcal{I}(r) P(r) - \mathcal{I}(r_R) P(r_R) - \mathcal{I}(r_L) P(r_L).$$

Again, let $R$ be the set of all terminal nodes of the tree. If we consider the potential differences for any particular split $Q$, say $\Delta \mathcal{I}_{T_r}(r; Q)$, then the next split should be chosen by finding the terminal node $r$ and split $Q$ corresponding to

$$\max_{r \in R} \, \max_{Q} \, \Delta \mathcal{I}_{T_r}(r; Q).$$

To prevent the tree from splitting too much, we will have a fixed threshold level $\tau > 0$ so that splitting must stop once the change no longer exceeds $\tau$.
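The search for the first split can be sketched for the longitude variable of Example 17.3. This is an illustrative sketch rather than the procedure of any particular tree routine: it assumes candidate splits of the form {longitude >= c} taken at midpoints between consecutive distinct longitudes, scores each by the drop in cross-entropy impurity at the root, and uses helper names (xlogx, nodeImpurity, cuts, drop) that are introduced here only for illustration. The data are the long and trop vectors as listed above.

```matlab
% Longitude and class (1 = tropical only) for the 37 storms of Example 17.3.
long = [59.0 59.5 60.0 60.5 61.0 61.0 61.5 61.5 62.0 63.0 ...
        63.5 64.0 64.5 65.0 65.0 65.0 65.5 66.5 65.5 66.0 66.0 ...
        66.0 66.5 66.5 66.5 67.0 67.5 68.0 68.5 69.0 69.0 69.5 ...
        69.5 70.0 70.5 71.0 71.5];
trop = [zeros(1,9) ones(1,18) zeros(1,10)];

xlogx        = @(p) p .* log(p + (p == 0));        % convention: 0*log(0) = 0
nodeImpurity = @(g) -(xlogx(mean(g)) + xlogx(1 - mean(g)));   % two-class cross-entropy

u    = unique(long);                       % sorted distinct longitudes
cuts = (u(1:end-1) + u(2:end)) / 2;        % candidate split points (midpoints)
n    = numel(long);
drop = zeros(size(cuts));
for j = 1:numel(cuts)
    right = long >= cuts(j);               % split Q = {longitude >= c}
    qR = sum(right) / n;                   % proportion sent to the right node
    qL = 1 - qR;                           % proportion sent to the left node
    drop(j) = nodeImpurity(trop) ...
              - qL * nodeImpurity(trop(~right)) ...
              - qR * nodeImpurity(trop(right));
end
[bestDrop, jBest] = max(drop);
fprintf('Best first split: longitude >= %.2f\n', cuts(jBest));
```

Under these assumptions the search selects the cut at 67.75, in agreement with the split Q1 reported in Example 17.3; repeating the search within the left node would then look for the second split.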
We classify each terminal node according to majority vote: observations in terminal node $r$ are classified into the population $i$ with the highest $n_i(r)$. With this simple rule, the misclassification rate for observations arriving at node $r$ is estimated as $1 - P_i(r)$.

17.5.2 Pruning the Tree

With a tree constructed using only a threshold value to prevent overgrowth, a large set of training data may yield a tree with an abundance of branches and terminal nodes. If $\tau$ is small enough, the tree will fit the data locally, similar to how a 1-nearest-neighbor rule overfits a model. If $\tau$ is too large, the tree will stop growing prematurely, and we might fail to find some interesting features of the data. The best method is to grow the tree a bit too much and then prune back unnecessary branches.

To make this effective, there must be a penalty function for adding extra terminal nodes.

>> load fisheriris
>> numobs = size(meas,1);
>> tree = treefit(meas(:,1:2), species);
>> [dtnum,dtnode,dtclass] = treeval(tree, meas(:,1:2));
>> bad = ~strcmp(dtclass,species);
>> sum(bad) / numobs
ans =
0.1333
% The decision tree misclassifies 13.3% or 20 of the specimens.
>> [grpnum,node,grpname] = treeval(tree, [x y]);
>> gscatter(x,y,grpname,'grb','sod')
>> treedisp(tree,'name',{'SL' 'SW'})
>> resubcost = treetest(tree,'resub');
>> [cost,secost,ntermnodes,bestlevel] = ...
      treetest(tree,'cross',meas(:,1:2),species);
>> plot(ntermnodes,cost,'b-', ntermnodes,resubcost,'r--')
>> xlabel('Number of terminal nodes')
>> ylabel('Cost (misclassification error)')
>> legend('Cross-validation','Resubstitution')

Fig. 17.6 MATLAB function treedisp applied to Fisher's Iris Data.
17.5.3 General Tree Classifiers

Classification and regression trees can be conveniently divided into five different families.

(i) The CART family: Simple versions of CART have been emphasized in this chapter. This method is characterized by its use of two branches from each nonterminal node. Cross-validation and pruning are used to determine the size of the tree. The response variable can be quantitative or nominal. Predictor variables can be nominal or ordinal, and continuous predictors are supported. Motivation: statistical prediction.

(ii) The CLS family: These include ID3, originally developed by Quinlan (1979), and off-shoots such as CLS and C4.5. For this method, the number of branches equals the number of categories of the predictor. Only nominal response and predictor variables are supported in early versions, so continuous inputs had to be binned. However, the latest version of C4.5 supports ordinal predictors. Motivation: concept learning.

(iii) The AID family: Methods include AID, THAID, CHAID, MAID, XAID, FIRM, and TREEDISC. The number of branches varies from two to the number of categories of the predictor. Statistical significance tests (with multiplicity adjustments in the later versions) are used to determine the size of the tree. AID, MAID, and XAID are for quantitative responses. THAID, CHAID, and TREEDISC are for nominal responses, although the version of CHAID from Statistical Innovations, distributed by SPSS, can handle a quantitative categorical response. FIRM comes in two varieties for categorical or continuous response. Predictors can be nominal or ordinal and there is usually provision for a missing-value category. Some versions can handle continuous predictors, others cannot. Motivation: detecting complex statistical relationships.

(iv) Linear combinations: Methods include OC1 and SE-Trees. The number of branches varies from two to the number of categories of the predictor. Motivation: detecting linear statistical relationships combined with concept learning.

(v) Hybrid models: IND is one example. IND combines CART and C4 as well as Bayesian and minimum encoding methods. Knowledge Seeker combines methods from CHAID and ID3 with a novel multiplicity adjustment. Motivation: combining methods from other families to find an optimal algorithm.
17.6 EXERCISES

17.1. Create a simple nearest-neighbor program using MATLAB. It should input a training set of data in $m+1$ columns; one column should contain the population identifier $1, \ldots, k$ and the others contain the input vectors that can have length $m$. Along with this training set, also input another $m$-column matrix representing the classification set. The output should contain $n$, $m$, $k$ and the classifications for the input set.

17.2. For Example 17.3, show the optimal splits, using the cross-entropy measure, in terms of intervals $\{\text{longitude} \ge l_0\}$ and $\{\text{latitude} \ge l_1\}$.

17.3. In this exercise the goal is to discriminate between observations coming from two different normal populations, using logistic regression. Simulate a training data set, $\{(X_i, Y_i),\ i = 1, \ldots, n\}$ (take $n$ even), as follows: For the first half of the data, $X_i$, $i = 1, \ldots, n/2$, are sampled from the standard normal distribution and $Y_i = 0$, $i = 1, \ldots, n/2$. For the second half, $X_i$, $i = n/2+1, \ldots, n$, are sampled from the normal distribution with mean 2 and variance 1, while $Y_i = 1$, $i = n/2+1, \ldots, n$. Fit the logistic regression to this data, $P(Y = 1) = f(X)$. Simulate a validation set $\{(X_j^*, Y_j^*),\ j = 1, \ldots, m\}$ the same way, and classify these new $Y_j^*$'s as 0 or 1 depending on whether $f(X_j^*) < 0.5$ or $\ge 0.5$.

(a) Calculate the error of this logistic regression classifier,

$$L_n(m) = \frac{1}{m} \sum_{j=1}^{m} \mathbf{1}\left(\hat{Y}_j^* \ne Y_j^*\right),$$

where $\hat{Y}_j^*$ is the class assigned to the $j$th validation observation. In your simulations use $n$ = 60, 200, and 2000 and $m$ = 100.

(b) Can the error $L_n(m)$ be made arbitrarily small by increasing $n$?

REFERENCES

Agresti, A. (1990), Categorical Data Analysis, New York: Wiley.

Bellman, R. E. (1961), Adaptive Control Processes, Princeton, NJ: Princeton University Press.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984), Classification and Regression Trees, Belmont, CA: Wadsworth.

Duda, R. O., Hart, P. E., and Stork, D. G. (2001), Pattern Classification, New York: Wiley.
Fisher, R. A. (1936), "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, 7, 179-188.

Elsner, J. B., Lehmiller, G. S., and Kimberlain, T. B. (1996), "Objective Classification of Atlantic Basin Hurricanes," Journal of Climate, 9, 2880-2889.

Friedman, J., and Stuetzle, W. (1981), "Projection Pursuit Regression," Journal of the American Statistical Association, 76, 817-823.

Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning, New York: Springer Verlag.

Kutner, M. A., Nachtsheim, C. J., and Neter, J. (1996), Applied Linear Regression Models, 4th ed., Chicago: Irwin.

Kruskal, J. (1969), "Toward a Practical Method which Helps Uncover the Structure of a Set of Multivariate Observations by Finding the Linear Transformation which Optimizes a New Index of Condensation," Statistical Computation, New York: Academic Press, pp. 427-440.

Marks, S., and Dunn, O. (1974), "Discriminant Functions when Covariance Matrices are Unequal," Journal of the American Statistical Association, 69, 555-559.

Moore, D. H. (1973), "Evaluation of Five Discrimination Procedures for Binary Variables," Journal of the American Statistical Association, 68, 399-404.

Quinlan, J. R. (1979), "Discovering Rules from Large Collections of Examples: A Case Study," in Expert Systems in the Microelectronics Age, Ed. D. Michie, Edinburgh: Edinburgh University Press.

Randles, R. H., Broffitt, J. D., Ramberg, J. S., and Hogg, R. V. (1978), "Generalized Linear and Quadratic Discriminant Functions Using Robust Estimates," Journal of the American Statistical Association, 73, 564-568.

Rosenblatt, F. (1962), Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Washington, DC: Spartan.
18 Nonparametric Bayes

Bayesian (bey'-zhuhn) n. 1. Result of breeding a statistician with a clergyman to produce the much sought honest statistician.

Anonymous

This chapter is about nonparametric Bayesian inference. Understanding the computational machinery needed for non-conjugate Bayesian analysis in this chapter can be quite challenging and it is beyond the scope of this text. Instead, we will use specialized software, WinBUGS, to implement complex Bayesian models in a user-friendly manner. Some applications of WinBUGS have been discussed in Chapter 4 and an overview of WinBUGS is given in Appendix B.

Our purpose is to explore the useful applications of the nonparametric side of Bayesian inference. At first glance, the term nonparametric Bayes might seem like an oxymoron; after all, Bayesian analysis is all about introducing prior distributions on parameters. Actually, nonparametric Bayes is often seen as a synonym for Bayesian models with process priors on the spaces of densities and functions. Dirichlet process priors are the most popular choice. However, many other Bayesian methods are nonparametric in spirit. In addition to Dirichlet process priors, Bayesian formulations of contingency tables and Bayesian models on the coefficients in atomic decompositions of functions will be discussed later in this chapter.
18.1 DIRICHLET PROCESSES

The central idea of traditional nonparametric Bayesian analysis is to draw inference on an unknown distribution function. This leads to models on function spaces, so that the Bayesian nonparametric approach to modeling requires a dramatic shift in methodology. In fact, a commonly used technical definition of nonparametric Bayes models involves infinitely many parameters, as mentioned in Chapter 10.

Results from Bayesian inference are comparable to classical nonparametric inference, such as density and function estimation, estimation of mixtures and smoothing. There are two main groups of nonparametric Bayes methodologies: (1) methods that involve prior/posterior analysis on distribution spaces, and (2) methods in which standard Bayes analysis is performed on a vast number of parameters, such as atomic decompositions of functions and densities. Although these two methodologies can be presented in a unified way (see Mueller and Quintana, 2005), for simplicity we present them separately.

Recall that a Dirichlet random variable can be constructed from gamma random variables. If $X_1, \ldots, X_n$ are independent with $X_i \sim Gamma(a_i, 1)$, then for $Y_i = X_i / \sum_{j=1}^{n} X_j$, the vector $(Y_1, \ldots, Y_n)$ has a Dirichlet $Dir(a_1, \ldots, a_n)$ distribution. The Dirichlet distribution represents a multivariate extension of the beta distribution: $Dir(a_1, a_2) \equiv Be(a_1, a_2)$. Also, from Chapter 2, $E Y_i = a_i / \sum_{j=1}^{n} a_j$, $E Y_i^2 = a_i(a_i + 1) / [\sum_{j=1}^{n} a_j (1 + \sum_{j=1}^{n} a_j)]$, and $E(Y_i Y_j) = a_i a_j / [\sum_{j=1}^{n} a_j (1 + \sum_{j=1}^{n} a_j)]$.

The Dirichlet process (DP), with precursors in the work of Freedman (1963) and Fabius (1964), was formally developed by Ferguson (1973, 1974). It is the first prior developed for spaces of distribution functions. The DP is, formally, a probability measure (distribution) on the space of probability measures (distributions) defined on a common probability space $\mathcal{X}$. Hence, a realization of the DP is a random distribution function.

The DP is characterized by two parameters: (i) $Q_0$, a specific probability measure on $\mathcal{X}$ (or equivalently, $G_0$, a specified distribution function on $\mathcal{X}$); (ii) $\alpha$, a positive scalar parameter.

Definition 18.1 (Ferguson, 1973) The DP generates random probability measures (random distributions) $Q$ on $\mathcal{X}$ such that for any finite partition $B_1, \ldots, B_k$ of $\mathcal{X}$,

$$(Q(B_1), \ldots, Q(B_k)) \sim Dir(\alpha Q_0(B_1), \ldots, \alpha Q_0(B_k)),$$

where $Q(B_i)$ (a random variable) and $Q_0(B_i)$ (a constant) denote the probability of set $B_i$ under $Q$ and $Q_0$, respectively. Thus, for any $B$,

$$Q(B) \sim Be(\alpha Q_0(B), \alpha(1 - Q_0(B)))$$

and

$$E(Q(B)) = Q_0(B), \qquad Var(Q(B)) = \frac{Q_0(B)(1 - Q_0(B))}{\alpha + 1}.$$

The probability measure $Q_0$ plays the role of the center of the DP, while $\alpha$ can be viewed as a precision parameter. Large $\alpha$ implies small variability of the DP about its center $Q_0$.

The above can be expressed in terms of CDFs, rather than in terms of probabilities. For $B = (-\infty, x]$ the probability $Q(B) = Q((-\infty, x]) = G(x)$ is a distribution function. As a result, we can write

$$G(x) \sim Be(\alpha G_0(x), \alpha(1 - G_0(x)))$$

and

$$E(G(x)) = G_0(x), \qquad Var(G(x)) = \frac{G_0(x)(1 - G_0(x))}{\alpha + 1}.$$

The notation $G \sim DP(\alpha G_0)$ indicates that the DP prior is placed on the distribution $G$.

Example 18.1 Let $G \sim DP(\alpha G_0)$ and let $x_1 < x_2 < \cdots < x_n$ be points in the support of $G_0$ with $G_0(x_n) = 1$. By Definition 18.1, the increments $(G(x_1), G(x_2) - G(x_1), \ldots, G(x_n) - G(x_{n-1}))$ have a $Dir(\alpha G_0(x_1), \alpha(G_0(x_2) - G_0(x_1)), \ldots, \alpha(G_0(x_n) - G_0(x_{n-1})))$ distribution, which suggests a way to generate a realization of $G$ at discrete points:

>> n = 30;        % generate random CDF's at 30 equispaced points
>> a = 2;         % a, b are parameters of the
>>                % BASE distribution G_0 = Beta(2,2)
>> b = 2;
>> alpha = 20;    % The precision parameter alpha = 20 describes
>>                % scattering about the BASE distribution.
>>                % Higher alpha, less variability.
>> %-------------------
>> x = linspace(1/n,1,n);    % the equispaced points at which
>>                           % random CDF's are evaluated.
>> y = CDF_beta(x,a,b);      % find CDF's of BASE
>> par = [y(1) diff(y)];     % and form a Dirichlet parameter
>> %-------------------
>> for i = 1:15              % Generate 15 random CDF's.
>>    yy = rand_dirichlet(alpha*par,1);
>>    plot(x,cumsum(yy),'-','linewidth',1)   % cumulative sum
>>                           % of Dirichlet vector is a random CDF
>>    hold on
>> end
>> yyy = 6.*(x.^2/2 - x.^3/3);    % Plot BASE CDF as reference
>> plot(x,yyy,':','linewidth',3)

Fig. 18.1 The base CDF Be(2,2) is shown as a dotted line. Fifteen random CDF's from DP(20, Be(2,2)) are scattered around the base CDF.

An alternative definition of the DP, due to Sethuraman and Tiwari (1982) and Sethuraman (1994), is known as the stick-breaking algorithm.

Definition 18.2 Let $U_i \sim Be(1, \alpha)$, $i = 1, 2, \ldots$ and $V_i \sim G_0$, $i = 1, 2, \ldots$ be two independent sequences of i.i.d. random variables. Define weights $w_1 = U_1$ and $w_i = U_i \prod_{j=1}^{i-1} (1 - U_j)$, $i > 1$. Then,

$$G = \sum_{k=1}^{\infty} w_k \, \delta(V_k) \sim DP(\alpha G_0),$$

where $\delta(V_k)$ is a point mass at $V_k$.
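Definition 18.2 can be turned into an approximate sampler by keeping only the first K sticks and giving the leftover mass to the last weight so that the weights sum to one. The following is a minimal sketch under stated assumptions: the truncation level K = 200 and all variable names are chosen here for illustration, the base distribution is taken to be G_0 = Be(2,2) (sampled as the median of three independent uniforms), and Be(1, alpha) draws are obtained by inverting the CDF 1 - (1 - u)^alpha, so that only base MATLAB functions are needed.

```matlab
alpha = 20;                       % precision parameter of the DP
K     = 200;                      % truncation level (an assumed value)

% U_i ~ Be(1, alpha) by inversion: if W ~ U(0,1), then 1 - W^(1/alpha) ~ Be(1, alpha).
U = 1 - rand(1, K).^(1/alpha);

% Stick-breaking weights: w_1 = U_1, w_i = U_i * prod_{j<i} (1 - U_j).
w = U .* [1, cumprod(1 - U(1:K-1))];
w(K) = 1 - sum(w(1:K-1));         % last weight absorbs the leftover mass

% Atoms V_i ~ G_0 = Be(2,2): the median of three independent uniforms.
V = median(rand(3, K));

% Random CDF G_K(x) = sum_k w_k * I(V_k <= x), evaluated on a grid.
x = linspace(0, 1, 101);
G = zeros(size(x));
for i = 1:numel(x)
    G(i) = sum(w(V <= x(i)));
end

plot(x, G, '-', x, 3*x.^2 - 2*x.^3, ':')   % one realization vs. the base CDF
legend('G_K', 'G_0', 'Location', 'northwest')
```

Each run produces one step-function realization scattered around the Be(2,2) CDF; a smaller alpha puts most of the mass on the first few sticks and gives a visibly coarser step function.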
The distribution $G$ is discrete, as a countable mixture of point masses, and from this definition one can see that with probability 1 only discrete distributions fall in the support of DP. The name stick-breaking comes from the fact that $\sum_i \omega_i = 1$ with probability 1, that is, the unity is broken on infinitely many random weights. Definition 18.2 suggests another way to generate approximately from a given DP.

Let $G_K = \sum_{k=1}^{K} \omega_k\,\delta(V_k)$ where the weights $\omega_1,\dots,\omega_{K-1}$ are as in Definition 18.2 and the last weight $\omega_K$ is modified as $1-\omega_1-\dots-\omega_{K-1}$, so that the sum of $K$ weights is 1. In practical applications, $K$ is selected so that $(\alpha/(1+\alpha))^K$, the expected total weight beyond the $K$-th atom, is small.

18.1.1 Updating Dirichlet Process Priors

The critical step in any Bayesian inference is the transition from the prior to the posterior, that is, updating a prior when data are available. If $Y_1, Y_2, \dots, Y_n$ is a random sample from $G$, and $G$ has Dirichlet prior $DP(\alpha G_0)$, the posterior remains Dirichlet, $G \mid Y_1,\dots,Y_n \sim DP(\alpha^* G_0^*)$, with $\alpha^* = \alpha + n$, and

$$G_0^*(t) = \frac{\alpha}{\alpha+n}\,G_0(t) + \frac{n}{\alpha+n}\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}(Y_i \le t)\right). \qquad (18.2)$$

Notice that the DP prior and the EDF constitute a conjugate pair because the posterior is also a DP. The posterior estimate of the distribution is $E(G \mid Y_1,\dots,Y_n) = G_0^*(t)$ which is, as we saw in several examples with conjugate priors, a weighted average of the "prior mean" and the maximum likelihood estimator (the EDF).

Example 18.2 In the spirit of classical nonparametrics, the problem of estimating the CDF at a fixed value $x$ has a simple nonparametric Bayes solution. Suppose the sample $X_1,\dots,X_n \sim F$ is observed and that one is interested in estimating $F(x)$. Suppose $F$ is assigned a Dirichlet process prior with a center $F_0$ and a small precision parameter $\alpha$. The posterior distribution for $F(x)$ is $\mathrm{Be}(\alpha F_0(x) + \ell_x,\ \alpha(1-F_0(x)) + n - \ell_x)$ where $\ell_x$ is the number of observations in the sample smaller than or equal to $x$. As $\alpha \to 0$, the posterior tends to a $\mathrm{Be}(\ell_x, n-\ell_x)$. This limiting posterior is often called noninformative. By inspecting the $\mathrm{Be}(\ell_x, n-\ell_x)$ distribution, or generating from it, one can find a posterior probability region for the CDF at any value $x$. Note that the posterior expectation of $F(x)$ is equal to the classical estimator $\ell_x/n$, which makes sense because the prior is noninformative.

Fig. 18.2 For a sample of n = 15 Beta(2,2) observations a boxplot of "noninformative" posterior realizations of P(X ≤ 1) is shown. Exact value F(1) for Beta(2,2) is shown as dotted line.

Example 18.3 The underground train at Hartsfield-Jackson airport arrives at its starting station every four minutes. The number of people $Y$ entering a single car of the train is a random variable with a Poisson distribution, $Y \mid \lambda \sim \mathcal{P}(\lambda)$. A sample of size $N = 20$ for $Y$ is obtained below.

9 7 7 8 8 11 8 7 5 7 13 5 7 14 4 6 18 9 8 10

The prior on $\lambda$ is any discrete distribution supported on integers $[1, 17]$,

$$\lambda \mid P \sim \mathrm{Discr}\big((1,2,\dots,17),\ P = (p_1,p_2,\dots,p_{17})\big),$$

where $\sum_i p_i = 1$. The hyperprior on probabilities $P$ is Dirichlet,

$$P \sim \mathrm{Dir}(\alpha G_0(1), \alpha G_0(2), \dots, \alpha G_0(17)).$$

We can assume that the prior on $\lambda$ is a Dirichlet process with

$$G_0 = [1,1,1,2,2,3,3,4,4,5,6,5,4,3,2,1,1]/48$$

and $\alpha = 48$. We are interested in posterior inference on the rate parameter $\lambda$.

model{
  for (i in 1:N) {
    y[i] ~ dpois(lambda)
  }
  lambda ~ dcat(P[])
  P[1:bins] ~ ddirch(alphaG0[])
}
#data
list(bins=17, alphaG0=c(1,1,1,2,2,3,3,4,4,5,6,5,4,3,2,1,1),
     y=c(9,7,7,8,8,11,8,7,5,7,13,5,7,14,4,6,18,9,8,10), N=20)
#inits
list(lambda=12, P=c(0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0))

The summary posterior statistics were found directly from within WinBUGS:

node     mean     sd       MC error  2.5%      median   97.5%
lambda   8.634    0.6687   0.003232  8         9        10
P[1]     0.02034  0.01982  8.556E-5  5.413E-4  0.01445  0.07282
P[2]     0.02038  0.01995  8.219E-5  5.374E-4  0.01423  0.07391
P[3]     0.02046  0.02004  8.752E-5  5.245E-4  0.01434  0.07456
P[4]     0.04075  0.028    1.179E-4  0.004988  0.03454  0.1113
P[5]     0.04103  0.028    1.237E-4  0.005249  0.03507  0.1107
P[6]     0.06142  0.03419  1.575E-4  0.01316   0.05536  0.143
P[7]     0.06171  0.03406  1.586E-4  0.01313   0.05573  0.1427
P[8]     0.09012  0.04161  1.981E-4  0.02637   0.08438  0.1859
P[9]     0.09134  0.04163  1.956E-4  0.02676   0.08578  0.1866
P[10]    0.1035   0.04329  1.85E-4   0.03516   0.09774  0.2022
P[11]    0.1226   0.04663  2.278E-4  0.04698   0.1175   0.2276
P[12]    0.1019   0.04284  1.811E-4  0.03496   0.09649  0.1994
P[13]    0.08173  0.03874  1.71E-4   0.02326   0.07608  0.1718
P[14]    0.06118  0.03396  1.585E-4  0.01288   0.05512  0.1426
P[15]    0.04085  0.02795  1.336E-4  0.005309  0.03477  0.1106
P[16]    0.02032  0.01996  9.549E-5  5.317E-4  0.01419  0.07444
P[17]    0.02044  0.01986  8.487E-5  5.475E-4  0.01445  0.07347

The main parameter of interest is the arrival rate, $\lambda$. The posterior mean of $\lambda$ is 8.634. The median is 9 passengers every four minutes. Either number could be justified as an estimate of the passenger arrival rate per four-minute interval. WinBUGS provides an easy way to save the simulated parameter values, in order, to a text file. This then enables the data to be easily imported into another environment, such as R or MATLAB, for data analysis and graphing. In this example, MATLAB was used to provide the histograms for $\lambda$ and $p_{10}$. The histograms in Figure 18.3 illustrate that $\lambda$ is pretty much confined to the five integers 7, 8, 9, 10, and 11, with the mode 9.

Fig. 18.3 Histograms of 40,000 samples from the posterior for λ and P[10].

18.1.2 Generalizing Dirichlet Processes

Some popular NP Bayesian models employ a mixture of Dirichlet processes. The motivation for such models is their extraordinary modeling flexibility. Let $X_1, X_2, \dots, X_n$ be the observations modeled as

$$X_i \mid \theta_i \sim \mathrm{Bin}(n_i, \theta_i), \qquad \theta_i \mid F \sim F,\ i = 1,\dots,n, \qquad F \sim \mathrm{Dir}(\alpha). \qquad (18.3)$$

If $\alpha$ assigns mass to every open interval on $[0,1]$ then the support of the distributions on $F$ is the class of all distributions on $[0,1]$. This model allows for pooling information across the samples. For example, observation $X_i$ will have an effect on the posterior distribution of $\theta_j$, $j \neq i$, via the hierarchical stage of the model involving the common Dirichlet process.

The model (18.3) is used extensively in the applications of Bayesian nonparametrics. For example, Berry and Christensen (1979) use the model for the quality of welding material submitted to a naval shipyard, implying an interest in posterior distributions of $\theta_i$. Liu (1996) uses the model for results of flicks of thumbtacks and focuses on the distribution of $\theta_{n+1} \mid X_1,\dots,X_n$. MacEachern, Clyde, and Liu (1999) discuss estimation of the posterior predictive $X_{n+1} \mid X_1,\dots,X_n$, and some other posterior functionals.

The DP is the most popular nonparametric Bayes model in the literature (for a recent review, see MacEachern and Mueller, 2000). However, limiting the prior to discrete distributions may not be appropriate for some applications. A simple extension to remove the constraint of discrete measures is to use a convoluted DP:

$$F \mid G \sim \int f(x \mid \theta)\,dG(\theta), \qquad G \sim DP(\alpha G_0).$$

This model is called Dirichlet Process Mixture (DPM), because the mixing is done by the DP. Posterior inference for DPM models is based on MCMC posterior simulation. Most approaches proceed by introducing latent variables $\theta_i$: $X_i \mid \theta_i \sim f(x \mid \theta_i)$, $\theta_i \mid G \sim G$ and $G \sim DP(\alpha G_0)$. Efficient MCMC simulation for general MDP models is discussed, among others, in Escobar (1994), Escobar and West (1995), Bush and MacEachern (1996) and MacEachern and Mueller (1998). Using a Gaussian kernel, $f(x \mid \mu, \Sigma) \propto \exp\{-(x-\mu)'\Sigma^{-1}(x-\mu)/2\}$, and mixing with respect to $\theta = (\mu, \Sigma)$, a density estimate resembling traditional kernel density estimation is obtained. Such approaches have been studied in Lo (1984) and Escobar and West (1995).

A related generalization of Dirichlet Processes is the Mixture of Dirichlet Processes (MDP). The MDP is defined as a DP with a center CDF which depends on random $\theta$,

$$F \sim DP(\alpha G_\theta), \qquad \theta \sim \pi(\theta).$$

Antoniak (1974) explored theoretical properties of MDP's and obtained the posterior distribution for $\theta$.

18.2 BAYESIAN CONTINGENCY TABLES AND CATEGORICAL MODELS

In contingency tables, the cell counts $N_{ij}$ can be modeled as realizations from a count distribution, such as Multinomial $\mathcal{M}n(n, p_{ij})$ or Poisson $\mathcal{P}(\lambda_{ij})$. The hypothesis of interest is independence of row and column factors, $H_0: p_{ij} = a_i b_j$, where $a_i$ and $b_j$ are marginal probabilities of levels of two factors satisfying $\sum_i a_i = \sum_j b_j = 1$.

The expected cell count for the multinomial distribution is $E N_{ij} = n p_{ij}$. Under $H_0$, this equals $n a_i b_j$, so by taking the logarithm on both sides, one obtains

$$\log E N_{ij} = \log n + \log a_i + \log b_j = \mathrm{const} + \alpha_i + \beta_j,$$
for some parameters $\alpha_i$ and $\beta_j$. This shows that testing the model for additivity in parameters $\alpha$ and $\beta$ is equivalent to testing the original independence hypothesis $H_0$. For the Poisson counts, the situation is analogous: one uses $\log \lambda_{ij} = \mathrm{const} + \alpha_i + \beta_j$.

Example 18.4 Activities of Dolphin Groups Revisited. We revisit the Dolphin's Activity example from p. 162. Groups of dolphins were observed off the coast of Iceland and the table providing group counts is given below. The counts are listed according to the time of the day and the main activity of the dolphin group. The hypothesis of interest is independence of the type of activity from the time of the day.

             Travelling   Feeding   Socializing
Morning           6          28          38
Noon              6           4           5
Afternoon        14           0           9
Evening          13          56          10

The WinBUGS program implementing the additive model is quite simple. We assume the cell counts are distributed Poisson and the logarithm of intensity (expectation) is represented in an additive manner. The model parts (intercept, $\alpha_i$, and $\beta_j$) are assigned normal priors with mean zero and precision parameter xi. The precision parameter is given a gamma prior with mean 1 and variance 10. In addition to the model parameters, the WinBUGS program will calculate the deviance and chi-square statistics that measure goodness of fit for this model.

model {
  for (i in 1:nrow) {
    for (j in 1:ncol) {
      groups[i,j] ~ dpois(lambda[i,j])
      log(lambda[i,j]) <- c + alpha[i] + beta[j]
    }
  }
  #
  c ~ dnorm(0, xi)
  for (i in 1:nrow) { alpha[i] ~ dnorm(0, xi) }
  for (j in 1:ncol) { beta[j] ~ dnorm(0, xi) }
  xi ~ dgamma(0.01, 0.01)
  #
  for (i in 1:nrow) {
    for (j in 1:ncol) {
      devG[i,j] <- groups[i,j] * log((groups[i,j]+0.5)/
                   (lambda[i,j]+0.5)) - (groups[i,j]-lambda[i,j]);
      devX[i,j] <- (groups[i,j]-lambda[i,j])
                   *(groups[i,j]-lambda[i,j])/lambda[i,j];
    }
  }
  G2 <- 2 * sum(devG[,]);
  X2 <- sum(devX[,])
}
The data are imported as

list(nrow=4, ncol=3,
     groups = structure(
       .Data = c(6,28,38,6,4,5,14,0,9,13,56,10), .Dim=c(4,3)))

and initial parameters are

list(xi=0.1, c=0, alpha=c(0,0,0,0), beta=c(0,0,0))

The following output gives Bayes estimators of the parameters, and measures of fit. This additive model conforms poorly to the observations; under the hypothesis of independence, the test statistic is $\chi^2$ with $3 \times 4 - 6 = 6$ degrees of freedom, and the observed value $X^2 = 77.73$ has a p-value (1-chi2cdf(77.73, 6)) that is essentially zero.

node       mean      sd       MC error   2.5%       median    97.5%
c          1.514     0.7393   0.03152    -0.02262   1.536     2.961
alpha[1]   1.028     0.5658   0.0215     -0.07829   1.025     2.185
alpha[2]   -0.5182   0.5894   0.02072    -1.695     -0.5166   0.6532
alpha[3]   -0.1105   0.5793   0.02108    -1.259     -0.1113   1.068
alpha[4]   1.121     0.5656   0.02158    0.02059    1.117     2.277
beta[1]    0.1314    0.6478   0.02492    -1.134     0.1101    1.507
beta[2]    0.9439    0.6427   0.02516    -0.3026    0.9201    2.308
beta[3]    0.5924    0.6451   0.02512    -0.6616    0.5687    1.951
G2         77.8      3.452    0.01548    73.07      77.16     86.2
X2         77.73     9.871    0.03737    64.32      75.85     102.2

Example 18.5 Cæsarean Section Infections Revisited. We now consider the Bayesian solution to the Cæsarean section birth problem from p. 236. The model for probability of infection in a birth by Cæsarean section was given in terms of the logit link as

$$\log \frac{P(\text{infection})}{P(\text{no infection})} = \beta_0 + \beta_1\,\text{noplan} + \beta_2\,\text{riskfac} + \beta_3\,\text{antibio}.$$

The WinBUGS program provided below implements the model in which the number of infections is $\mathrm{Bin}(n, p)$ with $p$ connected to covariates noplan, riskfac and antibio via the logit link. Priors on coefficients in the linear predictor are set to be a vague Gaussian (small precision parameter).

model{
  for (i in 1:N) {
    inf[i] ~ dbin(p[i], total[i])
    logit(p[i]) <- beta0 + beta1*noplan[i] +
                   beta2*riskfac[i] + beta3*antibio[i]
  }
  beta0 ~ dnorm(0, 0.00001)
  beta1 ~ dnorm(0, 0.00001)
  beta2 ~ dnorm(0, 0.00001)
  beta3 ~ dnorm(0, 0.00001)
}
#DATA
list(inf=c(1,11,0,0,28,23,8,0),
     total=c(18,98,2,0,58,26,40,9),
     noplan=c(0,1,0,1,0,1,0,1),
     riskfac=c(1,1,0,0,1,1,0,0),
     antibio=c(1,1,1,1,0,0,0,0), N=8)
#INITS
list(beta0=0, beta1=0, beta2=0, beta3=0)

The Bayes estimates of the parameters $\beta_0$-$\beta_3$ are given in the WinBUGS output below.

node    mean     sd       MC error   2.5%     median   97.5%
beta0   -1.962   0.4283   0.004451   -2.861   -1.941   -1.183
beta1   1.115    0.4323   0.003004   0.29     1.106    1.988
beta2   2.101    0.4691   0.004843   1.225    2.084    3.066
beta3   -3.339   0.4896   0.003262   -4.338   -3.324   -2.418

Note that Bayes estimators are close to the estimators obtained in the frequentist solution in Chapter 12: $(\hat\beta_0, \hat\beta_1, \hat\beta_2, \hat\beta_3) = (-1.89, 1.07, 2.03, -3.25)$ and that in addition to the posterior means, posterior medians and 95% credible sets for the parameters are provided. WinBUGS can provide various posterior location and precision measures. From the table, the 95% credible set for $\beta_0$ is $[-2.861, -1.183]$.

18.3 BAYESIAN INFERENCE IN INFINITELY DIMENSIONAL NONPARAMETRIC PROBLEMS

Earlier in the book we argued that many statistical procedures classified as nonparametric are, in fact, infinitely parametric. Examples include wavelet regression, orthogonal series density estimators and nonparametric MLEs (Chapter 10). In order to estimate such functions, we rely on shrinkage, tapering or truncation of coefficient estimators in a potentially infinite expansion class (Chencov's orthogonal series density estimators, Fourier and wavelet shrinkage, and related). The benefits of shrinkage estimation in statistics were first explored in the mid-1950's by C. Stein. In the 1970's and 1980's, many statisticians were active in research on statistical properties of classical and Bayesian shrinkage estimators.

Bayesian methods have become popular in shrinkage estimation because Bayes rules are, in general, "shrinkers". Most Bayes rules shrink large coefficients slightly, whereas small ones are more heavily shrunk. Furthermore, interest in Bayesian methods is boosted by the possibility of incorporating prior information about the function to model wavelet coefficients in a realistic way.

Wavelet transformations $W$ are applied to noisy measurements $y_i = f_i + \varepsilon_i$, $i = 1,\dots,n$, or, in vector notation, $y = f + \varepsilon$. The linearity of $W$ implies that the transformed vector $d = W(y)$ is the sum of the transformed signal $\theta = W(f)$ and the transformed noise $\eta = W(\varepsilon)$. Furthermore, the orthogonality of $W$ implies that $\varepsilon_i$, i.i.d. normal $N(0,\sigma^2)$ components of the noise vector $\varepsilon$, are transformed into components of $\eta$ with the same distribution.

Bayesian methods are applied in the wavelet domain, that is, after the wavelet transformation has been applied and the model $d_i \sim N(\theta_i, \sigma^2)$, $i = 1,\dots,n$, has been obtained. We can model coefficient-by-coefficient because wavelets decorrelate and $d_i$'s are approximately independent.

Therefore we concentrate just on a single typical wavelet coefficient and one model: $d = \theta + \varepsilon$. Bayesian methods are applied to estimate the location parameter $\theta$. As $\theta$'s correspond to the function to be estimated, back-transforming an estimated vector $\theta$ will give the estimator of the function.

18.3.1 BAMS Wavelet Shrinkage

BAMS (which stands for Bayesian Adaptive Multiscale Shrinkage) is a simple efficient shrinkage in which the shrinkage rule is a Bayes rule for properly selected prior and hyperparameters of the prior. Starting with $[d \mid \theta, \sigma^2] \sim N(\theta, \sigma^2)$ and the prior $\sigma^2 \sim \mathcal{E}(\mu)$, $\mu > 0$, with density $f(\sigma^2 \mid \mu) = \mu e^{-\mu\sigma^2}$, we obtain the marginal likelihood

$$d \mid \theta \sim \mathcal{DE}\left(\theta, \frac{1}{\sqrt{2\mu}}\right), \quad \text{with density } f(d \mid \theta) = \frac{1}{2}\sqrt{2\mu}\; e^{-\sqrt{2\mu}\,|d-\theta|}.$$

If the prior on $\theta$ is a mixture of a point mass $\delta_0$ at zero, and a double-exponential distribution,

$$\theta \mid \epsilon \sim \epsilon\,\delta_0 + (1-\epsilon)\,\mathcal{DE}(0, \tau), \qquad (18.4)$$

where, consistently with the notation above, $\mathcal{DE}(0,\tau)$ has density $\frac{1}{2\tau}e^{-|\theta|/\tau}$, then the posterior mean of $\theta$ (from Bayes rule) is:

$$\delta^*(d) = \frac{(1-\epsilon)\,m(d)\,\delta(d)}{(1-\epsilon)\,m(d) + \epsilon\, f(d \mid 0)}, \qquad (18.5)$$

where

$$m(d) = \frac{\tau e^{-|d|/\tau} - \frac{1}{\sqrt{2\mu}}\, e^{-\sqrt{2\mu}\,|d|}}{2\tau^2 - 1/\mu} \qquad (18.6)$$

and, for $d \ge 0$ (with $\delta(-d) = -\delta(d)$),

$$\delta(d) = \frac{\tau\left(\tau^2 - \frac{1}{2\mu}\right) d\, e^{-d/\tau} + \frac{\tau^2}{\mu}\left(e^{-\sqrt{2\mu}\,d} - e^{-d/\tau}\right)}{\left(\tau^2 - \frac{1}{2\mu}\right)\left(\tau e^{-d/\tau} - \frac{1}{\sqrt{2\mu}}\, e^{-\sqrt{2\mu}\,d}\right)}. \qquad (18.7)$$

Fig. 18.4 Bayes rule (18.7) and comparable hard and soft thresholding rules.

As evident from Figure 18.4, the Bayes rule (18.5) falls between comparable hard- and soft-thresholding rules. To apply the shrinkage in (18.5) on a specific problem, the hyperparameters $\mu$, $\tau$, and $\epsilon$ have to be specified. A default choice for the parameters is suggested in Vidakovic and Ruggeri (2001); see also Antoniadis, Bigot, and Sapatinas (2001) for a comparative study of many shrinkage rules, including BAMS. Their analysis is accompanied by MATLAB routines and can be found at

http://www-lmc.imag.fr/SMS/software/GaussianWaveDen/.

Figure 18.5(a) shows a noisy doppler function of size $n = 1024$, where the signal-to-noise ratio (defined as a ratio of variances of signal and noise) is 7. Panel (b) in the same figure shows the function smoothed by BAMS. The graphs are based on default values for the hyperparameters.

Fig. 18.5 (a) A noisy doppler signal [SNR = 7, n = 1024, noise variance σ² = 1]. (b) Signal reconstructed by BAMS.

Example 18.6 Bayesian Wavelet Shrinkage in WinBUGS. Because of the decorrelating property of wavelet transforms, the wavelet coefficients are modeled independently. A selected coefficient $d$ is assumed to be normal, $d \sim N(\theta, \xi)$, where $\theta$ is the coefficient corresponding to the underlying signal in data and [...]

(i) Show that the Bayes rule (posterior expectation) for $\theta$ has the explicit form of [...] where

$$P(\gamma = 1 \mid d) = \frac{p\,\pi(d \mid \gamma = 1)}{p\,\pi(d \mid \gamma = 1) + (1-p)\,\pi(d \mid \gamma = 0)}$$

and $\pi(d \mid \gamma = 1)$ and $\pi(d \mid \gamma = 0)$ are densities of $N(0, \sigma^2 + (c\tau)^2)$ and $N(0, \sigma^2 + \tau^2)$ distributions, respectively, evaluated at $d$.

(ii) Plot the Bayes rule from (i) for selected values of parameters and hyperparameters $(\sigma^2, \tau^2, \gamma, c)$ so that the shape of the rule is reminiscent of thresholding.

REFERENCES

Antoniadis, A., Bigot, J., and Sapatinas, T. (2001), "Wavelet Estimators in Nonparametric Regression: A Comparative Simulation Study," Journal of Statistical Software, 6, 1-83.

Antoniak, C. E. (1974), "Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems," Annals of Statistics, 2, 1152-1174.

Berry, D. A., and Christensen, R. (1979), "Empirical Bayes Estimation of a Binomial Parameter Via Mixtures of Dirichlet Processes," Annals of Statistics, 7, 558-568.

Bush, C. A., and MacEachern, S. N. (1996), "A Semi-parametric Bayesian Model for Randomized Block Designs," Biometrika, 83, 275-286.

Chipman, H. A., Kolaczyk, E. D., and McCulloch, R. E. (1997), "Adaptive Bayesian Wavelet Shrinkage," Journal of American Statistical Association, 92, 1413-1421.

Escobar, M. D. (1994), "Estimating Normal Means with a Dirichlet Process Prior," Journal of American Statistical Association, 89, 268-277.

Escobar, M. D., and West, M. (1995), "Bayesian Density Estimation and Inference Using Mixtures," Journal of American Statistical Association, 90, 577-588.

Fabius, J. (1964), "Asymptotic Behavior of Bayes' Estimates," Annals of Mathematical Statistics, 35, 846-856.

Ferguson, T. S. (1973), "A Bayesian Analysis of Some Nonparametric Problems," Annals of Statistics, 1, 209-230.

Ferguson, T. S. (1974), "Prior Distributions on Spaces of Probability Measures," Annals of Statistics, 2, 615-629.

Freedman, D. A. (1963), "On the Asymptotic Behavior of Bayes' Estimates in the Discrete Case," Annals of Mathematical Statistics, 34, 1386-1403.

Liu, J. S. (1996), "Nonparametric Hierarchical Bayes via Sequential Imputations," Annals of Statistics, 24, 911-930.

Lo, A. Y. (1984), "On a Class of Bayesian Nonparametric Estimates, I. Density Estimates," Annals of Statistics, 12, 351-357.

MacEachern, S. N., and Mueller, P. (1998), "Estimating Mixture of Dirichlet Process Models," Journal of Computational and Graphical Statistics, 7, 223-238.

MacEachern, S. N., and Mueller, P. (2000), "Efficient MCMC Schemes for Robust Model Extensions Using Encompassing Dirichlet Process Mixture Models," in Robust Bayesian Analysis, Eds. F. Ruggeri and D. Rios-Insua, New York: Springer Verlag.

MacEachern, S. N., Clyde, M., and Liu, J. S. (1999), "Sequential Importance Sampling for Nonparametric Bayes Models: The Next Generation," Canadian Journal of Statistics, 27, 251-267.

Mueller, P., and Quintana, F. A. (2004), "Nonparametric Bayesian Data Analysis," Statistical Science, 19, 95-110.

Sethuraman, J., and Tiwari, R. C. (1982), "Convergence of Dirichlet Measures and the Interpretation of their Parameter," in Statistical Decision Theory and Related Topics III, eds. S. Gupta and J. O. Berger, New York: Springer Verlag, 2, pp. 305-315.

Sethuraman, J. (1994), "A Constructive Definition of Dirichlet Priors," Statistica Sinica, 4, 639-650.

Vidakovic, B., and Ruggeri, F. (2001), "BAMS Method: Theory and Simulations," Sankhya, Ser. B, 63, 234-249.
Appendix A: MATLAB

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
J. W. Tukey (1915-2000)

A.1 USING MATLAB

MATLAB is an interactive environment that allows the user to perform computational tasks and create graphical output. The user types in expressions and commands in a Command Window where numerical results of the commands are displayed with the user input. Graphical output will be produced in a new (graphics) window that can usually be printed or stored.

When MATLAB is launched, several windows are available to the user, as you can see in Fig. A.7. Their uses are listed below:

• Command Window: Typing commands and expressions - this is the main interactive window in the user interface
• Launch Pad Window: Allows user to run demos
• Workspace Window: List of variables entered or created during session
• Command History Window: List of recent commands used
• Array Editor Window: Allows user to manipulate array variables using a spreadsheet
• Current Directory Window: To specify directory where MATLAB will search for or store files

Fig. A.7 Interactive environment of MATLAB.

MATLAB is a high-level technical computing language for algorithm development, data visualization, data analysis, and numeric computation. Some highlight features of MATLAB can be summarized as

• High-level language for technical computing, which is easy to learn
• Development environment for managing code, files, and data
• Mathematical functions for linear algebra, statistics, Fourier analysis, filtering, optimization, and numerical integration
• 2-D and 3-D graphics functions for visualizing data
• Tools for building custom graphical user interfaces
• Functions to communicate with other statistical software, such as R, WinBUGS
To get started, you can type doc in the command window. This will bring you to an HTML help window and you can search by keyword or browse topics therein.

>> doc

Fig. A.8 Help window of MATLAB.

If you know the function name, but do not know how to use it, it is often useful to type "help functionname" in the command window. For example, suppose you want to know how to use the function randg or find out what randg does.

>> help randg
 RANDG Gamma random numbers (unit scale).
    Note: To generate gamma random numbers with specified shape and
    scale parameters, you should call GAMRND.

    R = RANDG returns a scalar random value chosen from a gamma
    distribution with unit scale and shape.

    R = RANDG(A) returns a matrix of random values chosen from gamma
    distributions with unit scale. R is the same size as A, and RANDG
    generates each element of R using a shape parameter equal to the
    corresponding element of A.
    ....

A.1.1 Toolboxes

Serving as extensions to the basic MATLAB programming environment, toolboxes are available for specific research interests. Toolboxes available include

Communications Toolbox
Control System Toolbox
DSP Blockset
Extended Symbolic Math Toolbox
Financial Toolbox
Frequency Domain System Identification
Fuzzy Logic Toolbox
Higher-Order Spectral Analysis Toolbox
Image Processing Toolbox
LMI Control Toolbox
Mapping Toolbox
Model Predictive Control Toolbox
Mu-Analysis and Synthesis Toolbox
NAG Foundation Blockset
Neural Network Toolbox
Optimization Toolbox
Partial Differential Equation Toolbox
QFT Control Design Toolbox
Robust Control Toolbox
Signal Processing Toolbox
Spline Toolbox
Statistics Toolbox
System Identification Toolbox
Wavelet Toolbox

For the most part we use functions in the base MATLAB product, but where necessary we also use functions from the Statistics Toolbox. There are numerous procedures from other toolboxes that can be helpful in nonparametric data analysis (e.g., Neural Network Toolbox, Wavelet Toolbox) but we restrict routine applications to basic and fundamental computational algorithms to avoid making the book depend on any pre-written software code.

A.2 MATRIX OPERATIONS

MATLAB was originally written to provide easy interaction with matrix software developed by the LINPACK and EISPACK projects. These projects were sponsored by NASA (National Aeronautics and Space Administration), and much of the source code is in the public domain. Today, MATLAB engines incorporate the LAPACK and BLAS libraries, embedding the state of the art in software for matrix computation. Instead of relying on do loops to perform repeated tasks, MATLAB is better suited to using arrays because MATLAB is an interpreted language.

A.2.1 Entering a Matrix

There are a few basic conventions of entering a matrix in MATLAB, which include

• Separating the elements of a row with blanks or commas.
• Using a semicolon ';' to indicate the end of each row.
• Surrounding the entire list of elements with square brackets, [ ].

>> A = [3 0 1; 1 2 1; 1 1 1]   % columns separated by a space
                               % rows separated by ";"
A =
     3     0     1
     1     2     1
     1     1     1

A.2.2 Arithmetic Operations

MATLAB uses familiar arithmetic operators and precedence rules, but unlike most programming languages, these expressions involve entire matrices. The common matrix operators used in MATLAB are listed as follows:

+    addition                 -    subtraction
*    multiplication           ^    power
'    transpose                .'   transpose
\    left division            /    right division
.*   element-wise multiplication
.^   element-wise power
./   element-wise right division

>> X = [10 10 20]';   % semicolon suppresses output of X
>> A*X                % A is 3x3, X is 3x1 and X' is 1x3
                      % so A*X is 3x1
ans =
    50
    50
    40
>> y = A\X            % y is the solution of Ay = X
y =
  -10.0000
  -10.0000
   40.0000

>> A.*A               % ".*" multiplies corresponding elements of
                      % matching matrices; this is equivalent to A.^2
ans =
     9     0     1
     1     4     1
     1     1     1

A.2.3 Logical Operations

The relational operators in MATLAB are

<    less than                >    greater than
<=   less-than-or-equal       >=   greater-than-or-equal
==   equal                    ~=   not equal
&    (logical) and            |    (logical) or
~    (logical) not

When relational operators are applied to scalars, 0 represents false and 1 represents true.

A.2.4 Matrix Functions

These extra matrix functions are helpful in creating and manipulating arrays:

eye     identity matrix            ones    matrix of ones
zeros   matrix of zeros            diag    diagonal matrix
rand    matrix of random U(0,1)    inv     matrix inverse
det     matrix determinant         rank    rank of matrix
find    indices of nonzero entries
norm    matrix or vector norm

A.3 CREATING FUNCTIONS IN MATLAB

Along with the extensive collection of existing MATLAB functions, you can create your own problem-specific function using input variables and generating array or graphical output. Once you look at a simple example, you can easily see how a function is constructed. For example, here is a way to compute the PDF of a triangular distribution, centered at zero with the support [-c, c]:

function y = tripdf(x,c)
y1 = max(0, c - abs(x)) / c^2;
y = y1;

The function starts with the line function y = functionname(input) where y is just a dummy variable assigned as function output at the end of the function. Local variables (such as y1) can be defined and combined with input variables (x, c) and the output can be in scalar or matrix form. Once the function is named, it will override any previous function with the same name (so try not to call your function "sort", "inv" or any other known MATLAB function you might want to use later).

The function can be typed and saved as an m-file (i.e., tripdf.m) because that is how MATLAB recognizes an external file with executable code. Alternatively, you can type the entire function (line by line) directly into the program, but it won't be automatically saved after you finish. Then you can "call" the new function as

>> v = tripdf(0:4,3)
v =
    0.3333    0.2222    0.1111         0         0

>> tripdf(-1,2) <= 0.5   % = 1 if statement is true
ans =
     1

It is also possible to define a function as a variable. For example, if you want to define a truncated (and unnormalized) normal PDF, use the following command

>> tnormpdf = @(x,mu,sig,left,right) ...
       normpdf(x,mu,sig).*(x>left & x<right);
>> tnormpdf(-3:3,0,1,-2,2)
ans =
         0         0    0.2420    0.3989    0.2420         0         0

The tnormpdf function does not integrate to 1. To normalize it, one can divide the result by (normcdf(right,mu,sig) - normcdf(left,mu,sig)).

A.4 IMPORTING AND EXPORTING DATA

As a first step of data analysis, we may need to import data from external sources. The most common types of files used in MATLAB statistical computing are MATLAB data files, text files, and spreadsheet files. The MATLAB data file has the extension name *.mat. Here is an example of importing such data into the MATLAB workspace.

A.4.1 MAT Files

You can use the command whos to look at what variables are in the data file.

>> whos -file dataexample
  Name        Size      Bytes  Class
  Sigma       2x2          32  double array
  ans         1x1           8  double array
  mu          1x2          16  double array
  xx        500x2        8000  double array
Grand total is 1007 elements using 8056 bytes

Then you can use the command load to load all variables in this data file.

>> clear   % clear variables in the workspace
>> load dataexample
>> whos    % check what variables are in the workspace
  Name        Size      Bytes  Class
  Sigma       2x2          32  double array
  ans         1x1           8  double array
  mu          1x2          16  double array
  xx        500x2        8000  double array
Grand total is 1007 elements using 8056 bytes

In some cases, you may only want to load some variables in the MAT file to the workspace. Here is how you can do it.

>> clear
>> varlist = {'Sigma','mu'};   % Create a list of variables
>> load('dataexample.mat', varlist{:})
>> clear varlist               % remove varlist from workspace
>> whos                        % see what is in the workspace
  Name        Size      Bytes  Class
  Sigma       2x2          32  double array
  mu          1x2          16  double array
Grand total is 6 elements using 48 bytes

Another way of creating variables of interest is to use an index.

>> clear
>> vars = whos('-file', 'dataexample.mat');
>> load('dataexample.mat', vars([1,3]).name)

If you do not want to use full variable names, but want to use some patterns in these names, the load command can be used with a '-regexp' option. The following command will load the same variables as the previous one.

>> load('dataexample.mat', '-regexp', '^S|^m');

Text files usually have the extension name *.txt, *.dat, *.csv, and so forth.

A.4.2 Text Files

If the data in the text file are organized as a matrix, you can still use load to import the data into the workspace.

>> load mytextdata.dat
>> mytextdata
mytextdata =
   -0.3097    0.2950   -0.1681   -1.4250
   -1.5219   -0.3927   -0.6873    0.4615
    0.8265    0.5759   -0.9907    1.0915
   -0.6130   -1.1414   -0.0498   -1.0443
    0.9597    0.0611    0.7193   -2.8428
    1.9730    0.0123   -0.2831    0.9968

You can also assign the loaded data to be stored in a new variable.

>> x = load('mytextdata.dat');

The command load will not work if the text file is not organized in matrix form. For example, suppose you have a text file mydata.txt

>> type mydata.txt
var1      var2      var3      var4      name
-0.3097   0.2950    -0.1681   -1.4250   Olive
-1.5219   -0.3927   -0.6873   0.4615    Richard
0.8265    0.5759    -0.9907   1.0915    Dwayne
-0.6130   -1.1414   -0.0498   -1.0443   Edwin
0.9597    0.0611    0.7193    -2.8428   Sheryl
1.9730    0.0123    -0.2831   0.9968    Frank

You should use a new function textread to import variables to the workspace.

>> [var1,var2,var3,var4,str] = ...
       textread('mydata.txt','%f%f%f%f%s','headerlines',1);

Alternatively, you can use textscan to finish the import.

>> fid = fopen('mydata.txt');
>> c = textscan(fid, '%f%f%f%f%s', 'headerLines', 1);
>> fclose(fid);
>> [c{1:4}]   % var1 - var4
ans =
   -0.3097    0.2950   -0.1681   -1.4250
   -1.5219   -0.3927   -0.6873    0.4615
    0.8265    0.5759   -0.9907    1.0915
   -0.6130   -1.1414   -0.0498   -1.0443
    0.9597    0.0611    0.7193   -2.8428
    1.9730    0.0123   -0.2831    0.9968
>> c{5}
ans =
    'Olive'
    'Richard'
    'Dwayne'
    'Edwin'
    'Sheryl'
    'Frank'

Comma-separated values files are useful when exchanging data. Given the file data.csv that contains the comma-separated values

>> type data.csv
02, 04, 06, 08, 10, 12
03, 06, 09, 12, 15, 18
05, 10, 15, 20, 25, 30
07, 14, 21, 28, 35, 42
11, 22, 33, 44, 55, 66

you can use csvread to read the entire file into the workspace

>> csvread('data.csv')
ans =
     2     4     6     8    10    12
     3     6     9    12    15    18
     5    10    15    20    25    30
     7    14    21    28    35    42
    11    22    33    44    55    66

A.4.3 Spreadsheet Files

Data from a spreadsheet can be imported into the workspace using the function xlsread.

>> [NUMERIC,TXT,RAW] = xlsread('data.xls');
>> NUMERIC
NUMERIC =
    1.0000    0.3000       NaN
    2.0000    0.4500       NaN
    3.0000    0.3000   12.0000
    4.0000    0.3500    5.0000
    5.0000    0.3500    5.0000
    6.0000    0.3500   10.0000
    7.0000    0.3500   13.0000
    8.0000    0.3500    5.0000
    9.0000    0.3500   23.0000
>> TXT
TXT =
    'Date'        'var1'    'var2'    'var3'    'name'
    '1/1/2001'    ''        ''        ''        'Frank'
    '1/2/2001'    ''        ''        ''        ''
    '1/3/2001'    ''        ''        ''        'Sheryl'
    '1/4/2001'    ''        ''        ''        ''
    '1/5/2001'    ''        ''        ''        'Richard'
    '1/6/2001'    ''        ''        ''        'Olive'
    '1/7/2001'    ''        ''        ''        'Dwayne'
    '1/8/2001'    ''        ''        ''        'Edwin'
    '1/9/2001'    ''        ''        ''        'Stan'
>> RAW
RAW =
    'Date'        'var1'    'var2'       'var3'    'name'
    '1/1/2001'    [1]       [0.3000]     [NaN]     'Frank'
    '1/2/2001'    [2]       [0.4500]     [NaN]     [NaN]
    '1/3/2001'    [3]       [0.3000]     [12]      'Sheryl'
    '1/4/2001'    [4]       [0.3500]     [5]       [NaN]
    '1/5/2001'    [5]       [0.3500]     [5]       'Richard'
    '1/6/2001'    [6]       [0.3500]     [10]      'Olive'
    '1/7/2001'    [7]       [0.3500]     [13]      'Dwayne'
    '1/8/2001'    [8]       [0.3500]     [5]       'Edwin'
    '1/9/2001'    [9]       [0.3500]     [23]      'Stan'

It is also possible to specify the sheet name of the xls file as the source of the data.

>> NUMERIC = xlsread('data.xls','rnd');   % read data from
                                          % a sheet named rnd

From an xls file, you can get data from a specified region in a named sheet:

>> NUMERIC = xlsread('data.xls','data','b2:c10');

The following command also allows you to do interactive region selection:

>> NUMERIC = xlsread('data.xls',-1);

The simplest way to save the variables from a workspace to a permanent file in the format of a MAT file is to use the command save. If you have a single matrix to save, save filename varname -ascii will export the result to a text file. You can also save a numeric array or cell array in an Excel workbook using xlswrite.

A.5 DATA VISUALIZATION

A.5.1 Scatter Plot

A scatter plot is a useful summary of a set of bivariate data (two variables), usually drawn before working out a linear correlation coefficient or fitting a regression line. It gives a good visual picture of the relationship between the two variables, and aids the interpretation of the correlation coefficient or regression model. In MATLAB, a simple way to make a scatter plot is to use the command plot. Fig. A.9 gives the result of the following MATLAB commands:

>> x = rand(1000,1);
>> y = .5*x + 5*x.^2 + .3*randn(1000,1);
>> plot(x,y,'.')

However, this is not enough if you are dealing with more than two variables. In this case, the function plotmatrix should be used instead (Fig. A.10).
DATAVISUALIZATION381’-10.20.40.60.81Fig.A.9Scatterplotof(z,y)forx=rand(1000,l)andy=.5*x+5*x.2+.3*randn(1000,1).>>x=randn(50,3);>>y=x*[-l21;201;l-23;l’;>>plotmatrix(y)Inclassificationproblems,itisalsousefultolookatscatterplotmatrixwithgroupingvariable(Fig.A.11).>>loadcarsmall;>>X=[MPG,Acceleration,Displacement,Weight,Horsepower];>>varNames={’MPG’’Acceleration’’Displacement’...’Weight’’Horsepower’);>>gplotmatrix(X,[I,Cylinders,’bgrcm’,[I,[I,’on’,’hist’,varNames);>>set(gcf,’color’,’white’)A.5.2BoxPlotBoxplotisanexcellenttoolforconveyinglocationandvariationinformationindatasets.particularlyfordetectingandillustratinglocationandvariationchangesbetweendifferentgroupsofdata.HereisanexampleofhowMATLABmakesaboxplot(Fig.A.12).>>loadcarsmall>>boxplot(MPG,Origin,’grouporder’,...{’France’’Germany’’Italy’’Japan’’Sweden’’USA’))>>set(gcf,’color’,’white’) 382AppendixA:MATLABFig.A.10Simulateddatavisualizedbyplotmatrix.Fig.A.llScatterplotmatrixforCarData. DATAVlSUALlZAJlON383'45-+40--T-I35III;30-QI25-B1-I20-IIIEl0II15-II10-IFig.A.12BoxplotforCarData.A.5.3HistogramandDensityPlotAhistogramofunivariatedatacanbeplottedusinghist(Fig.A.13).>>hist(randn(100,l)whileathree-dimensionalhistogramofbivariatedataisplottedusinghist3,(Fig.A.14);>>mu=[I-11;Sigma=L.9.4;.4.31;>>r=mvnrnd(mu,Sigma,500);>>hist3(r)Ifyoulikeasmootherdensityplot.youmayturntoakerneldensityordistributionestimateimplementedinksdensity(Fig.A.15).Also,inrecentversionsofMATLAByouhavetheoptionofnotaskingforoutputsfromtheksdensity,andthefunctionplotstheresultsdirectly.>>[y,x]=ksdensity(randn(100,l));>>plot(x,y)A.5.4PlottingFunctionListHereisacompletelistofstatisticalplottingfunctionsavailableinMATLAB 384AppendixA:MATLABFig.A.13Histogramforsimulatedrandomnormaldata.Fig.A.14Spatialhistogramforsimulatedtwo-dimensionalrandomnormaldata. 
DATAV/SUAL/ZAT/ON385Fig.A.15Kerneldensityestimatorforsimulatedrandomnormaldata.andrewsplot-Andrewsplotformultivariated,ata.bar-Bargraph.biplot-Biplotofvariable/factorcoefficientsandscores.boxplot-Boxplotsofadatamatrix(one!percolumn).cdfplot-Plotofempiricalcumulativedistributionfunction.contour-Contourplot.ecdf-EmpiricalCDF(Kaplan-Meierestimate).ecdfhist-HistogramcalculatedfromempiricalCDF.fplot-Plotsscalarfunction$f(x)$atvaluesof$x$.fsurfht-Interactivecontourplotofafunction.gline-Point,dragandclicklinedrawingonfigures.glyphplot-PlotstarsorChernofffacesfo:rmultivariatedata.gname-Interactivepointlabelinginx-yplots.gplotmatrix-Matrixofscatterplotsgroupedbyacommonvariable.gscatter-Scatterplotoftwovariablesg:roupedbyathird.hist-Histogram(inMATLABtoolbox).hist3-Three-dimensionalhistogramofbivariatedata.ksdensity-Kernelsmoothingdensityestimation.lsline-Addleast-squarefitlinetoscatterplot.normplot-Normalprobabilityplot.parallelcoords-Parallelcoordinatesplotformultivariatedata.probplot-Probabilityplot.q¶Plot-Quantile-Quantileplot.refcurve-Referencepolynomialcurve.refline-Referenceline.stairs-Stair-stepofywithjumpsatpt3intsx.surfht-Interactivecontourplotofadatagrid. 
wblplot - Weibull probability plot.

A.6 STATISTICS

For your convenience, let's look at a list of functions that can be used to compute summary statistics from data.

corr - Linear or rank correlation coefficient.
corrcoef - Correlation coefficient with confidence intervals.
cov - Covariance.
crosstab - Cross tabulation.
geomean - Geometric mean.
grpstats - Summary statistics by group.
harmmean - Harmonic mean.
iqr - Interquartile range.
kurtosis - Kurtosis.
mad - Median Absolute Deviation.
mean - Sample average (in MATLAB toolbox).
median - 50th percentile of a sample.
moment - Moments of a sample.
nancov - Covariance matrix ignoring NaNs.
nanmax - Maximum ignoring NaNs.
nanmean - Mean ignoring NaNs.
nanmedian - Median ignoring NaNs.
nanmin - Minimum ignoring NaNs.
nanstd - Standard deviation ignoring NaNs.
nansum - Sum ignoring NaNs.
nanvar - Variance ignoring NaNs.
partialcorr - Linear or rank partial correlation coefficient.
prctile - Percentiles.
quantile - Quantiles.
range - Range.
skewness - Skewness.
std - Standard deviation (in MATLAB toolbox).
tabulate - Frequency table.
trimmean - Trimmed mean.
var - Variance.

A.6.1 Distributions

| Distribution         | CDF      | PDF      | Inverse CDF | RNG      |
| Beta                 | betacdf  | betapdf  | betainv     | betarnd  |
| Binomial             | binocdf  | binopdf  | binoinv     | binornd  |
| Chi square           | chi2cdf  | chi2pdf  | chi2inv     | chi2rnd  |
| Exponential          | expcdf   | exppdf   | expinv      | exprnd   |
| Extreme value        | evcdf    | evpdf    | evinv       | evrnd    |
| F                    | fcdf     | fpdf     | finv        | frnd     |
| Gamma                | gamcdf   | gampdf   | gaminv      | gamrnd   |
| Geometric            | geocdf   | geopdf   | geoinv      | geornd   |
| Hypergeometric       | hygecdf  | hygepdf  | hygeinv     | hygernd  |
| Lognormal            | logncdf  | lognpdf  | logninv     | lognrnd  |
| Multivariate normal  | mvncdf   | mvnpdf   | -           | mvnrnd   |
| Negative binomial    | nbincdf  | nbinpdf  | nbininv     | nbinrnd  |
| Normal (Gaussian)    | normcdf  | normpdf  | norminv     | normrnd  |
| Poisson              | poisscdf | poisspdf | poissinv    | poissrnd |
| Rayleigh             | raylcdf  | raylpdf  | raylinv     | raylrnd  |
| t                    | tcdf     | tpdf     | tinv        | trnd     |
| Discrete uniform     | unidcdf  | unidpdf  | unidinv     | unidrnd  |
| Uniform distribution | unifcdf  | unifpdf  | unifinv     | unifrnd  |
| Weibull              | wblcdf   | wblpdf   | wblinv      | wblrnd   |

A.6.2 Distribution Fitting

betafit - Beta parameter estimation.
binofit - Binomial parameter estimation.
evfit - Extreme value parameter estimation.
expfit - Exponential parameter estimation.
gamfit - Gamma parameter estimation.
gevfit - Generalized extreme value parameter estimation.
gpfit - Generalized Pareto
parameter estimation.
lognfit - Lognormal parameter estimation.
mle - Maximum likelihood estimation (MLE).
mlecov - Asymptotic covariance matrix of MLE.
nbinfit - Negative binomial parameter estimation.
normfit - Normal parameter estimation.
poissfit - Poisson parameter estimation.
raylfit - Rayleigh parameter estimation.
unifit - Uniform parameter estimation.
wblfit - Weibull parameter estimation.

In addition to the command line functions listed above, there is also a GUI for distribution fitting. You can use the command dfittool to invoke this tool (Fig. A.16).

>> dfittool

Fig. A.16 GUI for dfittool.

A.6.3 Nonparametric Procedures

kstest - Kolmogorov-Smirnov one-sample test.
kstest2 - Kolmogorov-Smirnov two-sample test.
mtest - Cramer Von Mises test for normality.
dagosptest - D'Agostino-Pearson's test for normality.
runs_test - Runs test.
sign_test1 - Two-sample sign test.
kruskal_wallis - Kruskal-Wallis rank test.
friedman - Friedman randomized block design test.
kendall - Computes Kendall's tau correlation statistic.
spear - Spearman correlation coefficient.
wmw - Wilcoxon-Mann-Whitney two-sample test.
tablerxc - Test of independence for $r$ x $c$ table.
mantel_haenszel - Mantel-Haenszel statistic for $2$ x $2$ tables.

The listed nonparametric functions that are not distributed with MATLAB or its Statistics Toolbox can be downloaded from the book home page.
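To make the two-sample Kolmogorov-Smirnov entry in the list above concrete, the following is a minimal sketch of the statistic that kstest2 is built around: the largest vertical distance between the two empirical distribution functions. It uses only base MATLAB; the data values and the names x1, x2, pts, F1, F2, and D are made up here for illustration and are not taken from the book.

```matlab
% Two small illustrative samples (hypothetical data).
x1 = [0.1 0.4 0.7 1.2 1.9];
x2 = [0.9 1.5 2.2 2.8 3.1];

% Evaluate both empirical CDFs at every pooled observation.
pts = sort([x1(:); x2(:)]);
F1  = arrayfun(@(t) mean(x1 <= t), pts);   % ECDF of the first sample
F2  = arrayfun(@(t) mean(x2 <= t), pts);   % ECDF of the second sample

% Two-sample Kolmogorov-Smirnov distance.
D = max(abs(F1 - F2));

% With the Statistics Toolbox installed, the same quantity is returned
% as the third output of kstest2:
%   [h, p, ks2stat] = kstest2(x1, x2);
```

For these data D equals 0.6. The toolbox function additionally converts the distance into a p-value and a reject/accept decision, which the sketch does not attempt.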
A.6.4 Regression Models

A.6.4.1 Ordinary Least Squares (OLS)

The most straightforward way of implementing OLS is based on the normal equations.

>> x = rand(20,1);
>> y = 2 + 3*x + randn(size(x));
>> X = [ones(length(x),1), x];
>> b = inv(X'*X)*X'*y % normal equation
b =
    1.8778
    3.4689

A better solution uses backslash because it is more numerically stable than inv.

>> b2 = X\y
b2 =
    1.8778
    3.4689

The pseudoinverse function pinv is also an option. It too is numerically stable, but it will yield subtly different results when your matrix is singular or nearly so. Is pinv better? There are arguments for both backslash and pinv. The difference really lies in what happens on singular or nearly singular matrices. pinv will not work on sparse problems, and because pinv relies on the singular value decomposition, it may be slower for large problems.

>> b3 = pinv(X)*y
b3 =
    1.8778
    3.4689

Large-scale problems where X is sparse may sometimes benefit from a sparse iterative solution. lsqr is an iterative solver.

>> b4 = lsqr(X,y,1.e-13,10)
lsqr converged at iteration 2 to a solution with relative residual 0.33
b4 =
    1.8778
    3.4689

There is another option, lscov. lscov is designed to handle problems where the data covariance matrix is known. It can also solve a weighted regression problem.

>> b5 = lscov(X,y)
b5 =
    1.8778
    3.4689

Directly related to the backslash solution is one based on the QR factorization. If our over-determined system of equations to solve is $Xb = y$, then a quick look at the normal equations,

$b = (X'X)^{-1}X'y$,

combined with the QR factorization of X, $X = QR$, yields

$b = (R'Q'QR)^{-1}R'Q'y$.

Of course, we know that Q is an orthogonal matrix, so $Q'Q$ is an identity matrix:

$b = (R'R)^{-1}R'Q'y$.

If R is non-singular, then $(R'R)^{-1} = R^{-1}R'^{-1}$, so we can further reduce to

$b = R^{-1}Q'y$.

This solution is also useful for computing confidence intervals on the parameters.

>> [Q,R] = qr(X,0);
>> b6 = R\(Q'*y)
b6 =
    1.8778
    3.4689

A.6.4.2 Weighted Least Squares (WLS)

Weighted Least Squares (WLS) is a special case of Generalized Least Squares (GLS). It should be applied when
there is heteroscedasticity in the regression, i.e., the variance of the error term is not constant across observations. The optimal weights should be inversely proportional to the error variances.

>> x = (1:10)';
>> wgts = 1./rand(size(x));
>> y = 2 + 3*x + wgts.*randn(size(x));
>> X = [ones(length(x),1), x];
>> b7 = lscov(X,y,wgts)
b7 =
  -89.6867
   27.9335

An alternative way of doing WLS is to transform the independent and dependent variables (scaling each row by the square root of its weight) so that OLS can be applied to the transformed data.

coef8 =
  -89.6867
   27.9335

A.6.4.3 Iterative Reweighted Least Squares (IRLS)

IRLS can be used for multiple purposes. One is to get robust estimates by reducing the effect of outliers. Another is to fit a generalized linear model, as described in Section A.6.6. MATLAB provides a function robustfit, which performs iteratively reweighted least squares estimation and yields robust coefficient estimates.

brob =
   10.5208
   -2.0902

A.6.4.4 Nonlinear Least Squares

MATLAB provides a function nlinfit which performs nonlinear least squares estimation.

>> mymodel = @(beta,x) (beta(1)*x(:,2) - x(:,3)/beta(5)) ./ ...
   (1 + beta(2)*x(:,1) + beta(3)*x(:,2) + beta(4)*x(:,3));
>> load reaction;
>> beta = nlinfit(reactants,rate,mymodel,ones(5,1))
beta =
    1.2526
    0.0628
    0.0400
    0.1124
    1.1914

A.6.4.5 Other Regression Functions

coxphfit - Cox proportional hazards regression.
nlintool - Graphical tool for prediction in nonlinear models.
nlpredci - Confidence intervals for prediction in nonlinear models.
nlparci - Confidence intervals for parameters in nonlinear models.
polyconf - Polynomial evaluation with confidence intervals.
polyfit - Least-squares polynomial fitting.
polyval - Predicted values for polynomial functions.
rcoplot - Residuals case order plot.
regress - Multiple linear regression; also returns the R-square statistic, the F statistic and p value for the full model, and an estimate of the error variance.
regstats - Regression diagnostics for linear regression.
ridge - Ridge regression.
rstool - Multidimensional response surface visualization (RSM).
stepwise - Interactive tool for stepwise regression.
stepwisefit - Non-interactive stepwise regression.

A.6.5 ANOVA

The following set of functions can be used to perform ANOVA in a parametric or nonparametric fashion.

anova1 - One-way analysis of variance.
anova2 - Two-way analysis of variance.
anovan - n-way analysis of variance.
aoctool - Interactive tool for analysis of covariance.
friedman - Friedman's test (nonparametric two-way anova).
kruskalwallis - Kruskal-Wallis test (nonparametric one-way anova).

A.6.6 Generalized Linear Models

MATLAB provides the glmfit and glmval functions to fit generalized linear models. These models include Poisson regression, gamma regression, and binary probit or logistic regression. The functions allow you to specify a link function that relates the distribution parameters to the predictors. It is also
possible to fit a weighted generalized linear model. Fig. A.17 is a result of the following MATLAB commands:

>> x = [2100 2300 2500 2700 2900 3100 3300 3500 3700 3900 4100 4300]';
>> n = [48 42 31 34 31 21 23 23 21 16 17 21]';
>> y = [1 2 0 3 8 8 14 17 19 15 17 21]';
>> b = glmfit(x, [y n], 'binomial', 'link', 'probit');
>> yfit = glmval(b, x, 'probit', 'size', n);
>> plot(x, y./n, 'o', x, yfit./n, '-')

Fig. A.17 Probit regression example.

A.6.7 Hypothesis Testing

MATLAB also provides a set of functions to perform some important statistical tests. These tests include tests on location or dispersion. For example, ttest and ttest2 can be used to do a t test.

Hypothesis Tests.

ansaribradley - Ansari-Bradley two-sample test for equal dispersions.
dwtest - Durbin-Watson test for autocorrelation in regression.
ranksum - Wilcoxon rank sum test (independent samples).
runstest - Runs test for randomness.
signrank - Wilcoxon signed rank test (paired samples).
signtest - Sign test (paired samples).
ztest - Z test.
ttest - One sample t test.
ttest2 - Two sample t test.
vartest - One-sample test of variance.
vartest2 - Two-sample F test for equal variances.
vartestn - Test for equal variances across multiple groups.

Distribution tests, sometimes called goodness of fit tests, are also included. For example, kstest and kstest2 are functions to perform a Kolmogorov-Smirnov test.

Distribution Testing.

chi2gof - Chi-square goodness-of-fit test.
jbtest - Jarque-Bera test of normality.
kstest - Kolmogorov-Smirnov test for one sample.
kstest2 - Kolmogorov-Smirnov test for two samples.
lillietest - Lilliefors test of normality.

A.6.8 Statistical Learning

The following functions provide tools to develop data mining/machine learning programs.

Factor Models

factoran - Factor analysis.
pcacov - Principal components from covariance matrix.
pcares - Residuals from principal components.
princomp - Principal components analysis from raw data.
rotatefactors - Rotation of FA or PCA loadings.

Decision Tree Techniques.

treedisp - Display decision tree.
treefit - Fit data using a classification or regression tree.
treeprune - Prune decision tree or create optimal pruning sequence.
treetest - Estimate error for decision tree.
treeval - Compute fitted values using decision tree.

Discrimination Models

classify - Discriminant analysis with 'linear', 'quadratic', 'diagLinear', 'diagquadratic', or 'mahalanobis' discriminant function.

A.6.9 Bootstrapping

In MATLAB, bootstrp and bootci are used to obtain bootstrap estimates. The former is used to draw bootstrapped samples from data and compute the bootstrapped statistics based on these samples. The latter computes the improved bootstrap confidence intervals, including the BCa interval.

>> load lawdata gpa lsat
>> se = std(bootstrp(1000,@corr,gpa,lsat))
>> bca = bootci(1000,{@corr,gpa,lsat})
se =
    0.1322
bca =
    0.3042
    0.9407

Appendix B: WinBUGS

Beware: MCMC sampling can be dangerous! (Disclaimer from WinBUGS User Manual)

BUGS is freely available software for constructing Bayesian statistical models and evaluating them using MCMC methodology.

BUGS and WinBUGS are distributed freely and are the result of many years of development by a team of statisticians and programmers at the Medical Research Council Biostatistics Research Unit in Cambridge (BUGS and WinBUGS), and more recently by a team at the University of Helsinki (OpenBUGS); see the project pages: http://www.mrc-bsu.cam.ac.uk/bugs/ and http://mathstat.helsinki.fi/openbugs/.

Models are represented by a flexible language, and there is also a graphical feature, DOODLEBUGS, that allows users to specify their models as directed graphs. For complex models DOODLEBUGS can be very useful. As of May 2007, the latest version of WinBUGS is 1.4.1 and of OpenBUGS 3.0.
B.1 USING WINBUGS

We start the introduction to WinBUGS with a simple regression example. Consider the model

$y_i \mid \mu_i, \tau \sim N(\mu_i, \tau), \quad i = 1, \ldots, n,$
$\mu_i = \alpha + \beta (x_i - \bar{x}),$
$\alpha \sim N(0, 10^{-4}),$
$\beta \sim N(0, 10^{-4}),$
$\tau \sim Gamma(0.001, 0.001).$

The scale in normal distributions here is parameterized in terms of a precision parameter $\tau$, which is the reciprocal of the variance, $\tau = 1/\sigma^2$. Natural distributions for the precision parameters are gamma, and small values of the precision reflect the flatness (noninformativeness) of the priors. The parameters $\alpha$ and $\beta$ are less correlated if predictors $x_i - \bar{x}$ are used instead of $x_i$. Assume that (x,y)-pairs (1,1), (2,3), (3,3), (4,3), and (5,5) are observed.

Estimators in classical, least squares regression of $y$ on $x - \bar{x}$ are given in the following table.

| Coef  | LSEstimate | SE Coef | t    | P     |
| ALPHA | 3.0000     | 0.3266  | 9.19 | 0.003 |
| BETA  | 0.8000     | 0.2309  | 3.46 | 0.041 |

S = 0.730297   R-Sq = 80.0%   R-Sq(adj) = 73.3%

How about Bayesian estimators? We will find the estimators by MCMC calculations as means on the simulated posteriors. Assume that the initial values of parameters are $\alpha_0 = 0.1$, $\beta_0 = 0.6$, and $\tau_0 = 1$. Start BUGS and input the following code in [File > New].

# A simple regression
model{
  for (i in 1:N) {
    Y[i] ~ dnorm(mu[i], tau);
    mu[i] <- alpha + beta * (x[i] - x.bar);
  }
  x.bar <- mean(x[]);
  alpha ~ dnorm(0, 0.0001);
  beta ~ dnorm(0, 0.0001);
  tau ~ dgamma(0.001, 0.001);
  sigma <- 1.0/sqrt(tau);
}
#-----------------------------
# these are observations
list(x = c(1,2,3,4,5), Y = c(1,3,3,3,5), N = 5);
#-----------------------------
# the initial values
list(alpha = 0.1, beta = 0.6, tau = 1);

Next, put the cursor at an arbitrary position within the scope of model, which is delimited by wiggly brackets. Select the Model menu and open Specification. The Specification Tool window will pop out. If your model is highlighted, you may check model in the Specification Tool window. If the model is correct, the response on the lower bar of the BUGS window should be: model is syntactically correct.

Next, highlight the "list" statement in the data part of your code. In the Specification Tool window select load data. If the data are in the correct format, you should receive the response on the bottom bar of the BUGS window: data loaded. You will need to compile your model in order to activate the inits buttons. Select compile in the Specification Tool window. The response should be: model compiled, and the buttons load inits and gen inits become active.

Finally, highlight the "list" statement in the initials part of your code and in the Specification Tool window select load inits. The response should be: model is initialized, and this finishes reading in the model. If the response is initial values loaded but this or other chain contain uninitialized variables, click on the gen inits button. The response should be: initial values generated, model initialized.

Now you are ready to burn in some simulations and at the same time check that the program is working. In the Model menu, choose Update... and open Update Tool to check if your model updates.

From the Inference menu, open Samples.... A window titled Sample Monitor Tool will pop out. In the node sub-window input the names of the variables you want to monitor. In this case, the variables are alpha, beta, and tau. If you correctly input the variable, the set button becomes active and you should set the variable. Do this for all 3 variables of interest. In fact, sigma, as a transformation of tau, is available as well.

Now choose alpha from the subwindow in Sample Monitor Tool. All of the buttons (clear, set, trace, history, density, stats, coda, quantiles, bgr diag, auto cor) are now active. Return to Update Tool and select the desired number of simulations, say 10000, in the updates subwindow. Press the update button.

Return to Sample Monitor Tool and check trace for the part of the MC trace for $\alpha$, history for the complete trace, density for a density estimator of $\alpha$, etc. For example, pressing the stats button will produce something like the following table.

|       | mean  | sd    | MC error | val2.5pc | median | val97.5pc | start | sample |
| alpha | 3.003 | 0.549 | 0.003614 | 1.977    | 3.004  | 4.057     | 10000 | 20001  |

The mean 3.003 is the Bayes estimator (as the mean from the sample from the posterior for $\alpha$). There are two precision outputs, sd and MC error. The former is an estimator of the standard deviation of the posterior and can be improved by increasing the sample size but not the number of simulations. The latter is the error of simulation and can be improved by additional simulations. The 95% credible set is bounded by val2.5pc and val97.5pc, which are the 0.025 and 0.975 (empirical) quantiles from the posterior. The empirical median of the posterior is given by median. The outputs start and sample show the starting index for the simulations (after burn-in) and the available number of simulations.

Fig. B.18 Traces of the four parameters from the simple example: (a) $\alpha$, (b) $\beta$, (c) $\tau$, and (d) $\sigma$ from WinBUGS. Data are plotted in MATLAB after being exported from WinBUGS.

For all parameters a comparative table is

|       | mean   | sd     | MC error | val2.5pc | median | val97.5pc | start | sample |
| alpha | 3.003  | 0.549  | 0.003614 | 1.977    | 3.004  | 4.057     | 10000 | 20001  |
| beta  | 0.7994 | 0.3768 | 0.002897 | 0.07088  | 0.7988 | 1.534     | 10000 | 20001  |
| tau   | 1.875  | 1.521  | 0.01574  | 0.1399   | 1.471  | 5.851     | 10000 | 20001  |
| sigma | 1.006  | 0.7153 | 0.009742 | 0.4134   | 0.8244 | 2.674     | 10000 | 20001  |

If you want to save the trace for $\alpha$ in a file and process it in MATLAB, say, select coda and the data window will open with an information window as well. Keep the data window active and select Save As from the File menu. Save the $\alpha$s in alphas.txt, where they will be ready to be imported to MATLAB.

Kevin Murphy led the project for communication between WinBUGS and MATLAB. His suite MATBUGS, maintained by several researchers, communicates with WinBUGS directly from MATLAB.

B.2 BUILT-IN FUNCTIONS AND COMMON DISTRIBUTIONS IN BUGS

This section contains two tables: one with the list of built-in functions and the second with the list of available distributions.

The first-time WinBUGS user may be disappointed by the selection of built-in functions: the set is minimal but sufficient. The full list of distributions in WinBUGS can be found in Help > WinBUGS User Manual under The_BUGS_language:_stochastic_nodes > Distributions. BUGS also allows for construction of distributions that are not in the default list. In Table B.23 a list of important continuous and discrete distributions, with their BUGS syntax and parametrization, is provided. BUGS has the capability to define custom distributions, both as likelihood or as a prior, via the so-called zero-Poisson device.

Table B.22 Built-in Functions in WinBUGS

| BUGS Code   | function                                              |
| abs(y)      | $|y|$                                                 |
| cloglog(y)  | $\ln(-\ln(1-y))$                                      |
| cos(y)      | $\cos(y)$                                             |
| equals(y,z) | 1 if $y = z$; 0 otherwise                             |
| exp(y)      | $\exp(y)$                                             |
| inprod(y,z) | $\sum_i y_i z_i$                                      |
| inverse(y)  | $y^{-1}$ for symmetric positive-definite matrix $y$   |
| log(y)      | $\ln(y)$                                              |
| logfact(y)  | $\ln(y!)$                                             |
| loggam(y)   | $\ln(\Gamma(y))$                                      |
| logit(y)    | $\ln(y/(1-y))$                                        |
| max(y,z)    | $y$ if $y > z$; $z$ otherwise                         |
| mean(y)     | $n^{-1} \sum_i y_i$, $n = \dim(y)$                    |
| min(y,z)    | $y$ if $y < z$; $z$ otherwise                         |
| ...         | ...                                                   |
| step(y)     | 1 if $y \geq 0$; 0 otherwise                          |
| sum(y)      | $\sum_i y_i$                                          |
| trunc(y)    | greatest integer less than or equal to $y$            |

Table B.23 Built-in distributions with BUGS names and their parametrizations.

| Distribution           | BUGS Code                 |
| Bernoulli              | x ~ dbern(p)              |
| Binomial               | x ~ dbin(p, n)            |
| Categorical            | x ~ dcat(p[])             |
| Poisson                | x ~ dpois(lambda)         |
| Beta                   | x ~ dbeta(a, b)           |
| Chi-square             | x ~ dchisqr(k)            |
| Double Exponential     | x ~ ddexp(mu, tau)        |
| Exponential            | x ~ dexp(lambda)          |
| Flat                   | x ~ dflat()               |
| Gamma                  | x ~ dgamma(a, b)          |
| Normal                 | x ~ dnorm(mu, tau)        |
| Pareto                 | x ~ dpar(alpha, c)        |
| Student-t              | x ~ dt(mu, tau, k)        |
| Uniform                | x ~ dunif(a, b)           |
| Weibull                | x ~ dweib(v, lambda)      |
| Multinomial            | x[] ~ dmulti(p[], N)      |
| Dirichlet              | p[] ~ ddirch(alpha[])     |
| Multivariate Normal    | x[] ~ dmnorm(mu[], T[,])  |
| Multivariate Student-t | x[] ~ dmt(mu[], T[,], k)  |
| Wishart                | x[,] ~ dwish(R[,], k)     |
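Since the WinBUGS traces are meant to be processed in MATLAB, it may help to have the classical benchmark from Section B.1 available there as well. The following is a minimal sketch, in base MATLAB, that recomputes the least squares table for the five (x,y)-pairs of the simple regression example; the variable names X, b, res, s, se, and r2 are chosen here for illustration only.

```matlab
% Data from the simple regression example of Section B.1.
x = [1 2 3 4 5]';
y = [1 3 3 3 5]';
n = numel(x);

% Centered design matrix: intercept plus (x - mean(x)).
X = [ones(n,1), x - mean(x)];
b = X \ y;                          % least squares estimates [alpha; beta]

res = y - X*b;                      % residuals
s   = sqrt(sum(res.^2)/(n - 2));    % residual standard error S
se  = s*sqrt(diag(inv(X'*X)));      % standard errors of alpha and beta
r2  = 1 - sum(res.^2)/sum((y - mean(y)).^2);   % R-squared
```

The sketch gives b = [3; 0.8], s of about 0.7303, se of about [0.3266; 0.2309], and r2 = 0.8, in agreement with the table. The posterior means reported by WinBUGS (3.003 for alpha and 0.7994 for beta) are close to these values, as one would expect under such flat priors.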
MATLAIBIndexCDFeta,352binup,39KMcdfSM,293biplot,386LMSreg,226bootci,394WavMat.m,270bootsample,290,304andrewsplot,386boxplot,381,386anoval,392cdfplot,386anova2,392chi2cdf,19,157--159,177,387anovan,392chiagof,394ansaribradley,394chiainv,19,387aoctool,392chi2pdf,19,387bar,386chi2rnd,387bessel,11,92ciboot,291,293beta,10classify,394betacdf,21,387clear,37'6betafit,387contour,386betainc,10,75,78conv,272betainv,21,77,387corcoeff,386betapdf,21,70,387corr,298,386betarnd,387corrcoef,290bincdf,38,177cov,386binlow,39coxphfit,197binocdf,14,167,387crosstab,386binofit,387csapi,25:3binoinv,387csvread,378binopdf,14,40,387dagosptest,388binoplot,14dfittool,387binornd,387diff,298 406MATLABINDEXdwtest,394harmean,386dwtr,271hist,206,298,383ecdf,386hist3,386ecdfhist,386histfit,207elm,199hygecdf,16,387evcdf,387hygeinv,16,387evfit,387hygepdf,16,387evinv,387hygernd,16,387evpdf,387idwtr,272evrnd,387inv,388expcdf,18,387iqr,386expfit,387jackrsp,295expinv,18,387jbtest,394exppdf,18,387kdfft2,213exprnd,18,387kendall,388factoran,394kmcdfsm,188factorial,9kruskal-wallis,143,388fcdf,23,387kruskalwallis,392finv,23,387ksdensity,211,383fliplr,272kstest,88,388,394floor,10kstest2,88,388,394fnplt,253kurtosis,386forruns,105lillietest,394fpdf,23,387lmsreg,224fplot,329,386load,376friedman,147,388,392loc-lin,247friedman-pairwise-comparison,147loess,249frnd,387loess2,249fsurfht,386logist,328,329gamcdf,18,387logistic,328gamfit,387logncdf,387gaminv,18,387lognfit,387gamma,10logninv,387gammainc,10lognpdf,387gampdf,18,387lognrnd,387gamrnd,387lpfit,247geocdf,16,387Iscov,224,226,388geoinv,16,387lsline,386geomean,386lsqr,388geopdf,16,387Its,226geornd,16,387mad,386gevfit,387mantelhaenszel,170,388gline,386mean,386glmfit,236,392median,386glmval,236,392medianregress,226glyphplot,386mixture-cla,313gplotmatrix,381mle,387grpstats,386mlecov,387gscatter,344moment,386 
MATLABINDEX407mtest,92,97,388probplot,97,386mvncdf,387probup,104mvninv,387qqgamma,100mvnpdf,387qqnorm,100mvnrnd,383,387qqplot,99,111,386nada-wat,245qqweib,98,100nancov,386quantile,386nanmax,386rand,331nanmean,386randirichlet,352nanmedian,386randg,371.nanstd,386randn,35,380nansum,386range,386nanvar,386rank,118nbincdf,15,387ranksum,5#94nbininv,15,387raylcdf,387nbinpdf,15,387raylfit,387nbinrnd,15,387raylinv,387nchoosek,9raylpdf,387nearneighbor,331raylrnd,387nlinfit,392rcoplot,392nlintool,392refcurve,386nlparci,392refline,386nlpredci,392regress,392normcdf,19,35,387regstats,392normfit,387ridge,392norminv,19,110,387robustfit,392normpdf,19,387rotatefactors,394normplot,386round,331normrnd,387rstool,392parallelcoords,386runs-test,104,388partialcorr,386runstest,394pcacov,394sign-testl,121,388pcares,394signrank,394pinv,388signtest,394plot,35,329,341size,344plotedf,35skewness,386plotmatrix,381softmax,337pluginmu,195sort,298poisscdf,15,387spear,124,388poissfit,387spline,252poissinv,15,387squaredrankstest,134poisspdf,15,387stairs,386poissrnd,15,30,387std,386polyconf,392stepwise,392polyfit,392stepwisefit,392polyval,392surfht,386prctile,386survband,193princomp,394tablerxc,1152,388problow,104tabulate,386 408MATLABINDEXtcdf,20,387unifinv,387textread,378unifit,387textscan,378unifpdf,387tinv,20,387unifrnd,387tnormpdf,375var,386tpdf,20,387vartest,394treedisp,343vartest2,394treefit,343,344,394vartestn,394treeprune,343,394wblcdf,387treetest,344wblfit,387treeval,344,394wblinv,387trimmean,291,293,386wblpdf,387tripdf,375wblrnd,387trnd,387wbplt,386ttest,394whos,376ttest2,394wilcoxon-signed,128type,378wilcoxon-signed2,127unidcdf,387wmw,132,388unidinv,387xlsread,379unidpdf,387xlswrite,380unidrnd,387ztest,394unifcdf,387 
AuthorIndexAnscombe,F.J.,47Buddha,81Bush,C.A,,357Agresti.A.,40.154,327Altman,N.S.,247,258Carter,W.C.,199Anderson,T.W.,90Casella,G.,1,42.62Anscombe,F.J.,47,226Charles,J.A,.299Antoniadis,A.,273,362Chen,M.-H..62Antoniak,C.E.,357Chen,Z.,77Arvin,D.V.,125Chernick,M.R.,302Christensen,R.,356Bai,Z.,77Cleveland,W.,247Baines,L.,205Clopper,C.J.,39Baines,ILI.J..258Clyde,M.,356Balmukand,B.,308Cochran,W.G.,167Bayes,T.,47,48Congdon,P..62Bellman,R.E.,331Conover,W.J.,2,134,148Benford,F.,158Cox,D.R.,196Berger,J.O.,58Cram&,H..'91Berry,D.A.,356Crowder,M.J.,188Best,N.G.,62Crowley,J.,308Bickel,P.J.;174Cummings,7'.L.,177Bigot,J.,273,362Birnbaum,Z.W.,83D'Agostino,12.B.,96Bradley,J.V..2Darling,D.A,.,90Breiman,L.,342Darwin,C.;154Broffitt,J.D.,327Daubechies,I.,266Brown,J.S.,183Davenport,J.M.,146 410AUTHORINDEXDavid,H.A..69Haar,A.,266Davies.L.,212Haenszel,W.,168Davis.T.A..6Hall,W.J.,193Davison.A.C.,302Hammel,E.A.,174deHoog,F.R..256Hart,P.E.,324Delampady,M.,58Hastie,T.,324,336Deming,W.E.,323HealyM.J.R.,307Dempster,A.P.,307Hedges,L.V.,107Donoho,D.,273,276Hendy,M.F.,299Doob,J.,12Hettmansperger,T.P.,1Doucet.H.,177Hill,T.!158Duda,R.O.,324Hinkley,D.V.,302Dunn,O.,327Hoeffding,W.,1Dykstra,R.L.,227Hogg,R.V.,327Hotelling,H.,115Ebert,R.,174Hubble,E.P.,289Efromovich,S.,211Huber,P.J.,222,223Efron,B.,286,292Hume,B.,153Elsner,J.B.,340Hutchinson,M.F.,256Epanechnickov,V.A.,210Escobar,hl.D.,357Ibrahim,J.,62Excoffier,L.,308Iman,R.L.,134,146Fabius,J.,350Johnson,R.:166Fahrmeir,L.,236Johnson,S.,166Falconer,S.,121Johnstone,I.,273,276Feller,W.,12Ferguson,T.S.,350Kohler,W.,257Finey,D.J..65Kahm,M.J.,177Fisher,R.A.,6,41,107,154,161.Kahneman.D.,5163,308,329Kaplan,E.L.,188,294Folks,L.J.,107Kaufman,L.,137Fourier,J.,266Kendall,M.G.,125Freedman,D.A,,350Kiefer.J.,184Friedman,J.,324,336,337,342Kimber.A.C.,188Friedman,Wl.,145Kimberlain,T.B.,340Frieman,S.W.,199Kolmogorov,A.N.,81FullerJr..E.R..199Krishnan,T.,308Kruskal.J..337Gasser,T.257Kruskal.W.H..115,142Gather,U.,212Kutner,h1.A.328Gelfand,A.E.,61Kvam.P.H..219.316George,E.0..108
Gilb,T..167Laird.N.M.,307Gilks.W.R.,62Lancaster,H.O.,108Good.I.J..108Laplace.P.S.,9Good,P.I.,302Lawless,J.F.,196Gosset.W.S.,20,154LawlorE.,318Graham,D.,167Lehmann,E.L..42.131.149Green.P.J.,255Lehmiller,G.S.,340 AUTHORlNDEX411LeroyA.M.,223.224Pearson,E.S.:39,163Lindley,D.V.,65Pearson,K.,6,39,81.154,161,206Liu.J.S.,356Pepys,S.,51Liu.2..169Phillips,D.P.,176Lo,A.Y..357Piotrowski,H.:163Luben,R.N.,136Pitman,E.J.G.,286Playfair,I&’.>206Mdller,H.G.,257Popper,K.,36MacEachern,S.N..357Preece,M.A.,258Madigan,D..65Mahalanobis,P.C..286Quade,D.,147Mallat,S..270Quenouille,M.H.,286,295Mandel,J.,124Quinlan,J.R.,345Mann.H.,115Quinn?G.D.,199Mantel,N.,168Quinn.J.13.;199Marks.S..327Quintana,F.A.,350Martz.H.,59Mattern.R.,249Radelet;ML,172Matui.I.,179Ramberg,.J.S..327McCullagh,P.,231Randles,R..H.,1.327McEarchern,S.hl.,356Rao,C.R.,308McFly,G..205Rasmussen,M.H.,162McKendrick,A.G.,307Raspe.R.E.!287McLachlan.G.J.,308Reilly,M.,318McNemar,Q.,164Reinsch,C!.H..255Meier,P.,188,294Richey,G.G.,124Mencken,H.L.,1Rickey,B.,141Mendel,G.,154Robert,C.,62Michelson,A,,110Robertson,T.,227Miller,L.A.,162Rock,I.,137Molinari,L.,257Roeder,K.,84Moore,D.H.,327Rosenblatt,F..333Mudholkar,G.S.,108RousseeuurP.J.,223,224Mueller,P.,350,356,357Rubin,D.B.!296,307Muenchow.G..188Ruggeri,F.!362Nachtsheim.C.J..328Sager,T.W..77Nadaraya,E.A.,244Samaniego,F.J.,316Nair,V.J.,193Sapatinas.,T..273,362Nelder,J.A.,231Scanlon,F.L.,136Neter,J.,328Scanlon,T.J.,136Schuler,F.>249O’Connell.J.W.,174Schmidt,G.,249Ogden,T.,266Schoenberg,I.J..251Olkin.I.,107Selke,T.,58Olshen,R.,342Sethuraman:J.:352Owen,A.B..199Shah,M.K.,177Shakespeare,W.,285Pabst.M..115Shao,Q.-M.:62Pareto.V.,23Shapiro,S.S.,93 
412AUTHORINDEXShen,X..266Tversky,A,,5Sigmon,K..6Twain,M.,xiiiSilverman.B.W.,211,255Simonoff,J.S.,154Utts,J.,108Singleton.N.,136Sinha,B.K.,77vanGompel,R.,121Siskel.G.,174Vidakovic,B.,266,273,362Slatkin.M.,308Voltaire,F.M.,6Smirnov.N.V..81.86vonBortkiewicz,L.,157Smith,A.F.M.,61vonMises,R.;91Smith,R.L.,188Spaeth,R.,125Waller,R.,59Spearman,C.E.,122Wallis,W.A,,142Speed,T.,309Walter,G.G.,266Spiegelhalter,D.J.,62Wasserman:L.,2Stephens,M.A,,90,96Watson,G.S.,244Stichler,R.D.,124Wedderburn,R.W.M.,231Stigler,S.M.,188Weierstrass,K.,253Stokes,S.L.,77Wellner,J.,193Stone,C.,342West:M.,357Stork.D.G.,324WestmacottM.H.,307Stuetzle,W.,337Wilcoxon,F.,115,127Sweeting,T.J.,188Wilk,M.B.,93Wilkinson,B.,107Thisted,R.A.,314Wilks,S.S.,43Thomas,A,,62Wilson,E.B.,40Tibshirani,R.J.,292,324,336Wolfowitz,J.,1,184Tingey,F.,83Wright,S.,69Tippet.L.H.C.,107Wright,T.F.,227Tiwari,R.C.,352Wu.C.F.J.,308Tsai,W.Y..308Tutz,G.,236Young,N.,33 SubjectIndexAcceleratedlifetesting,197pointeatimation,52Almost-sureconvergence.28posteriordistribution,49Analysisofvariance,116.141,142priordistribution,48Anderson-Darlingtest,89priorpredictive,49Anscombe’sdatasets.226Bayesiantesting.56Artificialintelligence,323ofprecisehypotheses.58Lindleyparadox,65BAMSwaveletshrinkage.361Benford’slaw,158BandwidthBernoullidistribution,14choiceof.210Besselfunctions.11optimal.210Betadistribution,20BayesBetafunction.10nonparametric,349Beta-binomialdistribution,24Bayesclassifier,325Bias,325Bayesdecisionrule,326Bayesfactor,57Binaryclassiiicationtrees,338Bayesformula,11growing,341Bayesiancomputation,61impurityfunction,339Bayesianstatistics.47crossentropy,339prediction,59Gini,339bootstrap.296Inisclassification,339conjugatepriors,54pruning,342expertopinion.51Binomialdistribution.4,14,32hyperperameter,48confidenceintervals.39hypothesistesting.56normalepproximation.40intervalestimation.55relationtoPoisson,15lossfunctions,53testofhypothesis,37 
414SUBJECTINDEXBinomialdistributionsClopper-Pearson,39toleranceintervals,74forquantiles,73Bootstrap,285,325Greenwood’sformula,193Bayesian,296Kaplan-Meierestimator,192biascorrection,292likelihoodratio,43fallibility,302normaldistribution,43nonparametric,287onesided,39percentile,287pointwise,193Bowman-Shentontest,94simultaneousband,193Boxkernelfunction,209twosided,39Brownianbridge,197Wald,40Brownianmotion,197Confirmationbias,5Byzantinecoins,299Conjugatepriors,54Conovertest,133,148Categoricaldata,153assumptions,133contingencytables,159Consistentestimators,29,34goodnessoffit,155Contingencytables,159,177Cauchydistribution,21TXCtables,161Censoring,185,212Fisherexacttest,163typeI,186fixedmarginals,163type11,186McNemartest,165Centrallimittheorem,1,29Convergence,28extended,31almostsure,28multinomialprobabilities,1.70indistribution,28Centralmoment,13inprobability,28Chancevariables,12Convexfunctions,11Characteristicfunctions:13,32Correlation,13ChisquaretestCorrelationcoefficientrulesofthumb,156Kendall’stau,125Chi-squaredistribution,19,32Pearson.116Chi-squaretest,146,155Spearman.116ClassificationCovariance,13binarytrees,338Covariate,195linearmodels,326Cram&-vonMisestest.91.97,112nearestneighbor,329,331Crediblesets,55neuralnetworks,333Crossvalidation,325supervised,324binaryclassificationtrees,343unsupervised.324testsample,325,330ClassificationandRegressionTrees(CART),trainingsample.325,330338Curseofdimensionality,331Cochran’stest,167Curvefitting,242Combinations,9Czsareanbirthstudy,236Compliancemonitoring,74Concavefunctions,11D’Agostino-Pearsontest,94Concomitant,186,191DataConditionalexpectation,14Blissbeetledata,239Conditionalprobability,11CaliforniaConfidenceintervals,39wellwaterlevel,278binomialproportion,39,40Fisher‘sirisdata,329 
SUBJECTINDEX415horse-kickfatalities,157normal,18Hubble’sdata,297Pareto,23interval,4,153Student’st,20Mendel’sdata,156uniform,20,32motorcycledata,249Weibull,59,60nominal,4,153discrete,14ordinal.4,153Bernoulli,14Datamining.323beta-binomial,24Deltamethod,29binomial,4,14Densityestimation,184,205Diracmass,59bandwidth,207geometric,16bivariate,213hypergeometric,16kernel,207multinomial,16,160,185,232adaptivekernels,210negativebinomial,15box,209Poisson,15,32Epanechnickov.209truncatedPoisson,320normal,209uniform,304triangular,209empirical,34smoothingfunction,208convergence,36Designedexperiments,141exponentialfamily,25Detrendingdata,250mixture,23Deviance,234EMalgorithmestimation.311Dirichletdistribution,22,350normal.32Dirichletprocess.350.351.354,356uniform,70conjugacy,353Dolphinsmixture,357Icelandic,162mixtureof,357Doubleexponentialdistribution,21.noninformativeprior,353361Discretedistributionsbeta-binomial,53EfficiencyDiscriminantanalysis,323,324asymptoticrelative,3,44,148Discriminationfunctionhypothesistesting,44linear.326nonparametricmethods.3quadratic,326EMAlgorithm,307Distributions,12definition.308continuous,17Empiricaldensityfunction,184,205beta.20Empiricaldistributionfunction.34.183Cauchy,21converg;ence,36chi-square,19.32Empiricallikelihood,43,198Dirichlet,22,297.350Empiricalprocess,197doubleexponential,21.361Epanechikovkernel,244exponential,17.32Epanechnickovkernelfunction,209F,23Estimation,33gamma,18consistent,34Gumbel.76,113unbiased,34inversegamma,22Expectation.12Laplace.21Expectedvalue,12Lorentz,21Expertopinion,51negative-Weibull.76Exponentialdistribution,17,32 
416SUBJECTlNDEXExponentialfamilyofdistributions,25Bowman-Shentontest,94Extremevaluetheory,75chi-square,155choosingatest,94Fdistribution,23Cram&-vonMisestest,91,97;Failurerate,17,27,195112Fisherexacttest,163D’Agostino-Pearsontest,94Formulasdiscretedata,155counting,10Lillieforstest,94geometricseries,10Shapiro-Wilkstest,93Newton’s,11twosampletest,86Sterling’s,10Greenwood’sformula,193Taylorseries,11Gumbeldistribution,76,113Foxnews,153Heisenberg’sprinciple,264Friedmanpairwisecomparisons,147Histogram,206Friedmantest,116Functionsbins,206Hogmanay,120Bessel,11Hubbletelescope,288beta,10Huberestimate,222characteristic,13,32Hypergeometricdistribution,16Poissondistribution,31Hypothesistesting,36convexandconcave,11p-values,37empiricaldistribution,34Bayesian,56gamma,10binomialproportion,37incompletebeta,10efficiency,44incompletegamma,10forvariances;148momentgenerating,13nullversusalternative,36Taylorseries,32significanclevel,36typeIerror,36Gammadistribution,18typeI1error,37Gammafunction,10unbiased,37Gasser-Mullerestimator,245Waldtest,37Generaltreeclassifiers,345AID,345Incompletebetafunction,10CART,345Incompletegammafunction,10CLS,345Independence,11,12hybrids,345Indicatorfunction,34oc1,345InequalitiesSE-trees,345Cauchy-Schwartz,13,26Generalizedlinearmodels;230Chebyshev,26algorithm,232Jensen,26linkfunctions,233Markov,26Geneticsstochastic,26Mendel’sfindings,154Inter-arrivaltimes,176Geometricdistribution,16Interpolatingsplines,252maximumlikelihoodestimator,42Intervalscaledata,4,153Geometricseries,10Inversegammadistribution.22Glivenko-Cantellitheorem,36,197Isotonicregression,227Goodnessoffit,81,156Anderson-Darlingtest,89Jackknife,295,325 
SUBJECTlNDEX417Jointdistributions,12Machinelearning,323Mann-Whitneytest,116,131,141k-out-of-nsystem,78equivalencetoWilcoxonsumrankKaplan-Meierestimator,185,188test.132confidenceinterval.192relationtoROCcurve,203Kendall’stau,125Mantel-Haenszeltest,167KernelMarkovchainMonteCarlo(MCMC),betafamily,24461Epanechikov,244MATLABKernelestimators.243ANOVA,392Kolmogorovstatistic,82,109datavisualization,380quantiles,84exportingdata,375Kolmogorov-Smirnovtest,82-84,90functions,374Kruskal-Wallistest,141,143.149,150implementation,5pairwisecomparisons,144importingdata,375matrixoperations,372L2convergence.28nonparametricfunctions.388Laplacedistribution.21regression,389Lawoftotalprobability.11statisticsfunctions,386Lawsoflargenumbers(LLN).29windows,369Leastabsoluteresidualsregression,222Maximumlikelihoodestimation,41Leastmediansquaresregression,224Cramer-Raolowerbound,42Leastsquaresregression,218deltamethod,42Leasttrimmedsquaresregression,223geometricdistribution,42Lennaimage,281invarianceproperty,42Likelihood.41logisticregression;328empirical.43negativebinomialdistribution,42maximumlikelihoodestimation,nonparametric,184,185,19141regularityconditions,42Likelihoodratio.43McNemartest,165confidenceintervals,43Meansquareconvergence,28nonparametric,198Meansquarederror,34,36Lillieforstest,94Median,13Linearclassification.326onesampletest,118Lineardiscriminationfunction,326twosampletest,119Linearrankstatistics,131hlemorylessproperty,16,18U-statistics,131Metaanalysis,106,157,169Links,233averagingp-values,108complementarylog-log.234Fisher’sinversex2method,107logit,234Tippet-Wilkinsonmethod,107probit,234Misclassificationerror,328Localpolynomialestimator,246Momentgeneratingfunctions,13LOESS,247Multinomialdistribution,16,185Logisticregression,327centrallimittheorem,170missclassificationerror,328MultiplecomparisonsLossfunctionsFriedmantest:147crossentropy,325Kruskal-Wallistest,144inneuralnetworks,335testofvariances,149zero-one,325,327Multivariatedistributions 
    Dirichlet, 22
    multinomial, 16
Nadaraya-Watson estimator, 244
Natural selection, 154
Nearest neighbor
    classification, 329
    constructing, 331
Negative binomial distribution, 15
    maximum likelihood estimator, 42
Negative Weibull distribution, 76
Neural networks, 323, 333
    activation function, 334, 336
    back-propagation, 334, 336
    feed-forward, 333
    hidden layers, 334
    implementing, 336
    layers, 333
    MATLAB toolbox, 336
    perceptron, 333
    training data, 335
    two-layer, 334
Newton's formula, 11
Nominal scale data, 4, 153
Nonparametric
    definition, 1
    density estimation, 205
    estimation, 183
Nonparametric Bayes, 349
Nonparametric Maximum likelihood estimation, 184, 185, 191
Nonparametric meta analysis, 106
Normal approximation
    central limit theorem, 19
    for binomial, 40
Normal distribution, 18
    confidence intervals, 43
    conjugacy, 49
    kernel function, 209
    mixture, 32
Normal probability plot, 97
Order statistics, 69, 115
    asymptotic distributions, 75
    density function, 70
    distribution function, 70
    EM Algorithm, 315
    extreme value theory, 75
    independent, 76
    joint distribution, 70
    maximum, 70
    minimum, 70, 191
Ordinal scale data, 4, 153
Over-dispersion, 24, 314
Overconfidence bias, 5
Parallel system, 70
Parametric assumptions, 115
    analysis of variance, 142
    criticisms, 3
    tests for, 81
Pareto distribution, 23
Pattern recognition, 323
Percentiles
    sample, 72
Perceptron, 333
Permutation tests, 298
Permutations, 9
Plug-in principle, 193
Poisson distribution, 15, 32
    in sign test, 120
    relation to binomial, 15
Poisson process, 176
Pool adjacent violators algorithm (PAVA), 230
Posterior, 49
    odds, 57
Posterior predictive distribution, 49
Power, 37, 38
Precision parameter, 64
Prior, 49
    noninformative, 353
    odds, 57
Prior predictive distribution, 49
Probability
    Bayes formula, 11
    conditional, 11
    continuity theorem, 31
    convergence
        almost sure, 28
        central limit theorem, 1, 29
        delta method, 29
        extended central limit theorem, 31
        Glivenko-Cantelli theorem, 36, 197
        in L2, 28
        in distribution, 28
        in Mean square, 28
        in probability, 28
        Laws of Large Numbers, 29
        Lindberg's condition, 31
        Slutsky's theorem, 29
    density function, 12
    independence, 11
    joint distributions, 12
    law of total probability, 11
    mass function, 12
Probability density function, 12
Probability plotting, 97
    normal, 97
    two samples, 98
Product limit estimator, 188
Projection pursuit, 337
Proportional hazards model, 196
Quade test, 147
Quadratic discrimination function, 326
Quantile-quantile plots, 98
Quantiles, 13
    estimation, 194
    sample, 72
Racial bigotry
    by scientists, 155
Random variables, 12
    characteristic function, 13
    conditional expectation, 14
    continuous, 12
    correlation, 13
    covariance, 13
    discrete, 12
    expected value, 12
    independent, 12
    median, 13
    moment generating function, 13
    quantile, 13
    variance, 13
Randomized block design, 116, 145
Range, 69
Rank correlations, 115
Rank tests, 115, 142
Ranked set sampling, 76
Ranks, 116, 141
    in correlation, 122
    linear rank statistics, 118
    properties, 117
Receiver operating characteristic, 202
Regression
    change point, 66
    generalized linear, 230
    isotonic, 227
    least absolute residuals, 222
    least median squares, 224
    least squares, 218
    least trimmed squares, 223
    logistic, 327
    robust, 221
    Sen-Theil estimator, 221
    weighted least squares, 223
Reinsch algorithm, 255
Relative risk, 162
Resampling, 286
Robust, 44, 141
Robust regression, 221
    breakdown point, 222
    leverage points, 224
ROC curve, 202
    area under curve, 203
Runs test, 100, 111
    normal approximation, 103
Sample range, 69
    distribution, 72
    tolerance intervals, 74
Semi-parametric statistics
    Cox model, 196
    inference, 195
Sen-Theil estimator, 221
Series system, 70, 191
Shapiro-Wilks test, 93
    coefficients, 94
    quantiles, 94
Shrinkage, 53
    Clopper-Pearson Interval, 40
Sign test, 116, 118
    assumptions, 118
    paired samples, 119
    ties in data, 122
Signal processing, 323
Significance level, 36
Simpson's paradox, 172
Slutsky's theorem, 29
Smirnov test, 86, 88, 110
    quantiles, 88
Smoothing splines, 254
Spearman correlation coefficient, 122
    assumptions, 124
    hypothesis testing, 124
    ties in data, 124
Splines
    interpolating, 252
    knots, 252
    natural, 252
    Reinsch algorithm, 255
    smoothing, 254
Statistical learning, 323
    loss functions, 325
        cross entropy, 325
        zero-one, 325
Sterling's formula, 10
Stochastic ordering
    failure rate, 27, 32
    likelihood ratio, 27, 32
    ordinary, 26
    uniform, 27, 32
Stochastic process, 197
Student's t-distribution, 20
Supervised learning, 324
Survival analysis, 196
Survivor function, 12
t-distribution, 20
t-test
    one sample, 116
    paired data, 116
Taylor series, 11, 32
Ties in data
    sign test, 122
    Spearman correlation coefficient, 124
    Wilcoxon sum rank test, 131
Tolerance intervals, 73
    normal approximation, 74
    sample range, 74
    sample size, 75
Triangular kernel function, 209
Transformation
    log-log, 327
    logistic, 327
    probit, 327
Trimmed mean, 291
Type I error, 36
Type II error, 37
Unbiased estimators, 34
Unbiased tests, 37
Uncertainty
    overconfidence bias, 5
    Voltaire's perspective, 6
Uniform distribution, 20, 32, 70, 78
Universal threshold, 276
Unsupervised learning, 324
Variance, 13, 19, 325
    k sample test, 148
    two sample test, 133
Wald test, 38
Wavelets, 263
    cascade algorithm, 271
    Coiflet family, 273
    Daubechies family, 264, 273
    filters, 264
    Haar basis, 266
    Symmlet family, 273
    thresholding, 264
        hard, 275, 278
        soft, 275
Weak convergence, 28
Weighted least squares regression, 223
Wilcoxon signed rank test, 116, 126
    assumptions, 127
    normal approximation, 127
    quantiles, 128
Wilcoxon sum rank test, 129
    equivalence to Mann-Whitney test, 132
    assumptions, 129
    comparison to t-test, 137
    ties in data, 131
Wilcoxon test, 116
Zero inflated Poisson (ZIP), 313
WILEY SERIES IN PROBABILITY AND STATISTICS
ESTABLISHED BY WALTER A. SHEWHART AND SAMUEL S. WILKS

Editors: David J. Balding, Noel A. C. Cressie, Nicholas I. Fisher, Iain M. Johnstone, J. B. Kadane, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Sanford Weisberg
Editors Emeriti: Vic Barnett, J. Stuart Hunter, David G. Kendall, Jozef L. Teugels

The Wiley Series in Probability and Statistics is well established and authoritative. It covers many topics of current research interest in both pure and applied statistics and probability theory. Written by leading statisticians and institutions, the titles span both state-of-the-art developments in the field and classical methods. Reflecting the wide range of current research in statistics, the series encompasses applied, methodological and theoretical statistics, ranging from applications and new techniques made possible by advances in computerized practice to rigorous treatment of theoretical approaches. This series provides essential and invaluable reading for all statisticians, whether in academia, industry, government, or research.
