Using R for Data Analysis and Graphics Introduction, Code and Commentary

Using R for Data Analysis and Graphics Introduction, Code and Commentary

ID:39907060

大小:2.58 MB

页数:104页

时间:2019-07-14

上传者:新起点
Using R for Data Analysis and Graphics Introduction, Code and Commentary_第1页
Using R for Data Analysis and Graphics Introduction, Code and Commentary_第2页
Using R for Data Analysis and Graphics Introduction, Code and Commentary_第3页
Using R for Data Analysis and Graphics Introduction, Code and Commentary_第4页
Using R for Data Analysis and Graphics Introduction, Code and Commentary_第5页
资源描述:

《Using R for Data Analysis and Graphics Introduction, Code and Commentary》由会员上传分享,免费在线阅读,更多相关内容在学术论文-天天文库

UsingRforDataAnalysisandGraphicsIntroduction,CodeandCommentaryJHMaindonaldCentreforBioinformationScience,AustralianNationalUniversity.©J.H.Maindonald2000,2004.Alicenceisgrantedforpersonalstudyandclassroomuse.Redistributioninanyotherformisprohibited.Languagesshapethewaywethink,anddeterminewhatwecanthinkabout(BenjaminWhorf.).10October20041 CambarvilleWhianWhianConondaleBulburinBellbirdByrangeryAllynRiverfemalemale606570752404tail83length6343235707footlength56065505earconchlength540432364040455055Lindenmayer,D.B.,Viggers,K.L.,Cunningham,R.B.,andDonnelly,C.F.:Morphologicalvariationamongpopulationsofthemountainbrushtailpossum,trichosuruscaninusOgibly(Phalangeridae:Marsupialia).AustralianJournalofZoology43:449-459,1995.possumn.1Anyofmanychieflyherbivorous,long-tailed,tree-dwelling,mainlyAustralianmarsupials,someofwhichareglidinganimals(e.g.brush-tailedpossum,flyingpossum).2amildlyscornfultermforaperson.3anaffectionatemodeofaddress.ndFromtheAustralianOxfordPaperbackDictionary,2ed,1996.2 TABLEOFCONTENTSIntroduction............................................................................................................................................................11.StartingUp..........................................................................................................................................................31.1GettingstartedunderWindows......................................................................................................................31.2UseofanEditorScriptWindow....................................................................................................................41.3AShortRSession..........................................................................................................................................51.4FurtherNotationalDetails...........................................................................................................................71.5On-lineHelp.................................................................................................................................................71.6TheLoadingorAttachingofDatasets..........................................................................................................71.7Exercise.........................................................................................................................................................82.AnOverviewofR...............................................................................................................................................92.1TheUsesofR.................................................................................................................................................92.2RObjects......................................................................................................................................................11*2.3Looping......................................................................................................................................................122.4Vectors.........................................................................................................................................................122.5DataFrames................................................................................................................................................152.6CommonUsefulFunctions...........................................................................................................................162.7MakingTables..............................................................................................................................................172.8TheSearchList............................................................................................................................................182.9FunctionsinR..............................................................................................................................................182.10MoreDetailedInformation........................................................................................................................202.11Exercises....................................................................................................................................................203.Plotting..............................................................................................................................................................213.1plot()andalliedfunctions...........................................................................................................................213.2Finecontrol–Parametersettings...............................................................................................................213.3Addingpoints,linesandtext........................................................................................................................223.4IdentificationandLocationontheFigureRegion......................................................................................253.5Plotsthatshowthedistributionofdatavalues............................................................................................253.6OtherUsefulPlottingFunctions..................................................................................................................293.7PlottingMathematicalSymbols..................................................................................................................303.8GuidelinesforGraphs.................................................................................................................................313.9Exercises......................................................................................................................................................313.10References..................................................................................................................................................324.Latticegraphics................................................................................................................................................334.1ExamplesthatPresentPanelsofScatterplots–Usingxyplot()..................................................................334.3Exercises......................................................................................................................................................355.Linear(MultipleRegression)ModelsandAnalysisofVariance.................................................................37i 5.1TheModelFormulainStraightLineRegression........................................................................................375.2RegressionObjects.......................................................................................................................................385.3ModelFormulae,andtheXMatrix.............................................................................................................385.4MultipleLinearRegressionModels.............................................................................................................405.5PolynomialandSplineRegression..............................................................................................................435.6UsingFactorsinRModels..........................................................................................................................465.7MultipleLines–DifferentRegressionLinesforDifferentSpecies.............................................................495.8aovmodels(AnalysisofVariance)..............................................................................................................505.9Exercises......................................................................................................................................................525.10References..................................................................................................................................................536.MultivariateandTree-BasedMethods..........................................................................................................556.1MultivariateEDA,andPrincipalComponentsAnalysis.............................................................................556.2ClusterAnalysis...........................................................................................................................................566.3DiscriminantAnalysis..................................................................................................................................566.4DecisionTreemodels(Tree-basedmodels).................................................................................................586.5Exercises......................................................................................................................................................586.6References....................................................................................................................................................58*7.RDataStructures...........................................................................................................................................597.1Vectors.........................................................................................................................................................597.2MissingValues.............................................................................................................................................597.3Dataframes..................................................................................................................................................607.4DataEntry....................................................................................................................................................617.5FactorsandOrderedFactors......................................................................................................................627.6OrderedFactors...........................................................................................................................................637.7Lists..............................................................................................................................................................64*7.8MatricesandArrays..................................................................................................................................657.9Exercises......................................................................................................................................................668.UsefulFunctions...............................................................................................................................................688.1ConfidenceIntervalsandTests....................................................................................................................688.2MatchingandOrdering...............................................................................................................................688.3StringFunctions...........................................................................................................................................688.4ApplicationofaFunctiontotheColumnsofanArrayorDataFrame.....................................................69*8.5aggregate()andtapply()............................................................................................................................69*8.7MergingDataFrames................................................................................................................................708.8Dates............................................................................................................................................................708.9Exercises......................................................................................................................................................719.WritingFunctionsandotherCode.................................................................................................................729.1SyntaxandSemantics...................................................................................................................................729.2IssuesfortheWritingandUseofFunctions...............................................................................................73ii 9.3FunctionsasaidstoDataManagement......................................................................................................739.4ASimulationExample..................................................................................................................................749.5Exercises......................................................................................................................................................75*10.GLM,andGeneralNon-linearModels......................................................................................................7810.1ATaxonomyofExtensionstotheLinearModel........................................................................................7810.2LogisticRegression....................................................................................................................................7910.3glmmodels(GeneralizedLinearRegressionModelling)..........................................................................8210.4ModelsthatIncludeSmoothSplineTerms................................................................................................8310.5SurvivalAnalysis........................................................................................................................................8310.6Non-linearModels.....................................................................................................................................8310.7ModelSummaries......................................................................................................................................8310.8FurtherElaborations.................................................................................................................................8310.9Exercises....................................................................................................................................................8410.10References................................................................................................................................................84*11.Multi-levelModels,RepeatedMeasuresandTimeSeries........................................................................8611.1Multi-LevelModels,IncludingRepeatedMeasuresModels.....................................................................8611.2TimeSeriesModels....................................................................................................................................9011.3Exercises....................................................................................................................................................9111.4References..................................................................................................................................................91*12.AdvancedProgrammingTopics..................................................................................................................9212.1.Methods.....................................................................................................................................................9212.2ExtractingArgumentstoFunctions..........................................................................................................9212.3ParsingandEvaluationofExpressions.....................................................................................................9312.4Plottingamathematicalexpression...........................................................................................................9412.4SearchingRfunctionsforaspecifiedtoken..............................................................................................9513.RResources....................................................................................................................................................9613.1RPackagesforWindows............................................................................................................................9613.2Literaturewrittenbyexpertusers..............................................................................................................9613.3TheR-helpelectronicmaildiscussionlist.................................................................................................9713.4CompetingSystems–XLISP-STAT...........................................................................................................9714.Appendix1......................................................................................................................................................9814.1DataSetsReferredtointheseNotes.........................................................................................................9814.2AnswerstoSelectedExercises...................................................................................................................98iii IntroductionThesenotesaredesignedtoallowindividualswhohaveabasicgroundingstatisticalmethodologytoworkthroughexamplesthatdemonstratetheuseofRforavarietyofdifferenttypesofdatamanipulation,graphicalpresentationandstatisticalanalysis.BooksthatprovideamoreextendedcommentaryonthemethodsillustratedintheseexamplesincludeMaindonaldandBraun(2003).TheRSystemRimplementsadialectoftheSlanguagethatwasdevelopedatAT&TBellLaboratoriesbyRickBecker,JohnChambersandAllanWilks.VersionsofRareavailable,atnocost,for32-bitversionsofMicrosoftWindowsforLinux,forUnixandforMacintoshOSX.(ThereareareolderversionsofRthatsupport8.6and9.)ItisavailablethroughtheComprehensiveRArchiveNetwork(CRAN).Webaddressesaregivenbelow.ThecitationforJohnChambers’1998AssociationforComputingMachinerySoftwareawardstatedthatShas“foreveralteredhowpeopleanalyze,visualizeandmanipulatedata.”TheRprojectenlargesontheideasandinsightsthatgeneratedtheSlanguage.HerearepointsrelatingtotheuseofRthatpotentialusersmightnote:•Rhasextensiveandpowerfulgraphicsabilities,thataretightlylinkedwithitsanalyticabilities.•TheRsystemisdevelopingrapidly.Newfeaturesandabilitiesappeareveryfewmonths.•Simplecalculationsandanalysescanbehandledstraightforwardly,albeit(inthecurrentversion)usingacommandlineinterface.Chapters1and2areintendedtogivetheflavourofwhatispossiblewithoutgettingdeeplyintotheRlanguage.Ifsimplemethodsproveinadequate,therecanberecoursetothehugerangeofmoreadvancedabilitiesthatRoffers.Adaptationofavailableabilitiesallowsevengreaterflexibility.•TheRcommunityiswidelydrawn,fromapplicationareaspecialistsaswellasstatisticalspecialists.Itisacommunitythatissensitivetothepotentialformisuseofstatisticaltechniquesandsuspiciousofwhatmightappeartobemindlessuse.Expectscepticismoftheuseofmodelsthatarenotsusceptibletosomeminimalformofdata-basedvalidation.•BecauseRisfree,usershavenorighttoexpectattention,onther-helplistorelsewhere,toqueries.Begratefulforwhateverhelpisgiven.•Pointandclickinterfacesareatanearlystageofdevelopment.WhileRisasreliableasanystatisticalsoftwarethatisavailable,andexposedtohigherstandardsofscrutinythanmostothersystems,therearetrapsthatcallforspecialcare.Manyofthemodelfittingroutinesareleadingedge.Thereisalimitedtraditionofexperienceofthelimitationsandpitfallsofsomeofthenewerabilities.Whateverthestatisticalsystem,andespeciallywhenthereissomeelementofcomplication,checkeachstepwithcare.Thereisnosubstituteforexperienceandexpertknowledge,evenwhenthestatisticalanalysistaskmayseemstraightforward.NeitherRnoranyotherstatisticalsystemwillgivethestatisticalexpertisethatisneededtousesophisticatedabilities,ortoknowwhennaïvemethodsarenotenough.ExperiencewiththeuseofRishowever,morethanwithmostsystems,likelytobeaneducationalexperience.HurrahfortheRdevelopmentteam!TheLookandFeelofR1Risafunctionallanguage.Thereisalanguagecorethatusesstandardformsofalgebraicnotation,allowingthecalculationssuchas2+3,or3^11.Beyondthis,mostcomputationishandledusingfunctions.TheactionofquittingfromanRsessionusesthefunctioncallq().Itisoftenpossibleanddesirabletooperateonobjects–vectors,arrays,listsandsoon–asawhole.Thislargelyavoidstheneedforexplicitloops,leadingtoclearercode.Section2.1.5hasanexample.1T++hestructureofanRprogramhassimilaritieswithprogramsthatarewritteninCorinitssuccessorsCandJava.ImportantdifferencesarethatRhasnoheaderfiles,mostdeclarationsareimplicit,therearenopointers,andvectorsoftextstringscanbedefinedandmanipulateddirectly.TheimplementationofRusesacomputingmodelthatisbasedontheSchemedialectoftheLISPlanguage.1 TheUseoftheseNotesThenotesaredesignedsothatuserscanruntheexamplesinthescriptfiles(ch1-2.R,ch3-4.R,etc.)usingthenotesascommentary.UnderWindowsanalternativetotypingthecommandsattheconsoleis,asdemonstratedinSection1.2,toopenadisplayfilewindowandtransferthecommandsacrossfromthethatwindow.Readersofthesenotesmayfindithelpfultohaveavailableforreferencethedocument:“AnIntroductiontoR”,writtenbytheRDevelopmentCoreTeam,suppliedwithRdistributionsandavailablefromCRANsites.TheRProjectTheinitialversionofRwasdevelopedbyRossIhakaandRobertGentleman,bothfromtheUniversityofAuckland.DevelopmentofRisnowoverseenbya`coreteam’ofaboutadozenpeople,widelydrawnfromdifferentinstitutionsworldwide.ThedevelopmentmodelissimilartothatofthepopularLinuxoperatingsystem.LikeLinux,Risan“opensource”system.Source-codeisavailableforinspectionorforadaptationtoothersystems.Inprinciple,ifitisunclearwhataroutinedoes,onecancheckthesourcecode.Exposingcodetothecriticalscrutinyofhighlyexpertusershasprovedanextremelyeffectivewaytoidentifybugsandotherinadequacies,andtoelicitideasforenhancement.Reportedbugsarecommonlyfixedinthenextminor-minorrelease,whichwillusuallyappearwithinamatterofweeks.NoviceuserswillnoticesmallbutoccasionallyimportantdifferencesbetweentheSdialectthatRimplementsandthecommercialS-PLUSimplementationofS.Thosewhowritetheirownsubstantialfunctionsand(moreimportantly)packageswillfindlargedifferences.PackagesthathavebeenwrittenforRofferabilitiesthatarebroadlycomparablewith,orinsomeinstancesgobeyond,thoseinS-PLUSlibraries.Thesegiveaccesstoup-to-datemethodologyfromleadingstatisticalresearchers.Rhasstronggraphicsabilities.ThelatticegraphicspackagegivesmanyoftheabilitiesthatareintheS-PLUStrellislibrary.Rprovidesalanguageenvironmentthatisattractiveforthedevelopmentofnewscientificcomputationaltools.Computer-intensivecomponentscan,ifcomputationalefficiencydemands,behandledbyacalltoafunctionthatiswrittenintheClanguage.TheRsystemmaystruggletohandleverylargedatasets.Dependingonavailablecomputermemory,theprocessingofadatasetcontainingonehundredthousandobservationsandperhapstwentyvariablesmaypressthelimitsofwhatRcaneasilyhandle.WebPagesandEmailListsForavarietyofofficialandcontributeddocumentation,forcopiesofvariousversionsofR,andforotherinformation,gotohttp://cran.r-project.organdfindthenearestCRAN(ComprehensiveRArchiveNetwork)mirrorsite.Australianusersmaywishtogodirectlytohttp://mirror.aarnet.edu.au/pub/CRANThereisnoofficialsupportforR.Ther-helpemaillistgivesaccesstoaninformalsupportnetworkthatcanbehighlyeffective.Detailsofther-helplist,andofotherliststhatservetheRcommunity,areavailablefromthewebsitefortheRprojectathttp://www.R-project.org/Besuretochecktheavailabledocumentationbeforepostingtotheemaillists.Emailarchivescanbesearchedforquestionsthatmayhavebeenpreviouslyanswered.DatasetsthatrelatetothesenotesCopydowntheRimagefilehttp://wwwmaths.anu.edu.au/~johnm/r/dsets/usingR.RDataSection1.6explainshowtoaccessthedatasets.Datasetsarealsoavailableindividually;gotohttp://wwwmaths.anu.edu.au/~johnm/r/dsets/individual-dsets/_________________________________________________________________________JeffWood(CMIS,CSIRO),AndreasRuckstuhl(TechnikumWinterthurIngenieurschule,Switzerland)andJohnBraun(UniversityofWesternOntario)gavemeexemplaryhelpingettingtheearlierS-PLUSversionofthisdocumentsomewherenearshipshapeform.JohnBraungavevaluablehelpwithproofreading,andprovidedseveralofthedatasetsandanumberoftheexercises.Itakefullresponsibilityfortheerrorsthatremain.Iamgrateful,also,tovariousscientistsnamedinthenoteswhohaveallowedmetousetheirdata.2 1.StartingUpRmustbeinstalledonyoursystem!Ifitisnot,followtheinstallationinstructionsappropriatetotheoperatingsystem.InstallationisnowespeciallystraightforwardforWindowsusers.CopydownthelatestSetupR.exefromtherelevantbasedirectoryonthenearestCRANsite,clickonitsicontostartinstallation,andfollowinstructions.Packagesthatdonotcomewiththebasedistributionmustbedownloadedandinstalledseparately.Itpaystohaveaseparateworkingdirectoryforeachmajorproject.Formoredetails.seetheREADMEfilethatisincludedwiththeRdistribution.UsersofMicrosoftWindowsmaywishtocreateaseparateiconforeach2suchworkingdirectory.Firstcreatethedirectory.Thenrightclick|copytocopyanexistingRicon,it,right3click|pastetoplaceacopyonthedesktop,rightclick|renameonthecopytorenameit,andthenfinallygotorightclick|propertiestosettheStartindirectorytobetheworkingdirectorythatwassetupearlier.1.1GettingstartedunderWindowsClickontheRicon.Orifthereismorethanoneicon,choosetheiconthatcorrespondstotheprojectthatisinhand.ForthisdemonstrationIwillclickonmyr-notesicon.IninteractiveuseunderMicrosoftWindowsthereareseveralwaystoinputcommandstoR.Figures1and2demonstratetwoofthepossibilities.Eitherorbothofthefollowingmaybeusedattheuser’sdiscretion:Forthemoment,wewilltypecommandsintothecommandwindow,atthecommandlineprompt.Figure1showsthecommandwindowasitappearswhenRhasjustbeenstarted,forversion2.0.0.Thisis,thetimeofwriting,thelatestversion.Fig.1:TheupperleftportionoftheRconsole(commandline)window.Figure1showstheconsolewindowimmediatelyafteropening.Thecommandlineprompt,i.e.the>,isaninvitationtostarttypinginyourcommands.Forexample,typein2+2andpresstheEnterkey.Hereiswhatappearsonthescreen:>2+2[1]4>Heretheresultis4.The[1]says,alittlestrangely,“firstrequestedelementwillfollow”.Here,thereisjustoneelement.The>indicatesthatRisreadyforanothercommand.Forlaterreference,notethattheexitorquitcommandis>q()2Thisisashortcutfor“rightclick,thenleftclickonthecopymenuitem”.3Enterthenameofyourchoiceintothenamefield.Foreaseofremembering,chooseanamethatcloselymatchesthenameoftheworkspacedirectory,perhapsthenameitself.3 AlternativesaretoclickontheFilemenuandthenonExit,ortoclickontheinthetoprighthandcorneroftheRwindow.Therewillbeamessageaskingwhethertosavetheworkspaceimage.ClickingYes(thesafeoption)willsavealltheobjectsthatremainintheworkspace–anythatwerethereatthestartofthesessionandanythathavebeenaddedsince.1.2UseofanEditorScriptWindowThescreensnapshotinFigure2showsascriptfilewindow.ThisallowsinputtoRofstatementsfromafilethathasbeensetupinadvance,orthathavebeentypedorcopiedintothewindow.Togetascriptfilewindow,gototheFilemenu.Ifanewblankwindowisrequired,clickonNewscript.Toloadanexistingfile,clickonOpenscript…;youwillbeaskedforthenameofafilewhosecontentsarethendisplayedinthewindow.InFigure2thefilewasfirstSteps.R.HighlightthecommandsthatareintendedforinputtoR.Clickonthe`Runlineorselection’icon,whichisthemiddleiconofthescriptfileeditortoolbarinFigs.2and3,tosendcommandstoR.Fig.2:ThefocusisonanRdisplayfilewindow,withtheconsolewindowinthebackground.Fig.3:Thisshowsthefiveiconsthatappearwhenthefocusisonascriptfilewindow.Theiconsare,startingfromtheleft:Openscript,Savescript,Runlineorselection,Returnfocustoconsole,andPrint.Thetextinascriptfilewindowcanbeedited,ornewtextadded.Displayfilewindows,whichhaveasomewhatsimilarsetoficonsbutdonotallowediting,areanotherpossibility.UnderUnix,thestandardformofinputisthecommandlineinterface.UnderbothMicrosoftWindowsand4Linux(orUnix),afurtherpossibilityistorunRfromwithintheemacseditor.ThisworksmuchbetterunderLinix/UnixthanunderWindows.UnderMicrosoftWindows,anattractiveoptionistouseautilitythatis5designedforusewiththesharewareWinEdteditor.4Thisrequiresbothemacs,andESSwhichrunsunderemacs.Botharefree.LookunderSoftware|OtherontheCRANwebpage.4 1.3AShortRSessionWewillreadintoRafilethatholdsthepopulationfiguresforAustralianstatesandterritories,andthetotalpopulation,atvarioustimessince1917.Wewilluseinformationfromthisfiletocreateagraph.Hereistheinformationinthefile:YearNSWVic.QldSAWATas.NTACTAust.1917190414096834403061935349411927240217278735653922114861821937269318539935894572336116836194729852055110664650225711177579195736252656141387368832621389640196742953274170011108793756210311799197750023837213012861204415104214141921987561742102675139314964491582651626419976274460534011480179847418731018532Thefollowingreadsinthedatafromthefileaustpop.txtonadiskindrivea:>austpop<-read.table(“a:/austpop.txt”,header=T)The<-isaleftdiamondbracket(<)followedbyaminussign(-).Itmeans“isassignedto”.Useofheader=TcausesRtouse=thefirstlinetogetheaderinformationforthecolumns.Ifcolumnheadingsarenotincludedinthefile,theargumentcanbeomitted.Nowtypeinaustpopatthecommandlineprompt,displayingtheobjectonthescreen:>austpopYearNSWVic.QldSAWATas.NTACTAust.11917190414096834403061935349412192724021727873565392211486182...WewilllearnlaterthataustpopisaspecialformofRobject,knownasadataframe.Dataframesthatconsistentirelyofnumericdatahaveastructurethatissimilartothatofnumericmatrices.HereisaplotoftheACTpopulationbetween1917and1997(Figure4).003002TCA00105019201940196019802000YearFigure4:ACTpopulation,atvarioustimesbetween1917and1997.Wefirstofallremindourselvesofthecolumnnames:>names(austpop)[1]"Year""NSW""Vic.""Qld""SA""WA""Tas.""NT"[9]"ACT""Aust."Asimplewaytogettheplotis:5TheR-WinEdtutility,whichisfree,isa“plugin”forWinEdt.Forlinkstotherelevantwebpages,forWinEdtandR-WinEdt,lookunderSoftware|OtherontheCRANwebpage.5 >plot(ACT~Year,data=austpop,pch=16)Theoptionpch=16setstheplottingcharactertosolidblackdots.Figure4showsthegraph:Thisplotcanbeimprovedgreatly.Wecanspecifymoreinformativeaxislabels,changesizeofthetextandoftheplottingsymbol,andsoon.1.3.1EntryofDataattheCommandLineAdataframeisarectangulararrayofcolumnsofdata.Herewewillhavetwocolumns,andbothcolumnswillbenumeric.Thefollowingdatagives,foreachamountbywhichanelasticbandisstretchedovertheendofaruler,thedistancethatthebandmovedwhenreleased:stretch46544850444252distance148182173166109141166Thefunctiondata.frame()canbeusedtoinputthese(orother)datadirectlyatthecommandline.Wewillgivethedataframethenameelasticband:elasticband<-data.frame(stretch=c(46,54,48,50,44,42,52),distance=c(148,182,173,166,109,141,166))1.3.2Entryand/oreditingofdatainaneditorwindowToeditthefileelasticbandinaspreadsheet-likeformat,typeelasticband<-edit(elasticband)Figure5:Editorwindow,showingthedataframeelasticband.1.3.3Optionsforuseofread.table()Thefunctionread.table()takes,optionallyvariousparametersadditionaltothefilenamethatholdsthedata.Specifyheader=TRUEifthereisaninitialrowofheadernames.Thedefaultisheader=FALSE.Inadditionuserscanspecifytheseparatorcharacterorcharacters.Commandalternativestothedefaultuseofaspacearesep=","andsep="t".Thislastchoicemakestabsseparators.Similarly,userscancontroloverthechoiceofmissingvaluecharacterorcharacters,whichbydefaultisNA.Ifthemissingvaluecharacterisaperiod(“.”),specifyna.strings=".".Thereareseveralvariantsofread.table()thatdifferonlyinhavingdifferentdefaultparametersettings.Noteinparticularread.csv(),whichhassettingsthataresuitableforcommadelimited(csv)filesthathavebeengeneratedfromExcelspreadsheets.Ifread.table()detectsthatlinesintheinputfilehavedifferentnumbersoffields,datainputwillfail,withanerrormessagethatdrawsattentiontothediscrepancy.Itisthenoftenusefultousethefunctioncount.fields()toreportthenumberoffieldsthatwereidentifiedoneachseparatelineofthefile.6 1.3.4Optionsforplot()andalliedfunctionsThefunctionplot()andrelatedfunctionsacceptparametersthatcontroltheplottingsymbol,andthesizeandcolouroftheplottingsymbol.Detailswillbegiveninsection3.3.1.4FurtherNotationalDetailsAsnotedearlier,thecommandlinepromptis>6Rcommands(expressions)aretypedinfollowingthisprompt.Thereisalsoacontinuationprompt,usedwhen,followingacarriagereturn,thecommandisstillnotcomplete.Bydefault,thecontinuationpromptis+Inthesenotes,weoftencontinuecommandsovermorethanoneline,butomitthe+thatwillappearonthecommandswindowifthecommandistypedinasweshowit.ForthenamesofRobjectsorcommands,caseissignificant.ThusAustpopisdifferentfromaustpop.Forfilenameshowever,theMicrosoftWindowsconventionsapply,andcasedoesnotdistinguishfilenames.OnUnixsystemslettersthathaveadifferentcasearetreatedasdifferent.Anythingthatfollowsa#onthecommandlineistakenascommentandignoredbyR.Note:Recallthat,inordertoquitfromtheRsessionwehadtotypeq().Thisisbecauseqisafunction.Typingqonitsown,withouttheparentheses,displaysthetextofthefunctiononthescreen.Tryit!1.5On-lineHelpTogetahelpwindow(underRforWindows)withalistofhelptopics,type:>help()InRforWindows,analternativeistoclickonthehelpmenuitem,andthenusekeywordstodoasearch.TogethelponaspecificRfunction,e.g.plot(),typein>help(plot)Thetwosearchfunctionshelp.search()andapropos()canbeahugehelpinfindingwhatonewants.Examplesoftheiruseare:>help.search("matrix")(Thislistsallfunctionswhosehelppageshaveatitleoraliasinwhichthetextstring“matrix”appears.)>apropos(“matrix”)(Thislistsallfunctionnamesthatincludethetext“matrix”.)Thefunctionhelp.start()opensabrowserwindowthatgivesaccesstothefullrangeofdocumentationforsyntax,packagesandfunctions.ExperimentationoftenhelpsclarifythepreciseactionofanRfunction.1.6TheLoadingorAttachingofDatasetsTherecommendedwaytoaccessdatasetsthataresuppliedforusewiththesenotesistoattachthefileusingR.RData.,availablefromtheauthor'swebpage.Placethisfileintheworkingdirectoryand,fromwithintheRsession,type:>attach("usingR.RData")Filesthatarementionedinthesenotes,andthatarenotsuppliedwithR(e.g.,fromthedatasetsorMASSpackages)shouldthenbeavailablewithoutneedforanyfurtheraction.Userscanalsoload(useload())orattach(useattach())specificfiles.Thesehaveasimilareffect,thedifferencebeingthatwithattach()datasetsareloadedintomemoryonlywhenrequiredforuse.6Multiplecommandsmayappearontheoneline,withthesemicolon(;)astheseparator.7 Distinguishbetweentheattachingofimagefilesandtheattachingofdataframes.Theattachingofdataframeswillbediscussedlaterinthesenotes.1.7Exercise1.Inthedataframeelasticbandfromsection1.3.1,plotdistanceagainststretch.2.Thefollowingtenobservations,takenduringtheyears1970-79,areonOctobersnowcoverforEurasia.(Snowcoverisinmillionsofsquarekilometers):yearsnow.cover19706.5197112.0197214.9197310.0197410.719757.9197621.9197712.5197814.519799.2i.EnterthedataintoR.[Section1.3.1showedonewaytodothis.Tosavekeystrokes,enterthesuccessiveyearsas1970:1979]ii.Plotsnow.coverversusyear.iiiUsethehist()commandtoplotahistogramofthesnowcovervalues.iv.Repeatiiandiiiaftertakinglogarithmsofsnowcover.3.Inputthefollowingdata,ondamagethathadoccurredinspaceshuttlelaunchespriortothedisastrouslaunchofJan281986.Thesearethedata,for6launchesoutof24,thatwereincludedinthepre-launchchartsthatwereusedindecidingwhethertoproceedwiththelaunch.(Dataforthe23launcheswhereinformationisavailableisinthedatasetoringsthataccompaniesthesenotes.)TemperatureErosionBlowbyTotal(F)incidentsincidentsincidents533255710163101701017010175021Enterthesedataintoadataframe,with(forexample)columnnamestemperature,erosion,blowbyandtotal.(ReferbacktoSection1.3.1).Plottotalincidentsagainsttemperature.8 2.AnOverviewofR2.1TheUsesofR2.1.1Rmaybeusedasacalculator.Revaluatesandprintsouttheresultofanyexpressionthatonetypesinatthecommandlineintheconsolewindow.Expressionsaretypedfollowingtheprompt(>)onthescreen.Theresult,ifany,appearsonsubsequentlines>2+2[1]4>sqrt(10)[1]3.162278>2*3*4*5[1]120>1000*(1+0.075)^5-1000#Intereston$1000,compoundedannually[1]435.6293>#at7.5%p.a.forfiveyears>pi#Rknowsaboutpi[1]3.141593>2*pi*6378#CircumferenceofEarthatEquator,inkm;radiusis6378km[1]40074.16>sin(c(30,60,90)*pi/180)#Convertanglestoradians,thentakesin()[1]0.50000000.86602541.00000002.1.2RwillprovidenumericalorgraphicalsummariesofdataAspecialclassofobject,calledadataframe,storesrectangulararraysinwhichthecolumnsmaybevectorsofnumbersorfactorsortextstrings.DataframesarecentraltothewaythatallthemorerecentRroutinesprocessdata.Fornow,thinkofdataframesasmatrices,wheretherowsareobservationsandthecolumnsarevariables.7Asafirstexample,considerthedataframehillsthataccompaniesthesenotes.Thishasthreecolumns(variables),withthenamesdistance,climb,andtime.Typinginsummary(hills)givessummaryinformationonthesevariables.Thereisonecolumnforeachvariable,thus:>load("hills.Rdata")#Assumeshills.Rdataisintheworkingdirectory>summary(hills)distanceclimbtimeMin.:2.000Min.:300Min.:15.951stQu.:4.5001stQu.:7251stQu.:28.00Median:6.000Median:1000Median:39.75Mean:7.529Mean:1815Mean:57.883rdQu.:8.0003rdQu.:22003rdQu.:68.62Max.:28.000Max.:7500Max.:204.60Wemayforexamplerequireinformationonrangesofvariables.Thustherangeofdistances(firstcolumn)isfrom2milesto28miles,whiletherangeoftimes(thirdcolumn)isfrom15.95(minutes)to204.6minutes.Wewilldiscussgraphicalsummariesinthenextsection.7ThereisalsoaversionintheVenablesandRipleyMASSlibrary.9 2.1.3RhasextensivegraphicalabilitiesThemainRgraphicsfunctionisplot().Inadditiontoplot()therearefunctionsforaddingpointsandlinestoexistinggraphs,forplacingtextatspecifiedpositions,forspecifyingtickmarksandticklabels,forlabellingaxes,andsoon.Therearevariousotheralternativehelpfulformsofgraphicalsummary.Ahelpfulgraphicalsummaryforthehillsdataframeisthescatterplotmatrix,showninFigure6.Forthis,type:>pairs(hills)100040007000525distance1500070004climb0001051time055152550150Figure6:ScatterplotmatrixfortheScottishhillracedata2.1.4RwillhandleavarietyofspecificanalysesTheexamplesthatwillbegivenarecorrelationandregression.Correlation:Wecalculatethecorrelationmatrixforthehillsdata:>options(digits=3)>cor(hills)distanceclimbtimedistance1.0000.6520.920climb0.6521.0000.805time0.9200.8051.000Supposewewishtocalculatelogarithms,andthencalculatecorrelations.Wecandoallthisinonestep,thus:>cor(log(hills))distanceclimbtimedistance1.000.7000.890climb0.701.0000.724time0.890.7241.000UnfortunatelyRwasnotcleverenoughtorelabeldistanceaslog(distance),climbaslog(climb),andtimeaslog(time).Noticethatthecorrelationsbetweentimeanddistance,andbetweentimeandclimb,havereduced.Whyhasthishappened?StraightLineRegression:Hereisastraightlineregressioncalculation.Onespecifiesanlm(=linearmodel)expression,whichRevaluates.Thedataarestoredinthedataframeelasticbandthataccompaniesthesenotes.Thevariablenamesarethenamesofcolumnsinthatdataframe.Thecommandasksfortheregressionofdistancetravelledbytheelasticband(distance)ontheamountbywhichitisstretched(stretch).10 >plot(distance~stretch,data=elasticband,pch=16)>elastic.lm<-lm(distance~stretch,data=elasticband)>lm(distance~stretch,data=elasticband)Call:lm(formula=distance~stretch,data=elasticband)Coefficients:(Intercept)stretch-63.5714.554Morecompleteinformationisavailablebytyping>summary(lm(distance~stretch,data=elasticband))Tryit!2.1.5RisanInteractiveProgrammingLanguageWecalculatetheFahrenheittemperaturesthatcorrespondtoCelsiustemperatures25,26,…,30:>celsius<-25:30>fahrenheit<-9/5*celsius+32>conversion<-data.frame(Celsius=celsius,Fahrenheit=fahrenheit)>print(conversion)CelsiusFahrenheit12577.022678.832780.642882.452984.263086.0Wecouldalsohaveusedaloop.Ingeneralitispreferabletoavoidloopswhenever,ashere,thereisagoodalternative.Loopsmayinvolveseverecomputationaloverheads.2.2RObjectsAllRentities,includingfunctionsanddatastructures,existasobjects.Theycanallbeoperatedonasdata.Typeinls()toseethenamesofallobjectsinyourworkspace.Analternativetols()isobjects().In8bothcasesthereisprovisiontospecifyaparticularpattern,e.g.startingwiththeletter`p’.Typingthenameofanobjectcausestheprintingofitscontents.Trytypingq,mean,etc.Inalongsession,itmakessensetosavethecontentsoftheworkingdirectoryfromtimetotime.Itisalsopossibletosaveindividualobjects,orcollectionsofobjectsintoanamedimagefile.Someofthepossibilitiesare:save.image()#Savecontentsofworkspace,intothefile.RDatasave.image(file="archive.RData")#Saveintothefilearchive.RDatasave(celsius,fahrenheit,file="tempscales.RData")Imagefiles,fromtheworkingdirectoryor(withthepathspecified)fromanotherdirectory,canbeattached,thusmakingobjectsinthefileavailableonrequest.Forexampleattach("tempscales.RData")ls(pos=2)#Checkthecontentsofthefilethathasbeenattached8Typeinhelp(ls)andhelp(grep)togetdetails.Thepatternmatchingconventionsarethoseusedforgrep(),whichismodelledontheUnixgrepcommand.11 Theparameterposgivesthepositiononthesearchlist.Thesearchlistisdiscussedlaterinthischapter,inSection2.9.Important:Onquitting,Rofferstheoptionofsavingtheworkspaceimage,bydefaultinthefile.RDataintheworkingdirectory.Thisallowstheretention,foruseinthenextsessioninthesameworkspace,anyobjectsthatwerecreatedinthecurrentsession.Carefulhousekeepingmaybeneededtodistinguishbetweenobjectsthataretobekeptandobjectsthatwillnotbeusedagain.Beforetypingq()toquit,userm()toremoveobjectsthatarenolongerrequired.Savingtheworkspaceimagewillthensaveeverythingremains.Theworkspaceimagewillbeautomaticallyloadeduponstartinganothersessioninthatdirectory.9*2.3LoopingInRthereisoftenabetteralternativetowritinganexplicitloop.Wherepossible,useoneofthebuilt-in10functionstoavoidexplicitlooping.Asimpleexampleofaforloopisfor(iin1:10)print(i)Hereisanotherexampleofaforloop,todoinacomplicatedwaywhatwedidverysimplyinsection2.1.5:>#CelsiustoFahrenheit>for(celsiusin25:30)+print(c(celsius,9/5*celsius+32))[1]2577[1]26.078.8[1]27.080.6[1]28.082.4[1]29.084.2[1]30862.3.1MoreonloopingHereisalong-windedwaytosumthethreenumbers31,51and91:>answer<-0>for(jinc(31,51,91)){answer<-j+answer}>answer[1]173Thecalculationiterativelybuildsuptheobjectanswer,usingthesuccessivevaluesofjlistedinthevector(31,51,91).i.e.Initially,j=31,andanswerisassignedthevalue31+0=31.Thenj=51,andanswerisassignedthevalue51+31=82.Finally,j=91,andanswerisassignedthevalue91+81=173.Thentheprocedureends,andthecontentsofanswercanbeexaminedbytypinginanswerandpressingtheEnterkey.Thereisamucheasier(andbetter)waytodothiscalculation:>sum(c(31,51,91))[1]173SkilledRusershavelimitedrecoursetoloops.Thereareoften,asintheexampleabove,betteralternatives.2.4VectorsExamplesofvectorsarec(2,3,5,2,7,1)9Asterisks(*)identifysectionsthataremoretechnicalandmightbeomittedatafirstreading10Otherloopingconstructsare:repeat##breakmustappearsomewhereinsidetheloopwhile(x>0)HereisanRstatement,orasequenceofstatementsthatareenclosedwithinbraces12 3:10#Thenumbers3,4,..,10c(T,F,F,F,T,T,F)c(”Canberra”,”Sydney”,”Newcastle”,”Darwin”)11Vectorsmayhavemodelogical,numericorcharacter.Thefirsttwovectorsabovearenumeric,thethirdislogical(i.e.avectorwithelementsofmodelogical),andthefourthisastringvector(i.e.avectorwithelementsofmodecharacter).Themissingvaluesymbol,whichisNA,canbeincludedasanelementofavector.2.4.1Joining(concatenating)vectorsThecinc(2,3,5,7,1)aboveisanacronymfor“concatenate”,i.e.themeaningis:“Jointhesenumberstogetherintoavector.Existingvectorsmaybeincludedamongtheelementsthataretobeconcatenated.Inthefollowingweformvectorsxandy,whichwethenconcatenatetoformavectorz:>x<-c(2,3,5,2,7,1)>x[1]235271>y<-c(10,15,12)>y[1]101512>z<-c(x,y)>z[1]235271101512Theconcatenatefunctionc()mayalsobeusedtojoinlists.2.4.2SubsetsofVectors12Therearetwocommonwaystoextractsubsetsofvectors.1.Specifythenumbersoftheelementsthataretobeextracted,e.g.>x<-c(3,11,8,15,12)#Assigntoxthevalues3,11,8,15,12>x[c(2,4)]#Extractelements(rows)2and4[1]1115Onecanusenegativenumberstoomitelements:>x<-c(3,11,8,15,12)>x[-c(2,3)][1]315122.Specifyavectoroflogicalvalues.TheelementsthatareextractedarethoseforwhichthelogicalvalueisT.Thussupposewewanttoextractvaluesofxthataregreaterthan10.>x>10#Thisgeneratesavectoroflogical(TorF)[1]FTFTT>x[x>10]11Itwill,laterinthesenotes,beimportanttoknowthe“class”ofsuchobjects.Thisdetermineshowthemethodusedbysuchgenericfunctionsasprint(),plot()andsummary().Usethefunctionclass()todeterminetheclassofanobject.12Athirdmoresubtlemethodisavailablewhenvectorshavenamedelements.Onecanthenuseavectorofnamestoextracttheelements,thus:>c(Andreas=178,John=185,Jeff=183)[c("John","Jeff")]JohnJeff18518313 [1]111512Arithmeticrelationsthatmaybeusedintheextractionofsubsetsofvectorsare<,<=,>,>=,==,and!=.Thefirstfourcomparemagnitudes,==testsforequality,and!=testsforinequality.2.4.3TheUseofNAinVectorSubscriptsNotethatanyarithmeticoperationorrelationthatinvolvesNAgeneratesanNA.Sety<-c(1,NA,3,0,NA)Bewarnedthaty[y==NA]<-0leavesyunchanged.Thereasonisthatallelementsofy==NAevaluatetoNA.Thisdoesnotselectanelementofy,andthereisnoassignment.ToreplaceallNAsby0,usey[is.na(y)]<-02.4.4FactorsAfactorisaspecialtypeofvector,storedinternallyasanumericvectorwithvalues1,2,3,k.Thevaluekisthe13numberoflevels.Anattributestablegivesthe‘level’foreachintegervalue.Factorsprovideacompactwaytostorecharacterstrings.Theyarecrucialintherepresentationofcategoricaleffectsinmodelandgraphicsformulae.Theclassattributeofafactorhas,notsurprisingly,thevalue“factor”.Considerasurveythathasdataon691femalesand692males.Ifthefirst691arefemalesandthenext692males,wecancreateavectorofstringsthatthatholdsthevaluesthus:gender<-c(rep(“female”,691),rep(“male”,692))(Theusageisthatrep(“female”,691)creates691copiesofthecharacterstring“female”,andsimilarlyforthecreationof692copiesof“male”.)Wecanchangethevectortoafactor,byentering:gender<-factor(gender)Internallythefactorgenderisstoredas6911’s,followedby6922’s.Ithasstoredwithitatablethatlookslikethis:1female2maleOncestoredasafactor,thespacerequiredforstorageisreduced.Wheneverthecontextseemstodemandacharacterstring,the1istranslatedinto“female”andthe2into“male”.Thevalues“female”and“male”arethelevelsofthefactor.Bydefault,thelevelsareinalphanumericorder,sothat“female”precedes“male”.Hence:>levels(gender)#Assumesgenderisafactor,createdasabove[1]"female""male"Theorderofthelevelsinafactordeterminestheorderinwhichthelevelsappearingraphsthatusethisinformation,andintables.Tocause“male”tocomebefore“female”,usegender<-relevel(gender,ref=“male”)Analternativeisgender<-factor(gender,levels=c(“male”,“female”))Thislastsyntaxisavailablebothwhenthefactorisfirstcreated,orlaterwhenonewishestochangetheorderoflevelsinanexistingfactor.Incorrectspellingofthelevelnameswillgenerateanerrormessage.Trygender<-factor(c(rep(“female”,691),rep(“male”,692)))table(gender)13Theattributes()functionmakesitpossibletoinspectattributes.Forexampleattributes(factor(1:3))Thefunctionlevels()givesabetterwaytoinspectfactorlevels.14 gender<-factor(gender,levels=c(“male”,“female”))table(gender)gender<-factor(gender,levels=c(“Male”,“female”))#Erroneous-"male"rowsnowholdmissingvaluestable(gender)rm(gender)#Removegender2.5DataFramesDataframesarefundamentaltotheuseoftheRmodellingandgraphicsfunctions.Adataframeisageneralisationofamatrix,inwhichdifferentcolumnsmayhavedifferentmodes.Allelementsofanycolumnmusthoweverhavethesamemode,i.e.allnumericorallfactor,orallcharacter.AmongthedatasetsthataresuppliedtoaccompanythesenotesisonecalledCars93.summary,createdfrominformationintheCars93datasetintheVenablesandRipleyMASSpackage.Hereitis:>Cars93.summaryMin.passengersMax.passengersNo.of.carsabbrevCompact4616CLarge6611LMidsize4622MSmall4521SmSporty2414SpVan789VThedataframehasrowlabels(accesswithrow.names(Cars93.summary))Compact,Large,...Thecolumnnames(accesswithnames(Cars93.summary))areMin.passengers(i.e.theminimumnumberofpassengersforcarsinthiscategory),Max.passengers,No.of.cars.,andabbrev.Thefirstthreecolumnshavemodenumeric,andthefourthhasmodecharacter.Columnscanbevectorsofanymode.Thecolumnabbrevcouldequallywellbestoredasafactor.14AnyofthefollowingwillpickoutthefourthcolumnofthedataframeCars93.summary,thenstoringitinthevectortype.type<-Cars93.summary$abbrevtype<-Cars93.summary[,4]type<-Cars93.summary[,”abbrev”]type<-Cars93.summary[[4]]#Taketheobjectthatisstored#inthefourthlistelement.2.5.1Dataframesaslists15Adataframeisalistofcolumnvectors,allofequallength.Justaswithanyotherlist,subscriptingextractsalist.ThusCars93.summary[4]isadataframewithasinglecolumn,whichisthefourthcolumnvectorofCars93.summary.Asnotedabove,useCars93.summary[[4]]orCars93.summary[,4]toextractthecolumnvector.Theuseofmatrix-likesubscripting,e.g.Cars93.summary[,4]orCars93.summary[1,4],takesadvantageoftherectangularstructureofdataframes.14AlsolegalisCars93.summary[2].ThisgivesadataframewiththesinglecolumnType.15Ingeneralformsoflist,elementsthatareofarbitrarytype.Theymaybeanymixtureofscalars,vectors,functions,etc.15 2.5.2InclusionofcharacterstringvectorsindataframesWhendataareinputusingread.table(),orwhenthedata.frame()functionisusedtocreatedataframes,vectorsofcharacterstringsarebydefaultturnedintofactors.Theparametersettingas.is=T,availablebothwithread.table()andwithdata.frame(),willifneededensurethatcharacterstringsareinputwithoutsuchconversion.2.5.3Built-indatasetsWewilloftenusedatasetsthataccompanyoneoftheRpackages,usuallystoredasdataframes.Onesuchdataframe,inthedatasetspackage,istrees,whichgivesgirth,heightandvolumefor31BlackCherryTrees.>data(trees)#Loaddataset(notneededforversions>=2.0.0)Hereissummaryinformationonthisdataframe>summary(trees)GirthHeightVolumeMin.:8.30Min.:63Min.:10.201stQu.:11.051stQu.:721stQu.:19.40Median:12.90Median:76Median:24.20Mean:13.25Mean:76Mean:30.173rdQu.:15.253rdQu.:803rdQu.:37.30Max.:20.60Max.:87Max.:77.00(InversionsofRpriorto2.0.0,itwillbenecessarytospecifydata(trees)inordertobrindthisdatasetintotheworkspace.)16Typedata()togetalistofbuilt-indatasetsinthepackagesthathavebeenloaded.2.6CommonUsefulFunctionsprint()#PrintsasingleRobjectcat()#Printsmultipleobjects,oneaftertheotherlength()#Numberofelementsinavectororofalistmean()median()range()unique()#Givesthevectorofdistinctvaluesdiff()#Replaceavectorbythevectoroffirstdifferences#N.B.diff(x)hasonelesselementthanxsort()#Sortelementsintoorder,butomittingNAsorder()#x[order(x)]orderselementsofx,withNAslastcumsum()cumprod()rev()#reversetheorderofvectorelementsThefunctionsmean(),median(),range(),andanumberofotherfunctions,taketheargumentna.rm=T;i.e.removeNAs,thenproceedwiththecalculation.Bydefault,sort()omitsanyNAs.Thefunctionorder()placesNAslast.Hence:>x<-c(1,20,2,NA,22)>order(x)[1]13254>x[order(x)]16Thelistincludeallpackagesthatareinthecurrentenvironment.16 [1]122022NA>sort(x)[1]1220222.6.1ApplyingafunctiontoallcolumnsofadataframeThefunctionsapply()doesthis.Ittakesasargumentsthenameofthedataframe,andthefunctionthatisto17beapplied.Hereareexamples,usingthesupplieddatasetrainforest.>sapply(rainforest,is.factor)dbhwoodbarkrootrootskbranchspeciesFALSEFALSEFALSEFALSEFALSEFALSETRUE>sapply(rainforest[,-7],range)#Thefinalcolumn(7)isafactordbhwoodbarkrootrootskbranch[1,]4NANANANANA[2,]56NANANANANAThefunctionsmean()andrange(),andanumberofotherfunctions,takeparametersna.rm.Forexample>range(rainforest$branch,na.rm=T)#OmitNAs,thendeterminetherange[1]4120Onecanspecifyna.rm=Tasathirdargumenttothefunctionsapply.Thisargumentisthenautomaticallypassedtothefunctionthatisspecifiedinthesecondargumentposition.Forexample:>sapply(rainforest[,-7],range,na.rm=T)dbhwoodbarkrootrootskbranch[1,]43820.34[2,]56153010513524.0120Chapter8hasfurtherdetailsontheuseofsapply().Thereisanexamplethatshowshowtouseittocountthenumberofmissingvaluesineachcolumnofdata.2.7MakingTablestable()makesatableofcounts.Specifyonevectorofvalues(oftenafactor)foreachtablemarginthatisrequired.Forexample:>library(lattice)#Thedataframebarleyaccompanieslattice>table(barley$year,barley$site)GrandRapidsDuluthUniversityFarmMorrisCrookstonWaseca19321010101010101931101010101010WARNING:NAsarebydefaultignored.TheactionneededtogetNAstabulatedunderaseparateNAcategorydepends,annoyingly,onwhetherornotthevectorisafactor.Ifthevectorisnotafactor,specifyexclude=NULL.Ifthevectorisafactorthenitisnecessarytogenerateanewfactorthatincludes“NA”asalevel.Specifyx<-factor(x,exclude=NULL)>x_c(1,5,NA,8)>x<-factor(x)>x[1]15NA8Levels:158>factor(x,exclude=NULL)17Source:Ash,J.andSouthern,W.1982:ForestbiomassatButler’sCreek,Edith&JoyLondonFoundation,NewSouthWales,Unpublishedmanuscript.SeealsoAsh,J.andHelman,C.1990:Floristicsandvegetationbiomassofaforestcatchment,Kioloa,southcoastalN.S.W.Cunninghamia,2(2):167-182.17 [1]15NA8Levels:158NA2.7.1NumbersofNAsinsubgroupsofthedataThefollowinggivesinformationonthenumberofNAsinsubgroupsofthedata:>table(rainforest$species,!is.na(rainforest$branch))FALSETRUEAcaciamabellae610C.fraseri012Acmenasmithii1511B.myrtifolia110ThusforAcaciamabellaethereare6NAsforthevariablebranch(i.e.numberofbranchesover2cmindiameter),outofatotalof16datavalues.2.8TheSearchListRhasasearchlistwhereitlooksforobjects.Thiscanbechangedinthecourseofasession.Togetafulllistofthesedirectories,calleddatabases,type:>search()[1]".GlobalEnv""package:methods""package:stats"[4]"package:graphics""package:grDevices""package:utils"[7]"package:datasets""Autoloads""package:base"Noticethattheloadingofanewpackageextendsthesearchlist.>library(MASS)>search()[1]".GlobalEnv""package:MASS""package:methods"[4]"package:stats""package:graphics""package:grDevices"[7]"package:utils""package:datasets""Autoloads"[10]"package:base"Useofattach()likewiseextendsthesearchlist.Thisfunctioncanbeusedtoattachdataframesorlists(usethename,withoutquotes)orimage(.RData)files(thefilenameisplacedinquotes).Thefollowingdemonstratestheattachingofthedataframeprimates:>names(primates)[1]"Bodywt""Brainwt">BodywtError:Object"Bodywt"notfound>attach(primates)#RwillnowknowwheretofindBodywt>Bodywt[1]10.0207.062.06.852.2Oncethedataframeprimateshasbeenattached,itscolumnscanbeaccessedbygivingtheirnames,withoutfurtherreferencetothenameofthedataframe.Intechnicalterms,thedataframebecomesadatabase,whichissearchedasrequiredforobjectsthattheusermayspecify.2.9FunctionsinRWegivetwosimpleexamplesofRfunctions.2.9.1AnApproximateMilestoKilometersConversionmiles.to.km<-function(miles)miles*8/518 18Thereturnvalueisthevalueofthefinal(andinthisinstanceonly)expressionthatappearsinthefunctionbody.Usethefunctionthus>miles.to.km(175)#ApproximatedistancetoSydney,inmiles[1]280Thefunctionwilldotheconversionforseveraldistancesallatonce.Toconvertavectorofthethreedistances100,200and300milestodistancesinkilometers,specify:>miles.to.km(c(100,200,300))[1]1603204802.9.2APlottingfunctionThedatasetfloridahasthevotesinthe2000electionforthevariousUSPresidentialcandidates,countybycountyinthestateofFlorida.ThefollowingplotsthevoteforBuchananagainstthevoteforBush.attach(florida)plot(BUSH,BUCHANAN,xlab="Bush",ylab="Buchanan")detach(florida)#InS-PLUS,specifydetach("florida")Hereisafunctionthatmakesitpossibletoplotthefiguresforanypairofcandidates.plot.florida<-function(xvar=”BUSH”,yvar=”BUCHANAN”){x<-florida[,xvar]y<-florida[,yvar]plot(x,y,xlab=xvar,ylab=yvar)mtext(side=3,line=1.75,“VotesinFlorida,bycounty,in the2000USPresidentialelection”)}Notethatthefunctionbodyisenclosedinbraces({}).Aswellasplot.florida(),thisallows,e.g.plot.florida(yvar=”NADER”)#yvar=”NADER”over-ridesthedefaultplot.florida(xvar=”GORE”,yvar=”NADER”)Figure7showsthegraphproducedbyplot.florida(),i.e.parametersettingsareleftattheirdefaults.VotesinFlorida,bycounty,inthe2000USPresidentialelection00530052NANA0H0C51UB0050050000150000250000BUSHFigure7:Electionnightcountofvotesreceived,bycounty,intheUS2000Presidentialelection.18Alternativelyareturnvaluemaybegivenusinganexplicitreturn()statement.Thisishoweveranuncommonconstruction19 2.10MoreDetailedInformationThischapterhasgiventheminimumdetailthatseemsnecessaryforgettingstarted.Lookinchapters7and8foramoredetailedcoverageofthetopicsinthischapter.Itmaypay,atthispoint,toglancethroughchapters7and8toseewhatisthere.RememberalsotousetheRhelp.Topicsfromchapter7,additionaltothosecoveredabove,thatmaybeimportantforrelativelyelementaryusesofRinclude:oTheentryofpatterneddata(7.1.3)oThehandlingofmissingvaluesinsubscriptswhenvectorsareassigned(7.2)oUnexpectedconsequences(e.g.conversionofcolumnsofnumericdataintofactors)fromerrorsindata(7.4.1).2.11Exercises1.Foreachofthefollowingcodesequences,predicttheresult.Thendothecomputation:a)answer<-0for(jin3:5){answer<-j+answer}b)answer<-10for(jin3:5){answer<-j+answer}c)answer<-10for(jin3:5){answer<-j*answer}2.Lookupthehelpforthefunctionprod(),anduseprod()todothecalculationin1(c)above.Alternatively,howwouldyouexpectprod()towork?Tryit!3.Addupallthenumbersfrom1to100intwodifferentways:usingforandusingsum.Nowapplythefunctiontothesequence1:100.Whatisitsaction?4.Multiplyallthenumbersfrom1to50intwodifferentways:usingforandusingprod.35.Thevolumeofasphereofradiusrisgivenby4r/3.Forsphereshavingradii3,4,5,…,20findthecorrespondingvolumesandprinttheresultsoutinatable.Usethetechniqueofsection2.1.5toconstructadataframewithcolumnsradiusandvolume.6.Usesapply()toapplythefunctionis.factortoeachcolumnofthesupplieddataframetinting.Foreachofthecolumnsthatareidentifiedasfactors,determinethelevels.Whichcolumnsareorderedfactors?[Useis.ordered()].20 3.PlottingThefunctionsplot(),points(),lines(),text(),mtext(),axis(),identify()etc.formasuitethatplotspoints,linesandtext.ToseesomeofthepossibilitiesthatRoffers,enterdemo(graphics)PresstheEnterkeytomovetoeachnewgraph.3.1plot()andalliedfunctionsThefollowingbothplotyagainstx:plot(y~x)#Useaformulatospecifythegraphplot(x,y)#Obviouslyxandymustbethesamelength.Tryplot((0:20)*pi/10,sin((0:20)*pi/10))plot((1:30)*0.92,sin((1:30)*0.92))Commentontheappearancethatthesegraphspresent.Isitobviousthatthesepointslieonasinecurve?Howcanonemakeitobvious?(Placethecursoroverthelowerborderofthegraphsheet,untilitbecomesadouble-sidedarror.Dragtheborderintowardsthetopborder,makingthegraphsheetshortandwide.)Herearetwofurtherexamples.attach(elasticband)#Rnowknowswheretofinddistance&stretchplot(distance~stretch)plot(ACT~Year,data=austpop,type="l")plot(ACT~Year,data=austpop,type="b")19Thepoints()functionaddspointstoaplot.Thelines()functionaddslinestoaplot.Thetext()functionaddstextatspecifiedlocations.Themtext()functionplacestextinoneofthemargins.Theaxis()functiongivesfinecontroloveraxisticksandlabels.Hereisafurtherpossibilityattach(austpop)plot(spline(Year,ACT),type="l")#Fitsmoothcurvethroughpointsdetach(austpop)#InS-PLUS,specifydetach(“austpop”)3.1.1NewerplotmethodsAbove,Idescribedthedefaultplotmethod.Theplotfunctionisagenericfunctionthathasspecialmethodsfor“plotting”variousdifferentclassesofobject.Forexample,plottingadataframegives,foreachnumericvariable,anormalprobabilityplot.Plottingthelmobjectthatiscreatedbytheuseofthelm()modellingfunctiongivesdiagnosticandotherinformationthatisintendedtohelpintheinterpretationofregressionresults.Tryplot(hills)#Hasthesameeffectaspairs(hills)3.2Finecontrol–ParametersettingsThedefaultsettingsofparameters,suchascharactersize,areoftenadequate.Whenitisnecessarytochangeparametersettingsforasubsequentplot,thepar()functiondoesthis.Forexample,par(cex=1.25,mex=1.25)#character(cex)&margin(mex)expansion19Actuallythesefunctionsdifferonlyinthedefaultsettingfortheparametertype.Thedefaultsettingforpoints()istype="p",andforlines()istype="l".Explicitlysettingtype="p"causeseitherfunctiontoplotpoints,type="l"giveslines.21 increasesthetextandplotsymbolsize25%abovethedefault.Theadditionofmex=1.25makesroominthemargintoaccommodatetheincreasedtextsize.Onthefirstuseofpar()tomakechangestothecurrentdevice,itisoftenusefultostoreexistingsettings,sothattheycanberestoredlater.Forthis,specifyoldpar<-par(cex=1.25,mex=1.25)Thisstorestheexistingsettingsinoldpar,thenchangesparameters(herecexandmex)asrequested.Torestoretheoriginalparametersettingsatsomelatertime,enterpar(oldpar).Hereisanexample:attach(elasticband)oldpar<-par(cex=1.5,mex=1.5)plot(distance~stretch)par(oldpar)#Restorestheearliersettingsdetach(elasticband)Insideafunctionspecify,e.g.oldpar<-par(cex=1.25,mex=1.25)on.exit(par(oldpar))Typeinhelp(par)togetdetailsofalltheparametersettingsthatareavailablewithpar().3.2.1MultipleplotsontheonepageTheparametermfrowcanbeusedtoconfigurethegraphicssheetsothatsubsequentplotsappearrowbyrow,oneaftertheotherinarectangularlayout,ontheonepage.Foracolumnbycolumnlayout,usemfcolinstead.Intheexamplebelowwepresentfourdifferenttransformationsoftheprimatesdata,inatwobytwolayout:par(mfrow=c(2,2),pch=16)data(Animals)#NeededifAnimals(MASSpackage)isnotalreadyloadedattach(Animals)plot(body,brain)plot(sqrt(body),sqrt(brain))plot((body)^0.1,(brain)^0.1)plot(log(body),log(brain))detach(Animals)par(mfrow=c(1,1),pch=1)#Restoreto1figureperpage3.2.2TheshapeofthegraphsheetOftenitisdesirabletoexercisecontrolovertheshapeofthegraphpage,e.g.sothattheindividualplotsarerectangularratherthansquare.TheRforWindowsfunctionswin.graph()orx11()thatsetuptheWindowsscreentaketheparameterswidth(ininches),height(ininches)andpointsize(in1/72ofaninch).Thesettingofpointsize(default=12)determinescharacterheights.ItistherelativesizesoftheseparametersthatmatterforscreendisplayorforincorporationintoWordandsimilarprograms.Graphscanbeenlargedorshrunkbypointingatonecorner,holdingdowntheleftmousebutton,andpulling.3.3Addingpoints,linesandtextHereisasimpleexamplethatshowshowtousethefunctiontext()toaddtextlabelstothepointsonaplot.>primatesBodywtBrainwtPotarmonkey10.0115Gorilla207.0406Human62.01320Rhesusmonkey6.8179Chimp52.244022 20Observethattherownamesstorelabelsforeachrow.attach(primates)#Neededifprimatesisnotalreadyattached.plot(Bodywt,Brainwt,xlim=c(5,270))#Specifyxlimsothatthereisroomforthelabelstext(x=Bodywt,y=Brainwt,labels=row.names(primates),adj=0)#adj=0impliesleftadjustedtextFigure8Ashowstheresult.ABHumanHuman00)000(g01t1thwiginera0w0B06in06raChimpBChimpGorillaGorilla0002Rhesusmonkey02RhesusmonkeyPotarmonkeyPotarmonkey050100200050100200BodywtBodyweight(kg)Figure8:Plotoftheprimatesdata,withlabelsonpoints.Figure8BisanimprovedversionofFigure8A.Figure8Awouldbeadequateforidentifyingpoints,butisnotapresentationqualitygraph.Wenowshowhowtoimproveit.InFigure8Bweusethexlab(x-axis)andylab(y-axis)parameterstospecifymeaningfulaxistitles.Wemovethelabellingtoonesideofthepointsbyincludingappropriatehorizontalandverticaloffsets.Weusechw<-par()$cxy[1]togeta1-characterspacehorizontaloffset,andchh<-par()$cxy[2]togeta1-characterheightverticaloffset.I’veusedpch=16tomaketheplotcharacteraheavyblackdot.Thishelpsmakethepointsstandoutagainstthelabelling.HereistheRcodeforFigure8B:plot(x=Bodywt,y=Brainwt,pch=16,xlab="Bodyweight(kg)",ylab="Brainweight(g)",xlim=c(5,240),ylim=c(0,1500))chw<-par()$cxy[1]chh<-par()$cxy[2]text(x=Bodywt+chw,y=Brainwt,labels=row.names(primates),adj=0)detach(primates)Toplacethetexttotheleftofthepoints,specifytext(x=Bodywt-0.75*chw,y=Brainwt,labels=row.names(primates),adj=1)20Rownamescanbecreatedinseveraldifferentways.Theycanbeassigneddirectly,e.g.row.names(primates)<-c("Potarmonkey","Gorilla","Human","Rhesusmonkey","Chimp")Whenusingread.table()toinputdata,theparameterrow.namesisavailabletospecify,bynumberorname,acolumnthatholdstherownames.23 3.3.1Size,colourandchoiceofplottingsymbolForplot()andpoints()theparametercex(“characterexpansion”)controlsthesize,whilecol(“colour”)controlsthecolouroftheplottingsymbol.Theparameterpchcontrolsthechoiceofplottingsymbol.Theparameterscexandcolmaybeusedinasimilarwaywithtext().Tryplot(1,1,xlim=c(1,7.5),ylim=c(0,5),type="n")#Donotplotpointspoints(1:7,rep(4.5,7),cex=1:7,col=1:7,pch=0:6)text(1:7,rep(3.5,7),labels=paste(0:6),cex=1:7,col=1:7)Thefollowing,addedtotheplotthatresultsfromtheabovethreestatements,demonstratesotherchoicesofpch.points(1:7,rep(2,7),pch=(0:6)+7)#Plotsymbols7to13text((1:7)+0.25,rep(2,7),paste((0:6)+7))#Labelwithsymbolnumberpoints(1:7,rep(1,7),pch=(0:6)+14)#Plotsymbols14to20text((1:7)+0.25,rep(1,7),paste((0:6)+14))#LabelswithsymbolnumberHere(Figure9)istheplot:540123453627891011121311415161718192001234567Figure9:Differentplotsymbols,coloursandsizesAvarietyofcolorpalettesareavailable.Hereisafunctionthatdisplayssomeofthepossibilities:view.colours<-function(){plot(1,1,xlim=c(0,14),ylim=c(0,3),type="n",axes=F,xlab="",ylab="")text(1:6,rep(2.5,6),paste(1:6),col=palette()[1:6],cex=2.5)text(10,2.5,"Defaultpalette",adj=0)rainchars<-c("R","O","Y","G","B","I","V")text(1:7,rep(1.5,7),rainchars,col=rainbow(7),cex=2.5)text(10,1.5,"rainbow(7)",adj=0)cmtxt<-substring("cm.colors",1:9,1:9)#Split“cm.colors”intoits9characterstext(1:9,rep(0.5,9),cmtxt,col=cm.colors(9),cex=3)text(10,0.5,"cm.colors(9)",adj=0)}Torunthefunction,enterview.colours()24 3.3.2AddingTextintheMarginmtext(side,line,text,..)addstextinthemarginofthecurrentplot.Thesidesarenumbered1(x-axis),2(y-axis),3(top)and4.3.4IdentificationandLocationontheFigureRegionTwofunctionsareavailableforthispurpose.Drawthegraphfirst,thencalloneorotherofthesefunctions.identify()labelspoints.Onepositionsthecursornearthepointthatistobeidentified,andclickstheleftmousebutton.locator()printsouttheco-ordinatesofpoints.Onepositionsthecursoratthelocationforwhichcoordinatesarerequired,andclickstheleftmousebutton.Aclickwiththerightmousebuttonsignifiesthattheidentificationorlocationtaskiscomplete,unlessthesettingoftheparameternisreachedfirst.Foridentify()thedefaultsettingofnisthenumberofdatapoints,whileforlocator()thedefaultsettingisn=500.3.4.1identify()Thisfunctionrequiresspecificationofavectorx,avectory,andavectoroftextstringsthatareavailableforusealabels.ThedatasetfloridahasthevotesforthevariousPresidentialcandidates,countybycountyinthestateofFlorida.WeplotthevoteforBuchananagainstthevoteforBush,theninvokingidentify()sothatwecanlabelselectedpointsontheplot.attach(florida)plot(BUSH,BUCHANAN,xlab=”Bush”,ylab=”Buchanan”)identify(BUSH,BUCHANAN,County)detach(florida)Clicktotheleftorright,andslightlyaboveorbelowapoint,dependingonthepreferredpositioningofthelabel.Whenlabellingisterminated(clickwiththerightmousebutton),therownumbersoftheobservationsthathavebeenlabelledareprintedonthescreen,inorder.3.4.2locator()Leftclickatthelocationswhosecoordinatesarerequiredattach(florida)#ifnotalreadyattachedplot(BUSH,BUCHANAN,xlab=”Bush”,ylab=”Buchanan”)locator()detach(florida)Thefunctioncanbeusedtomarknewpoints(specifytype=”p”)orlines(specifytype=”l”)orbothpointsandlines(specifytype=”b”).3.5PlotsthatshowthedistributionofdatavaluesWediscusshistograms,densityplots,boxplotsandnormalprobabilityplots.3.5.1HistogramsTheshapesofhistogramsdependontheplacementofthebreaks,asFigure10illustrates:25 A:Breaksat72.5,77.5,...B:Breaksat75,80,...002255y1y1ccnneeuuq0q0e1e1rrFF550075808590957580859095100TotallengthTotallengthFigure10:Thetwographsshowthesamedata,butwithadifferentchoiceofbreakpoints.Hereisthecodeusedtoplotthehistograms:par(mfrow=c(1,2))attach(possum)here<-sex=="f"hist(totlngth[here],breaks=72.5+(0:5)*5,ylim=c(0,22),xlab="Totallength",main="A:Breaksat72.5,77.5,...")hist(totlngth[here],breaks=75+(0:5)*5,ylim=c(0,22),xlab="Totallength",main="B:Breaksat75,80,...")par(mfrow=c(1,1))detach(possum)3.5.2DensityPlotsDensityplots,nowthattheyareavailable,areoftenapreferredalternativetoahistogram.InFigure11thehistogramsfromFigure10areoverlaidwithadensityplot.88yc.0yc.0n0n0eeuuqqererFFe4e4vti.0vti.000lalaeeRR00.0.000707580859095100707580859095100TotallengthTotallengthFigure11:OneachofthehistogramsfromFigure10adensityplothasbeenoverlaid.Densityplotsdonotdependonachoiceofbreakpoints.Thechoiceofwidthandtypeofwindow,controllingthenatureandamountofsmoothing,doesaffecttheappearanceoftheplot.Themaineffectistomakeitmoreorlesssmooth.26 Thefollowingwillgiveadensityplot:attach(possum)plot(density(totlngth[here]),type="l")detach(possum)NotethatinFig.12they-axisforthehistogramislabelledsothattheareaofarectangleisthefrequencyforthatrectangle.Togettheplotontheleft,specify:attach(possum)here<-sex=="f"dens<-density(totlngth[here])xlim<-range(dens$x)ylim<-range(dens$y)hist(totlngth[here],breaks=72.5+(0:5)*5,probability=T,xlim=xlim,ylim=ylim,xlab="Totallength",main="")lines(dens)detach(possum)3.5.3BoxplotsWenowmakeaboxplotoftheabovedata:attach(possum)boxplot(totlngth[here])detach(possum)Figure12addsinformationthatshouldassistintheinterpretationofboxplots.Largestvalue(outliersexcepted)59090.5upperquartileInter-quartilerange9=90.5-85.25)=5.2mmedianComparec0.75xInter-QuartileRange(th=3.9gnwithstandarddeviationlel585.25lowerquartile=4.28taoTSmallestvalue0(outliersexcepted)857OutlierFigure12:Boxplotoffemalepossumlengths,withadditionallabellinginformation.27 3.5.4Normalprobabilityplotsqqnorm(y)givesanormalprobabilityplotoftheelementsofy.ThepointsofthisplotwilllieapproximatelyonastraightlineifthedistributionisNormal.Inordertocalibratetheeyetorecogniseplotsthatindicatenon-normalvariation,itishelpfultodoseveralnormalprobabilityplotsforrandomsamplesoftherelevantsizefromanormaldistribution.x11(width=8,height=6)#Thisisabettershapeforthisplotattach(possum)here<-sex=="f"par(mfrow=c(3,4))#A3by4layoutofplotsy<-totlngth[here]qqnorm(y,xlab="",ylab="Length",main="Possums")for(iin1:11)qqnorm(rnorm(43),xlab="",ylab="Simulatedlengths",main="Simulated")detach(possum)#Beforecontinuing,typedev.off()Figure13showstheplots.Thereisoneunusuallysmallvalue.Otherwisethepointsforthefemalepossumlengthsareasclosetoastraightlineasinmanyoftheplotsforrandomnormaldata.PossumsSimulatedSimulatedSimulated225sss9h1hhtgtg1tg1nnnhte0e0eglll0n5dededee8tttLla1-lalauuum2m2-m2-5Si-SiSi7-2-1012-2-1012-2-1012-2-1012SimulatedSimulatedSimulatedSimulated3ssss2ht.01ht2ht2htggggnene1ne1ne1llllde.5de0de0de0t0tttla-lalalauuuuimm2m2mS.0Si-Si-Si2-2--2-1012-2-1012-2-1012-2-1012SimulatedSimulatedSimulatedSimulateds2s.0sshtht1ht1ht1ggggn1nnnelel.0el0el0dd0ddet0etet1-etlalalalau1uuu-2imm.5mm-S2Si1-Si3Si--.-2-1012-2-1012-2-1012-2-1012Figure13:Normalprobabilityplots.Ifdataarefromanormaldistributionthenpointsshouldfall,approximately,alongaline.Theplotinthetoplefthandcornershowsthe43lengthsoffemalepossums.Theotherplotsareforindependentnormalrandomsamplesofsize43.Theideaisanimportantone.Inordertojudgewhetherdataarenormallydistributed,examineanumberofrandomlygeneratedsamplesofthesamesizefromanormaldistribution.Itisawaytotraintheeye.Bydefault,rnorm()generatesrandomsamplesfromadistributionwithmean0andstandarddeviation1.28 3.6OtherUsefulPlottingFunctions21Forthefunctionsdemonstratedhere,weusedataontheheightsof100femaleathletes.3.6.1Scatterplotsmoothingpanel.smooth()plotspoints,thenaddsasmoothcurvethroughthepoints.Forexample:attach(ais)here<-sex=="f"plot(pcBfat[here]~ht[here],xlab=“Height”,ylab=“%Bodyfat”)panel.smooth(ht[here],pcBfat[here])detach(ais)3.6.2AddinglinestoplotsUsethefunctionabline()forthis.Theparametersmaybeaninterceptandslope,oravectorthatholdstheinterceptandslope,oranlmobject.Alternativelyitispossibletodrawahorizontalline(h=),oraverticalline(v=).attach(ais)here<-sex=="f"plot(pcBfat[here]~ht[here],xlab=“Height”,ylab=“%Bodyfat”)abline(lm(pcBfat[here]~ht[here]))detach(ais)3.6.3RugplotsBydefaultrug(x)adds,alongthex-axisofthecurrentplot,verticalbarsshowingthedistributionofvaluesofx.Itcanhoweverbeparticularlyusefulforshowingtheactualvaluesalongthesideofaboxplot.Figure14showsaboxplotofthedistributionofheightoffemaleathletes,witharugplotaddedonthey-axis.091081thgne0L71061051Figure14:Distributionofheightsoffemaleathletes.Thebarsontheleftplotshowactualdatavalues.Hereisthecode21Datarelatetothepaper:Telford,R.D.andCunningham,R.B.1991:Sex,sportandbody-sizedependencyofhematologyinhighlytrainedathletes.MedicineandScienceinSportsandExercise23:788-794.29 attach(ais)here<-sex=="f"boxplot(ht[here],boxwex=0.15,ylab="Height")rug(ht[here],side=2)detach(ais)Theparameterboxwexcontrolsthewidthoftheboxplot.3.6.4ScatterplotmatricesSection2.1.3demonstratedtheuseofthepairs()function.3.6.5DotchartsThesecanbeagoodalternativetobarcharts.Theyhaveamuchhigherinformationtoinkratio!Trydata(islands)#Useforversions<=1.9.1;basepackagedotchart(islands)#vectorofnamednumericvaluesUnfortunatelytherearemanynames,andthereissubstantialoverlap.Thefollowingisbetter,butshrinksthesizesofthepointssothattheyalmostdisappear:dotchart(islands,cex=0.5)3.7PlottingMathematicalSymbolsBothtext()andmtext()willtakeanexpressionratherthanatextstring.Inplot(),eitherorbothofxlabandylabcanbeanexpression.Figure15wasproducedwithplot(x,y,xlab=”Radius”,ylab=expression(Area==pi*r^2))000800062rp0=00a4erA0002001020304050RadiusFigure15:They-axislabelisamathematicalexpression.Noticethatinexpression(Area==pi*r^2),thereisadoubleequalssign(“==”),althoughwhatwillappearontheplotisArea=pi*r^2,withasingleequalssign.ThereasonforthisisthatArea==pi*r^2isavalidmathematicalexpression,whileArea=pi*r^2isnot.Seehelp(plotmath)fordetailedinformationontheplottingofmathematicalexpressions.Thereisafurtherexampleinchapter12.Thefinalplotfromdemo(graphics)30 showssomeofthepossibilitiesforplottingmathematicalsymbols.3.8GuidelinesforGraphsDesigngraphstomaketheirpointterselyandclearly,withaminimumwasteofink.Labelasnecessarytoidentifyimportantfeatures.Inscatterplotsthegraphshouldattracttheeye’sattentiontothepointsthatareplotted,andtoimportantgroupinginthedata.Usesolidpoints,largeenoughtostandoutrelativetootherfeatures,whenthereislittleornooverlap.Whenthereisextensiveoverlapofplottingsymbols,useopenplottingsymbols.Wherepointsaredense,overlappingpointswillgiveahighinkdensity,whichisexactlywhatonewants.Usescatterplotsinpreferencetobarorrelatedgraphswheneverthehorizontalaxisrepresentsaquantitativeeffect.Usegraphsfromwhichinformationcanbereaddirectlyandeasilyinpreferencetothosethatrelyonvisualimpressionandperspective.Thusinscientificpaperscontourplotsaremuchpreferabletosurfaceplotsortwo-dimensionalbargraphs.Drawgraphssothatreductionandreproductionwillnotinterferewithvisualclarity.Explainclearlyhowerrorbarsshouldbeinterpreted—SElimits,95%confidenceinterval,SDlimits,orwhatever.Explainwhatsourceof`error(s)’isrepresented.Itispointlesstopresentinformationonasourceoferrorthatisoflittleornointerest,forexampleanalyticalerrorwhentherelevantsourceof`error’forcomparisonoftreatmentsisbetweenfruit.Usecolourordifferentplottingsymbolstodistinguishdifferentgroups.Takecaretousecoloursthatcontrast.Thelistofreferencesattheendofthischapterhasfurthercommentsongraphicalandotherpresentationissues.3.9Exercises1.Plotthegraphofbrainweight(brain)versusbodyweight(body)forthedatasetAnimalsfromtheMASSpackage.Labeltheaxesappropriately.[Toaccessthisdataframe,specifylibrary(MASS);data(Animals)]2.Repeattheplot1,butthistimeplottinglog(brainweight)versuslog(bodyweight).Usetherowlabelstolabelthepointswiththethreelargestbodyweightvalues.Labeltheaxesinuntransformedunits.3.Repeattheplots1and2,butthistimeplacetheplotssidebysideontheonepage.4.ThedatasethuronthataccompaniesthesenoteshasmeanJulyaveragewatersurfaceelevations,infeet,22IGLD(1955)forHarborBeach,Michigan,onLakeHuron,Station5014,for1860-1986.(AlternativelyyoucanworkwiththevectorLakeHuronfromthedatasetspackage,thathasmeanheightsfor1875-1772only.)a)Plotmean.heightagainstyear.b)Usetheidentifyfunctiontodeterminewhichyearscorrespondtothelowestandhighestmeanlevels.Thatis,typeidentify(huron$year,huron$mean.height,labels=huron$year)andusetheleftmousebuttontoclickonthelowestpointandhighestpointontheplot.Toquit,pressbothmousebuttonssimultaneously.c)Asinthecaseofmanytimeseries,themeanlevelsarecorrelatedfromyeartoyear.Toseehoweachyear'smeanlevelisrelatedtothepreviousyear'smeanlevel,uselag.plot(huron$mean.height)Thisplotsthemeanlevelatyeariagainstthemeanlevelatyeari-1.22Source:GreatLakesWaterLevels,1860-1986.U.S.Dept.ofCommerce,NationalOceanicandAtmosphericAdministration,NationalOceanSurvey.31 235.Checkthedistributionsofheadlengths(hdlngth)inthepossumdatasetthataccompaniesthesenotes.Comparethefollowingformsofdisplay:a)ahistogram(hist(possum$hdlngth));b)astemandleafplot(stem(qqnorm(possum$hdlngth));c)anormalprobabilityplot(qqnorm(possum$hdlngth));andd)adensityplot(plot(density(possum$hdlngth)).Whataretheadvantagesanddisadvantagesofthesedifferentformsofdisplay?6.Tryx<-rnorm(10).Printoutthenumbersthatyouget.Lookupthehelpforrnorm.Nowgenerateasampleofsize10fromanormaldistributionwithmean170andstandarddeviation4.7.Usemfrow()tosetupthelayoutfora3by4arrayofplots.Inthetop4rows,shownormalprobabilityplots(section3.4.2)forfourseparate`random’samplesofsize10,allfromanormaldistribution.Inthemiddle4rows,displayplotsforsamplesofsize100.Inthebottomfourrows,displayplotsforsamplesofsize1000.Commentonhowtheappearanceoftheplotschangesasthesamplesizechanges.8.Thefunctionrunif()canbeusedtogenerateasamplefromauniformdistribution,bydefaultontheinterval0to1.Tryx<-runif(10),andprintoutthenumbersyouget.Thenrepeatexercise6above,buttakingsamplesfromauniformdistributionratherthanfromanormaldistribution.Whatshapedothepointsfollow?*9.Ifyoufindexercise8interesting,youmightliketotryitforsomefurtherdistributions.Forexamplex<-rchisq(10,1)willgenerate10randomvaluesfromachi-squareddistributionwithdegreesoffreedom1.Thestatementx<-rt(10,1)willgenerate10randomvaluesfromatdistributionwithdegreesoffreedomone.Makenormalprobabilityplotsforsamplesofvarioussizesfromthesedistributions.10.Forthefirsttwocolumnsofthedataframehills,examinethedistributionusing:(a)histograms(b)densityplots(c)normalprobabilityplots.Repeat(a),(b)and(c),nowworkingwiththelogarithmsofthedatavalues.3.10ReferencesBellLab'sTrellisPage:http://cm.bell-labs.com/cm/ms/departments/sia/project/trellis/Becker,R.A.,Cleveland,W.S.andShyu,M.TheVisualDesignandControlofTrellisDisplay.JournalofComputationalandGraphicalStatistics.Cleveland,W.S.1993.VisualizingData.HobartPress,Summit,NewJersey.Cleveland,W.S.1985.TheElementsofGraphingData.Wadsworth,Monterey,California.MaindonaldJH1992.Statisticaldesign,analysisandpresentationissues.NewZealandJournalofAgriculturalResearch35:121-141.MaindonaldJHandBraunWJ2003.DataAnalysisandGraphicsUsingR–AnExample-BasedApproach.CambridgeUniversityPress.Tufte,E.R.1983.TheVisualDisplayofQuantitativeInformation.GraphicsPress,Cheshire,Connecticut,U.S.A.Tufte,E.R.1990.EnvisioningInformation.GraphicsPress,Cheshire,Connecticut,U.S.A.Tufte,E.R.1997.VisualExplanations.GraphicsPress,Cheshire,Connecticut,U.S.A.Wainer,H.1997.VisualRevelations.Springer-Verlag,NewYork23Datarelatetothepaper:Lindenmayer,D.B.,Viggers,K.L.,Cunningham,R.B.,andDonnelly,C.F.1995.Morphologicalvariationamongpopulationsofthemountainbrushtailpossum,TrichosuruscaninusOgilby(Phalangeridae:Marsupialia).AustralianJournalofZoology43:449-458.32 4.LatticegraphicsLatticeplotsallowtheuseofthelayoutonthepagetoreflectmeaningfulaspectsofdatastructure.TheyofferabilitiessimilartothoseintheS-PLUStrellislibrary.Thelatticepackagesitsontopofthegridpackage.Touselatticegraphics,boththesepackagesmustbeinstalled.Providingitisinstalled,thegridpackagewillbeloadedautomaticallywhenlatticeisloaded.Theoldercoplot()functionthatisinthebasepackagehassomeofsameabilitiesasxyplot(),butwithalimitationtotwoconditioningfactorsorvariablesonly.4.1ExamplesthatPresentPanelsofScatterplots–Usingxyplot()Thebasicfunctionfordrawingpanelsofscatterplotsisxyplot().Wewillusethedataframetinting(supplied)todemonstratetheuseofxyplot().Thesedataarefromanexperimentthatinvestigatedthe24effectsoftintingofcarwindowsonvisualperformance.Theauthorsweremainlyinterestedinvisualrecognitiontasksthatwouldbeperformedwhenlookingthroughsidewindows.Inthisdataframe,csoa(criticalstimulusonsetasynchrony,i.e.thetimeinmillisecondsrequiredtorecogniseanalphanumerictarget),it(inspectiontime,i.e.thetimerequiredforasimplediscriminationtask)andagearevariables,tint(leveloftinting:no,lo,hi)andtarget(contrast:locon,hicon)areorderedfactors,sex(1=male,2=female)andagegp(1=young,intheearly20s;2=anolderparticipant,intheearly70s)arefactors.Figure16showsthestyleofgraphthatonecangetfromxyplot().Thedifferentsymbolsaredifferentcontrasts.loconhicon50100150200OlderOlderfm120100806040aYoungerYoungersocfm12010080604050100150200itFigure16:csoaversusit,foreachcombinationoffemales/malesandelderly/young.Thetwotargets(low,+=highcontrast)areshownwithdifferentsymbols.InasimplifiedversionofFigure16above,wemightplotcsoaagainstitforeachcombinationofsexandagegp.Forthissimplifiedversion,itwouldbeenoughtotype:xyplot(csoa~it|sex*agegp,data=tinting)#Simpleuseofxyplot()HereisthestatementusedtogetFigure16.Thetwodifferentsymbolsdistinguishbetweenlowcontrastandhighcontrasttargets.24Datarelatetothepaper:Burns,N.R.,Nettlebeck,T.,White,M.andWillson,J.1999.Effectsofcarwindowtintingonvisualperformance:acomparisonofelderlyandyoungdrivers.Ergonomics42:428-443.33 xyplot(csoa~it|sex*agegp,data=tinting,panel=panel.superpose,groups=target,auto.key=list(columns=2))Ifcolourisavailable,differentcolourswillbeusedforthedifferentgroups.Astrikingfeatureisthattheveryhighvalues,forbothcsoaandit,occuronlyforelderlymales.Itisapparentthatthelongresponsetimesforsomeoftheelderlymalesoccur,aswemighthaveexpected,withthelowcontrasttarget.Thefollowingputssmoothcurvesthroughthedata,separatelyforthetwotargettypes:xyplot(csoa~it|sex*agegp,data=tinting,panel=panel.superpose,groups=target,type=c("p","smooth")Therelationshipbetweencsoaanditseemsmuchthesameforbothlevelsofcontrast.Finally,wedoaplot(Figure17)thatusesdifferentsymbols(inblackandwhite)fordifferentlevelsoftinting.Thelongesttimesareforthehighleveloftinting.xyplot(csoa~it|sex*agegp,data=tinting,groups=tint,auto.key=list(columns=3))nolohi50100150200OlderOlderfm120100806040aYoungerYoungersocfm12010080604050100150200itFigure17:csoaversusit,foreachcombinationoffemales/malesandelderly/young.Thedifferentlevelsoftinting(no,+=low,>=high)areshownwithdifferentsymbols.AnincompletelistoflatticeFunctionssplom(~data.frame)#Scatterplotmatrixbwplot(factor~numeric,..)#Boxandwhiskerplotqqnorm(numeric,..)#normalprobabilityplotsdotplot(factor~numeric,..)#1-dim.Displaystripplot(factor~numeric,..)#1-dim.Displaybarchart(character~numeric,..)histogram(~numeric,..)densityplot(~numeric,..)#Smoothedversionofhistogramqqmath(numeric~numeric,..)#QQplotsplom(~dataframe,..)#Scatterplotmatrixparallel(~dataframe,..)#ParallelcoordinateplotsIneachinstance,conditioningvariablescanbeadded.34 4.3Exercises251.Thefollowingdatagivesmilkvolume(g/day)forsmokingandnonsmokingmothers:SmokingMothers:621,793,593,545,753,655,895,767,714,598,693NonsmokingMothers:947,945,1086,1202,973,981,930,745,903,899,961Presentthedata(i)insidebysideboxplots;(ii)usingadotplotformofdisplay.2.Repeattheplotasinexercise1,butthistimeincludingascatterplotsmoothoneachpanel.3.Forthepossumdataset,generatethefollowingplots:a)histogramsofhdlngth–usehist();b)normalprobabilityplotsofhdlngth–useqqnorm();c)densityplotsofhdlngth–useplot(density()).Investigatetheeffectofvaryingthedensitybandwidth(bw).4.Thefollowingexercisesrelatetothedataframepossumthataccompaniesthesenotes:(a)Usingthecoplotfunction,exploretherelationbetweenhdlngthandtotlngth,takingintoaccountsexandPop.(b)Constructacontourplotofchestversusbellyandtotlngth.(c)Constructboxandwhiskerplotsforhdlngth,usingsiteasafactor.(d)Constructnormalprobabilityplotsforhdlgth,foreachseparatelevelofsexandPop.Isthereevidencethatthedistributionofhdlgthvarieswiththeleveloftheseotherfactors.6.TheframeairqualitythatisinthedatasetspackagehascolumnsOzone,Solar.R,Wind,Temp,MonthandDay.PlotOzoneagainstSolar.Rforeachofthreetemperatureranges,andeachofthreewindranges.25Dataarefromthepaper``SmokingDuringPregnancyandLactationandItsEffectsonBreastMilkVolume''(Amer.J.ofClinicalNutrition).35 36 5.Linear(MultipleRegression)ModelsandAnalysisofVariance5.1TheModelFormulainStraightLineRegressionWebeginwiththestraightlineregressionexamplethatappearedearlier,insection2.1.4.Firstweplotthedata:plot(distance~stretch,data=elasticband)Thecodefortheregressioncalculationis:elastic.lm<-lm(distance~stretch,data=elasticband)Heredistance~stretchisamodelformula.Othermodelformulaewillappearinthecourseofthischapter.Figure18showstheplot:081061ecntais0d4102142444648505254stretchFigure18:Plotofdistanceversusstretchfortheelasticbanddata,withfittedleastsquareslineTheoutputfromtheregressionisanlmobject,whichwehavecalledelastic.lm.Nowexamineasummaryoftheregressionresults.Noticethattheoutputdocumentsthemodelformulathatwasused:>options(digits=4)>summary(elastic.lm)Call:lm(formula=distance~stretch,data=elasticband)Residuals:12345672.107-0.32118.0001.893-27.78613.321-7.214Coefficients:EstimateStd.ErrortvaluePr(>|t|)(Intercept)-63.5774.33-0.860.431stretch4.551.542.950.032Residualstandarderror:16.3on5degreesoffreedomMultipleR-Squared:0.635,AdjustedR-squared:0.562F-statistic:8.71on1and5degreesoffreedom,p-value:0.031937 5.2RegressionObjectsAnlmobjectisalistofnamedelements.Above,wecreatedtheobjectelastic.lm.Herearethenamesofitselements:>names(elastic.lm)[1]"coefficients""residuals""effects""rank"[5]"fitted.values""assign""qr""df.residual"[9]"xlevels""call""terms""model"Variousfunctionsareavailableforextractinginformationthatyoumightwantfromthelist.Thisisbetterthanmanipulatingthelistdirectly.Examplesare:>coef(elastic.lm)(Intercept)stretch-63.5714.554>resid(elastic.lm)12345672.1071-0.321418.00001.8929-27.785713.3214-7.2143Thefunctionmostoftenusedtoinspectregressionoutputissummary().Itextractstheinformationthatusersaremostlikelytowant.Forexample,insection5.1,wehadsummary(elastic.lm)ThereisaplotmethodforlmobjectsthatgivesthediagnosticinformationshowninFigure19.ResidualsvsFittedNormalQ-QplotScale-LocationplotCook'sdistanceplot052853ls0.6320.61lsa0u1.e1iduaid63c6esrseant0.als00.r86idued0de0.dis4sediziz's0.R10-rdadar0.d4okoan1-n0.C220-ttaS0.S33050.00-2-50.0.130150170-1.00.01.01301501701234567FittedvaluesTheoreticalQuantilesFittedvaluesObs.numberFigure19:Diagnosticplotoflmobject,obtainedbyplot(elastic.lm).TogetFigure19,type:x11(width=7,height=2,pointsize=10)par(mfrow=c(1,4),par()$mar=c(5.1,4.1,2.1,1.1))plot(elastic.lm)par(mfrow=c(1,1))Bydefaultthefirst,secondandfourthplotusetherownamestoidentifythethreemostextremeresiduals.[Ifexplicitrownamesarenotgivenforthedataframe,thentherownumbersareused.]5.3ModelFormulae,andtheXMatrixThemodelformulafortheelasticbandexamplewasdistance~stretch.Themodelformulaisarecipeforsettingupthecalculations.AllthecalculationsdescribedinthischapterrequiretheuseofanmodelmatrixorXmatrix,andavectoryofvaluesofthedependentvariable.Forsomeoftheexampleswediscusslater,ithelpstoknowwhattheXmatrixlookslike.Detailsfortheelasticbandexamplefollow.TheXmatrix,withthey-vectoralongside,is:38 XyStretch(mm)Distance(cm)146148154182148173150166144109142141152166Essentially,themodelmatrixrelatestothepartofthemodelthatappearstotherightoftheequalssign.Thestraightlinemodelisy=a+bx+residualwhichwewriteasy=1a+xb+residualTheparametersthataretobeestimatedareaandb.Fittedvaluesaregivenbymultiplyingeachcolumnofthemodelmatrixbyitscorrespondingparameter,i.e.thefirstcolumnbyaandthesecondcolumnbyb,andadding.Anothernameispredictedvalues.Theaimistoreproduce,ascloselyaspossible,thevaluesinthey-column.Theresidualsarethedifferencesbetweenthevaluesinthey-columnandthefittedvalues.Leastsquaresregression,whichistheformofregressionthatwedescribeinthiscourse,choosesaandbsothatthesumofsquaresoftheresidualsisassmallaspossible.Thefunctionmodel.matrix()printsoutthemodelmatrix.Thus:>model.matrix(distance~stretch,data=elasticband)(Intercept)stretch1146215431484150514461427152attr(,"assign")[1]01Anotherpossibility,withelastic.lmasinsection5.1,is:model.matrix(elastic.lm)Thefollowingarethefittedvaluesandresidualsthatwegetwiththeestimatesofa(=-63.6)andb(=4.55)thatresultfromleastsquaresregression:XyˆyyyˆStretch(mm)(Fitted)(Observed)(Residual)-63.64.55-63.6+4.55StretchDistance(mm)Observed-Fitted146-63.6+4.5546=145.7148148-145.7=2.3154-63.6+4.5554=182.1182182-182.1=-0.1148-63.6+4.5548=154.8173173-154.8=18.2150-63.6+4.5550=163.9166166-163.9=2.1144-63.6+4.5544=136.6109109-136.6=-27.6142-63.6+4.5542=127.5141141-127.5=13.5152-63.6+4.5552=173.0166166-173.0=-7.039 Notethatweuseyˆ[pronouncedy-hat]asthesymbolforpredictedvalues.Wemightalternativelyfitthesimpler(nointercept)model.Forthiswehavey=xb+ewhereeisarandomvariablewithmean0.TheXmatrixthenconsistsofasinglecolumn,thex’s.5.3.1ModelFormulaeinGeneralModelformulaetakeaformsuchas:y~x+z:lm,glm,,etc.y~x+fac+fac:x:lm,glm,aov,etc.(Iffacisafactorandxisavariable,fac:xallowsadifferentslopeforeachdifferentleveloffac.)ModelformulaearewidelyusedtosetupmostofthemodelcalculationsinR.Noticethesimilaritybetweenmodelformulaeandtheformulaethatareusedforspecifyingcoplots.Thus,recallthatthegraphformulaforacoplotthatgivesaplotofyagainstxforeachdifferentcombinationoflevelsoffac1(acrossthepage)andfac2(upthepage)is:y~x|fac1+fac2*5.3.2ManipulatingModelFormulaeModelformulaecanbeassigned,e.g.formyxz<-formula(y~x+z)orformyxz<-formula(“y~x+z”)Theargumenttoformula()can,asjustdemonstrated,beatextstring.Thismakesitstraightforwardtopastetheargumenttogetherfromcomponentsthatarestoredintextstrings.Forexample>names(elasticband)[1]"stretch""distance">nam<-names(elasticband)>formds<-formula(paste(nam[1],"~",nam[2]))>lm(formds,data=elasticband)Call:lm(formula=formds,data=elasticband)Coefficients:(Intercept)distance26.37800.1395Notethatgraphicsformulaecanbemanipulatedinexactlythesamewayasmodelformulae.5.4MultipleLinearRegressionModels5.4.1ThedataframeRubber26ThedatasetRubberfromtheMASSpackageisfromtheacceleratedtestingoftyrerubber.Thevariablesareloss(theabrasionlossingm/hr),hard(hardnessin`Shore’units),andtens(tensilestrengthinkg/sqm).Weobtainascatterplotmatrix(Figure20)thus:library(MASS)#ifneededdata(Rubber)#notneededforversions1.9.1andlater26TheoriginalsourceisO.L.Davies(1947)StatisticalMethodsinResearchandProduction.OliverandBoyd,Table6.1p.119.40 pairs(Rubber)507090003loss051050907hard05042002tens06102150150300120160200240Figure20:ScatterplotmatrixfortheRubberdataframefromtheMASSpackage.Thereisanegativecorrelationbetweenlossandhardness.Weproceedtoregresslossonhardandtens.Rubber.lm<-lm(loss~hard+tens,data=Rubber)>options(digits=3)>summary(Rubber.lm)Call:lm(formula=loss~hard+tens,data=Rubber)Residuals:Min1QMedian3QMax-79.38-14.613.8219.7565.98Coefficients:EstimateStd.ErrortvaluePr(>|t|)(Intercept)885.16161.75214.333.8e-14hard-6.5710.583-11.271.0e-11tens-1.3740.194-7.071.3e-07Residualstandarderror:36.5on27degreesoffreedomMultipleR-Squared:0.84,AdjustedR-squared:0.828F-statistic:71on2and27degreesoffreedom,p-value:1.77e-011Inadditiontotheuseofplot.lm(),notetheuseoftermplot().Figure21)usedthefollowingcode:par(mfrow=c(1,2))termplot(Rubber.lm,partial=TRUE,smooth=panel.smooth)par(mfrow=c(1,1))Thisplotraisesinterestingquestions.41 00100d1srnae0ht5rrof0oflliatiatrr0aaP0P01-05-002-5060708090120140160180200220240hardtensFigure21:Plot,obtainedwithtermplot(),showingthecontributionofeachofthetwotermsinthemodel,atthemeanofthecontributionsfortheotherterm.Asmoothcurvehas,ineachpanel,beenfittedthroughthepartialresiduals.Thereisaclearsuggestionthat,attheupperendoftherange,theresponseisnotlinearwithtensilestrength.5.4.2WeightsofBooksThebookstowhichthedatainthedatasetoddbooks(accompanyingthesenotes)referwerechosentocoverawiderangeofweighttoheightratios.Herearethedata:>oddbooksthickheightwidthweight11430.523.0107521529.120.594031827.518.562542323.215.240052421.614.055062523.515.560072819.712.645082819.812.645092917.310.5300103022.815.4690113617.811.0400124413.59.2250Noticethatasthicknessincreases,weightreduces.>logbooks<-log(oddbooks)#Wemightexpectweighttobe>#proportionaltothick*height*width>logbooks.lm1<-lm(weight~thick,data=logbooks)>summary(logbooks.lm1)$coefEstimateStd.ErrortvaluePr(>|t|)(Intercept)9.690.70813.78.35e-08thick-1.070.219-4.96.26e-04>logbooks.lm2<-lm(weight~thick+height,data=logbooks)>summary(logbooks.lm2)$coef42 EstimateStd.ErrortvaluePr(>|t|)(Intercept)-1.2633.552-0.3560.7303thick0.3130.4720.6620.5243height2.1140.6783.1170.0124>logbooks.lm3<-lm(weight~thick+height+width,data=logbooks)>summary(logbooks.lm3)$coefEstimateStd.ErrortvaluePr(>|t|)(Intercept)-0.7193.216-0.2240.829thick0.4650.4341.0700.316height0.1541.2730.1210.907width1.8771.0701.7550.117Soisweightproportionaltothick*height*width?Thecorrelationsbetweenthick,heightandwidtharesostrongthatifonetriestousemorethanoneofthemasaexplanatoryvariables,thecoefficientsareill-determined.Theycontainverysimilarinformation,asisevidentfromthescatterplotmatrix.Theregressionsonheightandwidthgiveplausibleresults,whilethecoefficientoftheregressiononthickisentirelyanartefactofthewaythatthebookswereselected.Thedesignofthedatacollectionreallyisimportantfortheinterpretationofcoefficientsfromaregressionequation.Eventhoughregressionequationsfromobservationaldatamayworkquitewellforpredictivepurposes,theindividualcoefficientsmaybemisleading.Thisismorethananacademicissue,astheanalysesin27Lalonde(1986)demonstrate.Theyhaddatafromexperimental“treatment”and“control”groups,andalsofromtwocomparablenon-experimental“controls”.Theregressionestimateofthetreatmenteffect,whencomparisonwaswithoneofthenon-experimentalcontrols,wasstatisticallysignificantbutwiththewrongsign!Theregressionshouldbefittedonlytothatpartofthedatawherevaluesofthecovariatesoverlapsubstantially.DehejiaandWahbademonstratetheuseofscores(“propensities”)thatmaybeusedbothtoidentifysubsetsthataredefensiblycomparable.Propensitiesvaluesarethentheonlycovariateintheequationthatestimatesthetreatmenteffect.5.5PolynomialandSplineRegressionWeshowhowcalculationsthathavethesamestructureasmultiplelinearregressionmaybeusedtomodelacurvilinearresponse.Webuildupcurvesfromlinearcombinationsoftransformedvalues.Awarningisthattheuseofpolynomialcurvesofhighdegreeareingeneralunsatisfactory.Splinecurves,constructedbyjoiningloworderpolynomialcurves(typicallycubics)insuchawaythattheslopechangessmoothly,areingeneralpreferable.5.5.1PolynomialTermsinLinearModels28Thedataframeseedratesthataccompaniesthesenotesgives,foreachofanumberofdifferentseedingrates,thenumberofbarleygrainperhead.plot(grain~rate,data=seedrates)#PlotthedataFigure22showsthedata,withfittedquadraticcurve:27DehejiaandWahba(1999)revisitLalonde’sdata,demonstratingtheuseofamethodologythatwasabletoreproduceresultssimilartotheexperimentalresults.28DataarefromMcLeod,C.C.(1982)Effectofratesofseedingonbarleygrownforgrain.NewZealandJournalofAgriculture10:133-136.SummarydetailsareinMaindonald,J.H.(1992).43 .012dae.00hr2epsinar.09G1.0816080100120140SeedingrateFigure22:Numberofgrainperheadversusseedingrate,forthebarleyseedingratedata,withfittedquadraticcurve.WewillneedanX-matrixwithacolumnofones,acolumnofvaluesofrate,andacolumnofvaluesof2rate.Forthis,bothrateandI(rate^2)mustbeincludedinthemodelformula.>seedrates.lm2<-lm(grain~rate+I(rate^2),data=seedrates)>summary(seedrates.lm2)Call:lm(formula=grain~rate+I(rate^2),data=seedrates)Residuals:123450.04571-0.122860.09429-0.00286-0.01429Coefficients:EstimateStd.ErrortvaluePr(>|t|)(Intercept)24.0600000.45569452.800.00036rate-0.0666860.009911-6.730.02138I(rate^2)0.0001710.0000493.500.07294Residualstandarderror:0.115on2degreesoffreedomMultipleR-Squared:0.996,AdjustedR-squared:0.992F-statistic:256on2and2degreesoffreedom,p-value:0.0039>hat<-predict(seedrates.lm2)>lines(spline(seedrates$rate,hat))>#Placingthesplinefitthroughthefittedpointsallowsasmoothcurve.>#Forthistoworkthevaluesofseedrates$ratemustbeordered.Again,checktheformofthemodelmatrix.Typein:>model.matrix(grain~rate+I(rate^2),data=seedrates)(Intercept)rateI(rate^2)1150250021755625311001000041125156255115022500attr(,"assign")44 [1]012Thisexampledemonstratesawaytoextendlinearmodelstohandlespecifictypesofnon-linearrelationships.We3canuseanytransformationwewishtoformcolumnsofthemodelmatrix.Wecould,ifwewished,addanxcolumn.Oncethemodelmatrixhasbeenformed,wearelimitedtotakinglinearcombinationsofcolumns.5.5.2Whatorderofpolynomial?Apolynomialofdegree2,i.e.aquadraticcurve,lookedaboutrightfortheabovedata.Howdoesonecheck?Onewayistofitpolynomials,e.g.ofeachofdegrees1and2,andcomparethemthus:>seedrates.lm1<-lm(grain~rate,data=seedrates)>seedrates.lm2<-lm(grain~rate+I(rate^2),data=seedrates)>anova(seedrates.lm2,seedrates.lm1)AnalysisofVarianceTableModel1:grain~rate+I(rate^2)Model2:grain~rateRes.DfRes.SumSqDfSumSqFvaluePr(>F)120.026286230.187000-1-0.16071412.2280.07294TheF-valueislarge,butonthisevidencetherearetoofewdegreesoffreedomtomakeatotallyconvincingcaseforpreferringaquadratictoaline.Howeverthepaperfromwhichthesedatacomegivesanindependentestimateoftheerrormeansquare(0.17on35d.f.)basedon8replicateresultsthatwereaveragedtogiveeachvaluefornumberofgrainsperhead.Ifwecomparethechangeinthesumofsquares(0.1607,on1df)withameansquare2of0.17(35df),theF-valueisnow5.4on1and35degreesoffreedom,andwehavep=0.024.TheincreaseinthenumberofdegreesoffreedommorethancompensatesforthereductionintheF-statistic.>#Howeverwehaveanindependentestimateoftheerrormean>#square.Theestimateis0.17^2,on35df.>1-pf(0.16/0.17^2,1,35)[1]0.02442FinallynotethatRwas0.972forthestraightlinemodel.Thismayseemgood,butgiventheaccuracyofthesedataitwasnotgoodenough!Thestatisticisaninadequateguidetowhetheramodelisadequate.Evenforany22onecontext,Rwillingeneralincreaseastherangeofthevaluesofthedependentvariableincreases.(Rislargerwhenthereismorevariationtobeexplained.)Apredictivemodelisadequatewhenthestandarderrorsof2predictedvaluesareacceptablysmall,notwhenRachievessomemagicthreshold.5.5.3PointwiseconfidenceboundsforthefittedcurveHereiscodethatwillgivepointwise95%confidencebounds.Notethatthesedonotcombinetogiveaconfidenceregionforthetotalcurve!Theconstructionofsucharegionisamuchmorecomplicatedtask!plot(grain~rate,data=seedrates,pch=16,xlim=c(50,175),ylim=c(15.5,22),xlab="Seedingrate",ylab="Grainsperhead")new.df<-data.frame(rate=c((4:14)*12.5))seedrates.lm2<-lm(grain~rate+I(rate^2),data=seedrates)pred2<-predict(seedrates.lm2,newdata=new.df,interval="confidence")hat2<-data.frame(fit=pred2[,"fit"],lower=pred2[,"lwr"],upper=pred2[,"upr"])attach(new.df)lines(rate,hat2$fit)lines(rate,hat2$lower,lty=2)lines(rate,hat2$upper,lty=2)detach(new.df)45 Theextrapolationhasdeliberatelybeentakenbeyondtherangeofthedata,inordertoshowhowtheconfidenceboundsspreadout.Confidenceboundsforafittedlinespreadoutmoreslowly,butareevenlessbelievable!5.5.4SplineTermsinLinearModelsBynow,readersofthisdocumentwillbeusedtotheideathatitispossibletouselinearmodelstofittermsthatmaybehighlynonlinearfunctionsofoneormoreofthevariables.Thefittingofpolynomialfunctionswasasimpleexampleofthis.Splinefunctionsvariablesextendthisideafurther.ThesplinesthatIdemonstrateareconstructedbyjoiningtogethercubiccurves,insuchawaythejoinsaresmooth.Theplaceswherethecubicsjoinareknownas`knots’.Itturnsoutthat,oncetheknotsarefixed,anddependingontheclassofsplinecurvesthatareused,splinefunctionsofavariablecanbeconstructedasalinearcombinationofbasisfunctions,whereeachbasisfunctionisatransformationofthevariable.Thedataframecarsisinthedatasetspackage.>data(cars)>plot(dist~speed,data=cars)>library(splines)>cars.lm<-lm(dist~bs(speed),data=cars)#Bydefault,therearenoknots>hat<-predict(cars.lm)>lines(cars$speed,hat,lty=3)#NBassumesvaluesofspeedaresorted>cars.lm5<-lm(dist~bs(speed,5),data=cars)#tryforacloserfit(1knot)>ci5<-predict(cars.lm5,interval="confidence",se.fit=T)>names(ci5)[1]"fit""se.fit""df""residual.scale">lines(cars$speed,ci5$fit[,"fit"])>lines(cars$speed,ci5$fit[,"lwr"],lty=2)>lines(cars$speed,ci5$fit[,"upr"],lty=2)5.6UsingFactorsinRModelsFactorsarecrucialforspecifyingRmodelsthatincludecategoricalor“factor”variables,.Considerdatafroman29experimentthatcomparedhouseswithandwithoutcavityinsulation.Whileonewouldnotusuallyhandlethesecalculationsusinganlmmodel,itmakesasimpleexampletoillustratethechoiceofabaselinelevel,andasetofcontrasts.Differentchoices,althoughtheyfitequivalentmodels,giveoutputinwhichsomeofthenumbersaredifferentandmustbeinterpreteddifferently.Webeginbyenteringthedatafromthecommandline:insulation<-factor(c(rep("without",8),rep("with",7)))#8without,then7with#`with’precedes`without’inalphanumericorder,&isthebaselinekWh<-c(10225,10689,14683,6584,8541,12086,12467,12669,9708,6700,4307,10315,8017,8162,8022)Toformulatethisasaregressionmodel,wetakekWhasthedependentvariable,andthefactorinsulationastheexplanatoryvariable.>insulation<-factor(c(rep("without",8),rep("with",7)))>#8without,then7with>kWh<-c(10225,10689,14683,6584,8541,12086,12467,+12669,9708,6700,4307,10315,8017,8162,8022)>insulation.lm<-lm(kWh~insulation)>summary(insulation.lm,corr=F)29DataarefromHand,D.J.;Daly,F.;Lunn,A.D.;Ostrowski,E.,eds.(1994).AHandbookofSmallDataSets.ChapmanandHall.46 Call:lm(formula=kWh~insulation)Residuals:Min1QMedian3QMax-4409-97913215753690Coefficients:EstimateStd.ErrortvaluePr(>|t|)(Intercept)78908749.035.8e-07insulation310311962.590.022Residualstandarderror:2310on13degreesoffreedomMultipleR-Squared:0.341,AdjustedR-squared:0.29F-statistic:6.73on1and13degreesoffreedom,p-value:0.0223Thep-valueis0.022,whichmaybetakentoindicate(p<0.05)thatwecandistinguishbetweenthetwotypesofhouses.Butwhatdoesthe“intercept”of7890mean,andwhatdoesthevaluefor“insulation”of3103mean?Tointerpretthis,weneedtoknowthatthefactorlevelsare,bydefault,takeninalphabeticalorder,andthattheinitiallevelistakenasthebaseline.Sowithcomesbeforewithout,andwithisthebaseline.Hence:AverageforInsulatedHouses=7980Togettheestimateforuninsulatedhousestake7980+3103=10993.Thestandarderrorofthedifferenceis1196.5.6.1TheModelMatrixItoftenhelpstokeepinmindthemodelmatrixorXmatrix.HerearetheXandtheythatareusedforthecalculations.Notethatthefirsteightdatavalueswereallwithouts:ContrastkWh79803103AddtogetComparewithResidual117980+3103=109931022510225-10993117980+3103=109931068910689-10993................107980+097089708-7980107980+067006700-7980................Typeinmodel.matrix(kWh~insulation)andcheckthatitgivestheabovemodelmatrix.*5.6.2OtherChoicesofContrastsThereareotherwaystosetuptheXmatrix.Intechnicaljargon,thereareotherchoicesofcontrasts.Oneobviousalternativeistomakewithoutthefirstfactorlevel,sothatitbecomesthebaseline.Forthis,specify:insulation<-relevel(insulation,baseline="without")#Make`without’thebaseline47 Anotherpossibilityistousewhatarecalledthe“sum”contrasts.Withthe“sum”contraststhebaselineisthemeanoverallfactorlevels.Theeffectforthefirstlevelisomitted;theuserhastocalculateitasminusthesum30oftheremainingeffects.Hereistheoutputfromuseofthe`sum’contrasts:>options(contrasts=c("contr.sum","contr.poly"),digits=2)#Trythe`sum’contrasts>insulation<-factor(insulation,levels=c("without","with"))#Make`without'thebaseline>insulation.lm<-lm(kWh~insulation)>summary(insulation.lm,corr=F)Call:lm(formula=kWh~insulation)Residuals:Min1QMedian3QMax-4409-97913215753690Coefficients:EstimateStd.ErrortvaluePr(>|t|)(Intercept)944259815.787.4e-10insulation15515982.590.022Residualstandarderror:2310on13degreesoffreedomMultipleR-Squared:0.341,AdjustedR-squared:0.29F-statistic:6.73on1and13degreesoffreedom,p-value:0.0223Hereistheinterpretation:averageof(meanfor“without”,“meanforwith”)=9442Togettheestimateforuninsulatedhouses(thefirstlevel),take9442+1551=10993The`effects’sumtoone.Sotheeffectforthesecondlevel(`with’)is-1551.Thustogettheestimateforinsulatedhouses(thefirstlevel),take9442-1551=7980.Thesumcontrastsaresometimescalled“analysisofvariance”contrasts.Youcansetthechoiceofcontrastsforeachfactorseparately,withastatementsuchas:insulation<-C(insulation,contr=treatment)AlsoavailablearetheHelmertcontrasts.Thesearenotatallintuitiveandrarelyhelpful,eventhoughS-PLUS31usesthemasthedefault.Novicesshouldavoidthem.30Thesecondstringelement,i.e."contr.poly",isthedefaultsettingforfactorswithorderedlevels.[Oneusesthefunctionordered()tocreateorderedfactors.]31Theinterpretationofthehelmertcontrastsissimpleenoughwhentherearejusttwolevels.With>2levels,thehelmertcontrastsgiveparameterestimateswhichingeneraldonotmakealotofsense,basicallybecausethebaselinekeepschanging,totheaverageforallpreviousfactorlevels.Youdobettertouseeitherthetreatmentcontrasts,orthesumcontrasts.Withthesumcontraststhebaselineistheoverallmean.S-PLUSmakeshelmertcontraststhedefault,perhapsforreasonsofcomputationalefficiency.Thiswasanunfortunatechoice.48 5.7MultipleLines–DifferentRegressionLinesforDifferentSpeciesThetermsthatappearontherightofthemodelformulamaybevariablesorfactors,orinteractionsbetweenvariablesandfactors,orinteractionsbetweenfactors.Herewetakeadvantageofthistofitdifferentlinestodifferentsubsetsofthedata.Intheexamplethatfollows,wehadweightsforaporpoisespecies(Stellenastyx)andforadolphinspecies(Delphinusdelphis).Wetakex1tobeavariablethathasthevalue0forDelphinusdelphis,and1forStellenastyx.Wetakex2tobebodyweight.Thenpossibilitieswemaywanttoconsiderare:A:Asingleline:y=a+bx2B:Twoparallellines:y=a1+a2x1+bx2[Forthefirstgroup(Stellenastyx;x1=0)theconstanttermisa1,whileforthesecondgroup(Delphinusdelphis;x1=1)theconstanttermisa1+a2.]C:Twoseparatelines:y=a1+a2x1+b1x2+b2x1x2[Forthefirstgroup(Delphinusdelphis;x1=0)theconstanttermisa1andtheslopeisb1.Forthesecondgroup(Stellenastyx;x1=1)theconstanttermisa1+a2,andtheslopeisb1+b2.]Weshowresultsfromfittingthefirsttwoofthesemodels,i.e.AandB:>plot(logheart~logweight,data=dolphins)#Plotthedata>options(digits=4)>cet.lm1<-lm(logheart~logweight,data=dolphins)>summary(cet.lm1,corr=F)Call:lm(formula=logheart~logweight,data=dolphins)Residuals:Min1QMedian3QMax-0.15874-0.082490.002740.049810.21858Coefficients:EstimateStd.ErrortvaluePr(>|t|)(Intercept)1.3250.5222.540.024logweight1.1330.1338.526.5e-07Residualstandarderror:0.111on14degreesoffreedomMultipleR-Squared:0.838,AdjustedR-squared:0.827F-statistic:72.6on1and14degreesoffreedom,p-value:6.51e-007FormodelB(parallellines)wehave>cet.lm2<-lm(logheart~factor(species)+logweight,data=dolphins)Checkwhatthemodelmatrixlookslike:>model.matrix(cet.lm2)(Intercept)factor(species)logweight1113.5552113.738....8103.989....16103.951attr(,"assign")[1]012attr(,"contrasts")49 [1]"contr.treatment"Entersummary(cet.lm2)togetanoutputsummary,andplot(cet.lm2)toplotdiagnosticinformationforthismodel.FormodelC,thestatementis:>cet.lm3<-lm(logheart~factor(species)+logweight+factor(species):logweight,data=dolphins)Checkwhatthemodelmatrixlookslike:>model.matrix(cet.lm3)(Intercept)factor(species)logweightfactor(species).logweight1113.5553.555....8103.9890.000....16103.9510.000attr(,"assign")[1]0123attr(,"contrasts")$"factor(species)"[1]"contr.treatment"NowseewhyoneshouldnotwastetimeonmodelC.>anova(cet.lm1,cet.lm2,cet.lm3)AnalysisofVarianceTableModel1:logheart~logweightModel2:logheart~factor(species)+logweightModel3:logheart~factor(species)+logweight+factor(species):logweightRes.DfRes.SumSqDfSumSqFvaluePr(>F)1140.17172130.095910.075810.280.00693120.094910.00100.120.73465.8aovmodels(AnalysisofVariance)Theclassofmodelsthatcanbedirectlyfittedasaovmodelsisquitelimited.Inessence,aovprovides,fordatawhereallcombinationsoffactorlevelshavethesamenumberofobservations,anotherviewofanlmmodel.Onecanhoweverspecifytheerrortermthatistobeusedintestingfortreatmenteffects.Seesection5.8.2below.Bydefault,Rusesthetreatmentcontrastsforfactors,i.e.thefirstlevelistakenasthebaselineorreferencelevel.Ausefulfunctionisrelevel().Theparameterrefcanbeusedtosetthelevelthatyouwantasthereferencelevel.5.8.1PlantGrowthExampleHereisasimplerandomisedblockdesign:>data(PlantGrowth)>attach(PlantGrowth)>boxplot(split(weight,group))#LooksOK>data()>PlantGrowth.aov<-aov(weight~group)>summary(PlantGrowth.aov)DfSumSqMeanSqFvaluePr(>F)group23.76631.88324.84610.0159150 Residuals2710.49210.3886>summary.lm(PlantGrowth.aov)Call:aov(formula=weight~group)Residuals:Min1QMedian3QMax-1.0710-0.4180-0.00600.26271.3690Coefficients:EstimateStd.ErrortvaluePr(>|t|)(Intercept)5.03200.197125.527<2e-16grouptrt1-0.37100.2788-1.3310.1944grouptrt20.49400.27881.7720.0877Residualstandarderror:0.6234on27degreesoffreedomMultipleR-Squared:0.2641,AdjustedR-squared:0.2096F-statistic:4.846on2and27degreesoffreedom,p-value:0.01591>help(cabbages)>data(cabbages)#FromtheMASSpackage>names(cabbages)[1]"Cult""Date""HeadWt""VitC">coplot(HeadWt~VitC|Cult+Date,data=cabbages)Examinationoftheplotsuggeststhatcultivarsdiffergreatlyinthevariabilityinheadweight.VariationinthevitaminClevelsseemsrelativelyconsistentbetweencultivars.>VitC.aov<-aov(VitC~Cult+Date,data=cabbages)>summary(VitC.aov)DfSumSqMeanSqFvaluePr(>F)Cult12496.152496.1553.04111.179e-09Date2909.30454.659.66090.0002486Residuals562635.4047.06*5.8.2ShadingofKiwifruitVinesThesedata(yieldsinkilograms)areinthedataframekiwishadethataccompaniesthesenotes.Theyarefrom32anexperimentwheretherewerefourtreatments-noshading,shadingfromAugusttoDecember,shadingfromDecembertoFebruary,andshadingfromFebruarytoMay.Eachtreatmentappearedonceineachofthethreeblocks.Thenorthernmostplotsweregroupedinoneblockbecausetheyweresimilarlyaffectedbyshadingfromthesun.Fortheremainingtwoblockssheltereffects,inonecasefromtheeastandintheothercasefromthewest,werethoughtmoreimportant.Resultsaregivenforeachofthefourvinesineachplot.Inexperimentaldesignparlance,thefourvineswithinaplotconstitutesubplots.Theblock:shademeansquare(sumofsquaresdividedbydegreesoffreedom)providestheerrorterm.(Ifthisisnotspecified,onestillgetsacorrectanalysisofvariancebreakdown.ButtheF-statisticsandp-valueswillbewrong.)32Datarelatetothepaper:Snelgar,W.P.,Manson.P.J.,Martin,P.J.1992.Influenceoftimeofshadingonfloweringandyieldofkiwifruitvines.JournalofHorticulturalScience67:481-487.Furtherdetails,includingadiagramshowingthelayoutofplotsandvinesanddetailsofshelter,areinMaindonald(1992).Thetwopapershavedifferentshorthands(e.g.Sept-NovversusAug-Dec)fordescribingthetimeperiodsforwhichtheshadingwasapplied.51 >kiwishade$shade<-relevel(kiwishade$shade,ref="none")>##Makesurethatthelevel“none”(noshade)isusedasreference>kiwishade.aov<-aov(yield~block+shade+Error(block:shade),data=kiwishade)>summary(kiwishade.aov)Error:block:shadeDfSumSqMeanSqFvaluePr(>F)block2172.3586.174.11760.074879shade31394.51464.8422.21120.001194Residuals6125.5720.93Error:WithinDfSumSqMeanSqFvaluePr(>F)Residuals36438.5812.18>coef(kiwishade.aov)(Intercept):(Intercept)96.5327block:shade:blocknorthblockwestshadeAug2DecshadeDec2FebshadeFeb2May0.993125-3.4300003.030833-10.281667-7.428333Within:numeric(0)5.9Exercises1.Herearetwosetsofdatathatwereobtainedthesameapparatus,includingthesamerubberband,asthedataframeelasticband.Forthedatasetelastic1,thevaluesare:stretch(mm):46,54,48,50,44,42,52distance(cm):183,217,189,208,178,150,249.Forthedatasetelastic2,thevaluesare:stretch(mm):25,45,35,40,55,5030,50,60distance(cm):71,196,127,187,249,217,114,228,291.Usingadifferentsymboland/oradifferentcolour,plotthedatafromthetwodataframeselastic1andelastic2onthesamegraph.Dothetwosetsofresultsappearconsistent.2.Foreachofthedatasetselastic1andelastic2,determinetheregressionofstretchondistance.In2eachcasedetermine(i)fittedvaluesandstandarderrorsoffittedvaluesand(ii)theRstatistic.Comparethetwosetsofresults.Whatisthekeydifferencebetweenthetwosetsofdata?3.Usingthedataframebeams(inthedatasetsaccompanyingthesenotes),carryoutaregressionofstrengthonSpecificGravityandMoisture.Carefullyexaminetheregressiondiagnosticplot,obtainedbysupplyingthenameofthelmobjectasthefirstparametertoplot().Whatdoesthisindicate?4.Usingthedataframecars(inthedatasetspackage),plotdistance(i.e.stoppingdistance)versusspeed.Fitalinetothisrelationship,andplottheline.Thentryfittingandplottingaquadraticcurve.Doesthequadraticcurvegiveausefulimprovementtothefit?Ifyouhavestudiedthedynamicsofparticles,canyoufindatheorythatwouldtellyouhowstoppingdistancemightchangewithspeed?5.Usingthedataframehills(inpackageMASS),regresstimeondistanceandclimb.Whatcanyoulearnfromthediagnosticplotsthatyougetwhenyouplotthelmobject?Tryalsoregressinglog(time)onlog(distance)andlog(climb).Whichoftheseregressionequationswouldyouprefer?5.Usethemethodofsection5.7todetermine,formally,whetheroneneedsdifferentregressionlinesforthetwodataframeselastic1andelastic2.52 6.Insection5.7checktheformofthemodelmatrix(i)forfittingtwoparallellinesand(ii)forfittingtwoarbitrarylineswhenoneusesthesumcontrasts.Repeattheexerciseforthehelmertcontrasts.7.Typehosp<-rep(c(”RNC”,”Hunter”,”Mater”),2)hospfhosp<-factor(hosp)levels(fhosp)Nowrepeatthestepsinvolvedinformingthefactorfhosp,thistimekeepingthefactorlevelsintheorderRNC,Hunter,Mater.Usecontrasts(fhosp)toformandprintoutthematrixofcontrasts.Dothisusinghelmertcontrasts,treatmentcontrasts,andsumcontrasts.Usinganoutcomevariabley<-c(2,5,8,10,3,9)fitthemodellm(y~fhosp),repeatingthefitforeachofthethreedifferentchoicesofcontrasts.Commentonwhatyouget.8.Forwhichchoice(s)ofcontrastsdotheparameterestimateschangewhenyoure-orderthefactorlevels?9.Inthedatasetcement(MASSpackage),examinethedependenceofy(amountofheatproduced)onx1,x2,x3andx4(whichareproportionsoffourconstituents).Beginbyexaminingthescatterplotmatrix.Astheexplanatoryvariablesareproportions,dotheyrequiretransformation,perhapsbytakinglog(x/(100-x))?Whatalternativestrategiesonemightusetofindaneffectivepredictionequation?10.Inthedatasetpressure(datasetspackage),examinethedependenceofpressureontemperature.[TransformationoftemperaturemakessenseonlyifonefirstconvertstodegreesKelvin.Considertransformationofpressure.Alogarithmictransformationistooextreme;thedirectionofthecurvaturechanges.Whatfamilyoftransformationsmightonetry?11.Modifythecodeinsection5.5.3tofit:(a)aline,withaccompanying95%confidencebounds,and(b)acubiccurve,withaccompanying95%pointwiseconfidencebounds.Whichofthethreepossibilities(line,quadratic,curve)ismostplausible?Cananyofthembetrusted?12.*Repeattheanalysisofthekiwishadedata(section5.8.2),butreplacingError(block:shade)withblock:shade.Commentontheoutputthatyougetfromsummary().Towhatextentisitpotentiallymisleading?Alsodotheanalysiswheretheblock:shadetermisomittedaltogether.Commentonthatanalysis.5.10ReferencesAtkinson,A.C.1986.Comment:Aspectsofdiagnosticregressionanalysis.StatisticalScience1,397–402.Atkinson,A.C.1988.TransformationsUnmasked.Technometrics30:311-318.Cook,R.D.andWeisberg,S.1999.AppliedRegressionincludingComputingandGraphics.Wiley.Dalgaard,P2002.IntroductoryStatisticswithR.Springer,NewYork.Dehejia,R.H.andWahba,S.1999.Causaleffectsinnon-experimentalstudies:re-evaluatingtheevaluationoftrainingprograms.JournaloftheAmericanStatisticalAssociation94:1053-1062.Fox,J2002.AnRandS-PLUSCompaniontoAppliedRegression.SageBooks.Harrell,F.E.,Lee,K.L.,andMark,D.B.1996.TutorialinBiostatistics.MultivariablePrognosticModels:IssuesinDevelopingModels,EvaluatingAssumptionsandAdequacy,andMeasuringandReducingErrors.StatisticsinMedicine15:361-387.Lalonde,R.1986.Evaluatingtheeconomicevaluationsoftrainingprograms.AmericanEconomicReview76:604-620.MaindonaldJH1992.Statisticaldesign,analysisandpresentationissues.NewZealandJournalofAgriculturalResearch35:121-141.MaindonaldJHandBraunWJ2003.DataAnalysisandGraphicsUsingR–AnExample-BasedApproach.CambridgeUniversityPress.Venables,W.N.andRipley,B.D.,4thedn2002.ModernAppliedStatisticswithS.Springer,NewYork.ndWeisberg,S.,2edn,1985.AppliedLinearRegression.Wiley.53 Williams,G.P.1983.Improperuseofregressionequationsintheearthsciences.Geology11:195-19754 6.MultivariateandTree-BasedMethods6.1MultivariateEDA,andPrincipalComponentsAnalysisPrincipalcomponentsanalysisisoftenausefulexploratorytoolformultivariatedata.Theideaistoreplacetheinitialsetofvariablesbyasmallnumberof“principalcomponents”thattogethermayexplainmostofthevariationinthedata.Thefirstprincipalcomponentisthecomponent(linearcombinationoftheinitialvariables)thatexplainsthegreatestpartofthevariation.Thesecondprincipalcomponentisthecomponentthat,amonglinearcombinationsofthevariablesthatareuncorrelatedwiththefirstprincipalcomponent,explainsthegreatestpartoftheremainingvariation,andsoon.Themeasureofvariationusedisthesumofthevariancesofvariables,perhapsafterscalingthevariablessothattheyeachhavevarianceone.Ananalysisthatworkswiththeunscaledvariables,andhencewiththevariance-covariancematrix,givesagreaterweighttovariablesthathavealargevariance.Thecommonalternative–scalingvariablessothattheyeachhavevarianceequaltoone–isequivalenttoworkingwiththecorrelationmatrix.Withbiologicalmeasurementdata,itisusuallydesirabletobeginbytakinglogarithms.Thestandarddeviationsthenmeasurethelogarithmofrelativechange.Becauseallvariablesmeasuremuchthesamequantity(i.e.relativevariability),andbecausethestandarddeviationsaretypicallyfairlycomparable,scalingtogiveequalvariancesisunnecessary.Thedatasetpossumthataccompaniesthesenoteshasninemorphometricmeasurementsoneachof10233mountainbrushtailpossums,trappedatsevensitesfromsouthernVictoriatocentralQueensland.Itisgoodpracticetobeginbyexaminingrelevantscatterplotmatrices.Thismaydrawattentiontogrosserrorsinthedata.Aplotinwhichthesitesand/orthesexesareidentifiedwilldrawattentiontoanyverystrongstructureinthedata.Forexampleonesitemaybequitedifferentfromtheothers,forsomeorallofthevariables.Takinglogarithmsofthesedatadoesnotmakemuchdifferencetotheappearancethattheypresentwhenplotted.Thisisbecausetheratiooflargesttosmallestvalueisrelativelysmall,nevermorethan1.6,forallvariables.Herearesomeofthescatterplotmatrixpossibilities:pairs(possum[,6:14],col=palette()[as.integer(possum$sex)])pairs(possum[,6:14],col=palette()[as.integer(possum$site)])here<-!is.na(possum$footlgth)#Weneedtoexcludemissingvaluesprint(sum(!here))#CheckhowmanyvaluesaremissingWenowlookatparticularviewsofthedatathatwegetfromaprincipalcomponentsanalysis:library(mva)#Loadx-variateanalysispackagepossum.prc<-princomp(log(possum[here,6:14]))#Principalcomponents#Printscoresonsecondpcversusscoresonfirstpc,#bypopulationsandsex,identifiedbysitecoplot(possum.prc$scores[,2]~possum.prc$scores[,1]|possum$Pop[here]+possum$sex[here],col=palette()[as.integer(possum$site)])Figure23,whichusesdifferentplotsymbolsfordifferentsites,usedthecode:coplot(possum.prc$scores[,2]~possum.prc$scores[,1]|possum$Pop[here]+possum$sex[here],pch=as.integer(possum$site))33Datarelatetothepaper:Lindenmayer,D.B.,Viggers,K.L.,Cunningham,R.B.,andDonnelly,C.F.1995.Morphologicalvariationamongcolumnsofthemountainbrushtailpossum,TrichosuruscaninusOgilby(Phalangeridae:Marsupiala).AustralianJournalofZoology43:449-458.55 Given:possum$Pop[here]0.51.01.52.02.5otherVic-0.3-0.10.10.35.2.10]0.]200erm.e,[2hs[exreo.2s$c0s-m$5.ucr1ss.p1.om0pu:snso.00f0.ep1ivG2.0-5.0-0.3-0.10.10.3possum.prc$scores[,1]Figure23:Secondprincipalcomponentversusfirstprincipalcomponent,bypopulationandbysex,forthepossumdata.6.2ClusterAnalysis34InthelanguageofRipley(1996),clusteranalysisisaformofunsupervisedclassification.Itis“unsupervised”becausetheclustersarenotknowninadvance.Therearetwotypesofalgorithms–algorithmsbasedonhierachicalagglomeration,andalgorithmsbasedoniterativerelocation.Inhierarchicalagglomerationeachobservationstartsasaseparategroup.Groupsthatare“close”tooneanotherarethensuccessivelymerged.Theoutputyieldsahierarchicalclusteringtreethatshowstherelationshipsbetweenobservationsandbetweentheclustersintowhichtheyaresuccessivelymerged.Ajudgementisthenneededonthepointatwhichfurthermergingisunwarranted.Initerativerelocation,thealgorithmstartswithaninitialclassification,thatitthentriestoimprove.Howdoesonegettheinitialclassification?Typically,byaprioruseofahierarchicalagglomerationalgorithm.Themvapackagehastheclusteranalysisroutines.Thefunctiondist()calculatesdistances.Thefunctionhclust()doeshierarchicalagglomerativeclustering,withachoiceofmethodsavailable.Thefunctionkmeans()(k-meansclustering)implementsiterativerelocation.6.3DiscriminantAnalysisWestartwithdatathatareclassifiedintoseveralgroups,andwantarulethatwillallowustopredictthegrouptowhichanewdatavaluewillbelong.InthelanguageofRipley(1996),ourinterestisinsupervisedclassification.Forexample,wemaywishtopredict,basedonprognosticmeasurementsandoutcomeinformationforpreviouspatients,whichfuturepatientswillremainfreeofdiseasesymptomsfortwelvemonthsormore.34Referencesareattheendofthechapter.56 Herearecalculationsforthepossumdataframe,usingthelda()functionfromtheVenables&RipleyMASSpackage.Ourinterestisinwhetheritispossible,onthebasisofmorphometricmeasurements,todistinguishanimalsfromdifferentsites.Acruderdistinctionisbetweenpopulations,i.e.sitesinVictoria(anAustralianstate)asopposedtositesinotherstates(NewSouthWalesorQueensland).Becauseithaslittleonthedistributionofvariablevalues,Ihavenotthoughtitnecessarytotakelogarithms.Idiscussthisfurtherbelow.>library(MASS)#Onlyifnotalreadyattached.>here<-!is.na(possum$footlgth)>possum.lda<-lda(site~hdlngth+skullw+totlngth++taillgth+footlgth+earconch+eye+chest+belly,data=possum,+subset=here)>options(digits=4)>possum.lda$svd#Examinethesingularvalues[1]15.75783.93723.18601.50781.14200.7772>>plot(possum.lda,dimen=3)>#Scatterplotmatrixforscoreson1st3canonicalvariates,asinFigure24-4-20266665555665566465667355656357365663566567667575754474574576567652774576774577657777733737374444774477477734373533077LD1222-112111124-211112111122221111111112121122111112211211111221121116-221111221111111111111118-4444442133111444411111117457411511111111111376565551715115116151613511111377575677711131557160112113337475537575547312115371335111177666LD2171716166122177772771212176667261662-222227565672222665577776666224-22225536621133556653513116565111116765365656661371661151111111237662613171161111117561761151112121217555212721555111011121143766626171111314LD3221111273453227311151432274772277471-27572547747744277525772-447744777731771--8-6-4-2024-3-2-10123Figure24:Scatterplotmatrixoffirstthreecanonicalvariates.Thesingularvaluesaretheratioofbetweentowithingroupsumsofsquares,forthecanonicalvariatesinturn.Clearlycanonicalvariatesafterthethirdwillhavelittleifanydiscriminatorypower.Onecanusepredict.lda()toget(amongotherinformation)scoresonthefirstfewcanonicalvariates.Notethattheremaybeinterpretativeadvantagesintakinglogarithmsofbiologicalmeasurementdata.Thestandardagainstwhichpatternsofmeasurementarecommonlycomparedisthatofallometricgrowth,whichimpliesalinearrelationshipbetweenthelogarithmsofthemeasurements.Differencesbetweendifferentsitesarethenindicativeofdifferentpatternsofallometricgrowth.Thereadermaywishtorepeattheaboveanalysis,butworkingwiththelogarithmsofmeasurements.Wheretherearetwogroups,logisticregressionisofteneffective.AsourceofcodeforhandlingmoregeneralsupervisedclassificationproblemsisHastieandTibshirani’smda(mixturediscriminantanalysis)package.ThereisabriefoverviewofthispackageintheVenablesandRipley`Complements’,referredtoinsection13.2.57 6.4DecisionTreemodels(Tree-basedmodels)Weincludetree-basedclassificationherebecauseitisamultivariatesupervisedclassification,ordiscrimination,method.Atree-basedregressionapproachisavailableforuseforregressionproblems.Tree-basedmethodsseemmoresuitedtobinaryregressionandclassificationthantoregressionwithanordinalorcontinuousdependentvariable.Tree-basedmodels,alsoknownas“ClassificationandRegressionTrees”(CART),maybesuitableforregressionandclassificationproblemswhenthereareextensivedata.Oneadvantageofsuchmethodsisthattheyautomaticallyhandlenon-linearityandinteractions.Outputincludesa“decisiontree”thatisimmediatelyusefulforprediction.library(rpart)data(fgl)#Forensicglassfragmentdata;fromMASSpackageglass.tree<-rpart(type~RI+Na+Mg+Al+Si+K+Ca+Ba+Fe,data=fgl)plot(glass.tree);text(glass.tree)summary(glass.tree)Tousethesemodelseffectively,youalsoneedtoknowaboutapproachestopruningtrees,andaboutcross-validation.Methodsforreductionoftreecomplexitythatarebasedonsignificancetestsateachindividualnode(i.e.branchingpoint)typicallychoosetreesthatover-predict.TheAtkinsonandTherneaurpart(recursivepartitioning)packageisclosertoCARTthanistheS-PLUStreelibrary.Itintegratescross-validationwiththealgorithmforformingtrees.6.5Exercises1.Usingthedatasetpainters(MASSpackage),applyprincipalcomponentsanalysistothescoresforComposition,Drawing,Colour,andExpression.Examinetheloadingsonthefirstthreeprincipalcomponents.Plotascatterplotmatrixofthefirstthreeprincipalcomponents,usingdifferentcoloursorsymbolstoidentifythedifferentschools.2.ThedatasetCars93isintheMASSpackage.Usingthecolumnsofcontinuousorordinaldata,determinescoresonthefirstandsecondprincipalcomponents.Investigatethecomparisonbetween(i)USAandnon-USAcars,and(ii)thesixdifferenttypes(Type)ofcar.Nowcreateanewdatasetinwhichbinaryfactorsbecomecolumnsof0/1data,andincludetheseintheprincipalcomponentsanalysis.3.Repeatthecalculationsofexercises1and2,butthistimeusingthefunctionlda()fromtheMASSpackagetoderivecanonicaldiscriminantscores,asinsection6.3.4.TheMASSpackagehastheAids2dataset,containingde-identifieddataonthesurvivalstatusofpatientsdiagnosedwithAIDSbeforeJuly11991.Usetree-basedclassification(rpart())toidentifymajorinfluencesonsurvival.355.Investigatediscriminationbetweenplagiotropicandorthotropicspeciesinthedatasetleafshape.6.6ReferencesChambers,J.M.andHastie,T.J.1992.StatisticalModelsinS.WadsworthandBrooksColeAdvancedBooksandSoftware,PacificGroveCA.Friedman,J.,Hastie,T.andTibshirani,R.(1998).Additivelogisticregression:Astatisticalviewofboosting.Availablefromtheinternet.MaindonaldJHandBraunWJ2003.DataAnalysisandGraphicsUsingR–AnExample-BasedApproach.CambridgeUniversityPress.Ripley,B.D.1996.PatternRecognitionandNeuralNetworks.CambridgeUniversityPress,CambridgeUK.Therneau,T.M.andAtkinson,E.J.1997.AnIntroductiontoRecursivePartitioningUsingtheRPARTRoutines.Thisisoneoftwodocumentsincludedin:http://www.stats.ox.ac.uk/pub/SWin/rpartdoc.zipVenables,W.N.andRipley,B.D.,2ndedn1997.ModernAppliedStatisticswithS-Plus.Springer,NewYork.35Datarelatetothepaper:King.D.A.andMaindonald,J.H.1999.Treearchitectureinrelationtoleafdimensionsandtreestatureintemperateandtropicalrainforests.JournalofEcology87:1012-1024.58 *7.RDataStructures7.1Vectors36Recallthatvectorsmayhavemodelogical,numericorcharacter.7.1.1SubsetsofVectorsRecall(section2.6.2)twocommonwaystoextractsubsetsofvectors:1.Specifythenumbersoftheelementsthataretobeextracted.Onecanusenegativenumberstoomitelements.2.Specifyavectoroflogicalvalues.TheelementsthatareextractedarethoseforwhichthelogicalvalueisT.Thussupposewewanttoextractvaluesofxthataregreaterthan10.Thefollowingdemonstratesathirdpossibility,forvectorsthathavenamedelements:>c(Andreas=178,John=185,Jeff=183)[c("John","Jeff")]JohnJeff185183Avectorofnameshasbeenusedtoextracttheelements.7.1.2PatternedDataUse5:15togeneratethenumbers5,6,…,15.Entering15:5willgeneratethesequenceinthereverseorder.Torepeatthesequence(2,3,5)fourtimesover,enterrep(c(2,3,5),4)thus:>rep(c(2,3,5),4)[1]235235235235>Ifinsteadonewantsfour2s,thenfour3s,thenfour5s,enterrep(c(2,3,5),c(4,4,4)).>rep(c(2,3,5),c(4,4,4))#Analternativeisrep(c(2,3,5),each=4)[1]222233335555Notefurtherthat,inplaceofc(4,4,4)wecouldwriterep(4,3).Soafurtherpossibilityisthatinplaceofrep(c(2,3,5),c(4,4,4))wecouldenterrep(c(2,3,5),rep(4,3)).Inadditiontotheabove,notethatthefunctionrep()hasanargumentlength.out,meaning“keeponrepeatingthesequenceuntilthelengthislength.out.”7.2MissingValuesInR,themissingvaluesymbolisNA.AnyarithmeticoperationorrelationthatinvolvesNAgeneratesanNA.Thisappliesalsototherelations<,<=,>,>=,==,!=.Thefirstfourcomparemagnitudes,==testsforequality,and!=testsforinequality.UserswhodonotcarefullyconsiderimplicationsforexpressionsthatincludeNasmaybepuzzledbytheresults.Specifically,notethatx==NAgeneratesNA.Besuretouseis.na(x)totestwhichvaluesofxareNA.Asx==NAgivesavectorofNAs,yougetnoinformationatallaboutx.Forexample>x<-c(1,6,2,NA)>is.na(x)#TRUEforwhenNAappears,andotherwiseFALSE[1]FALSEFALSEFALSETRUE>x==NA#AllelementsaresettoNA[1]NANANANA>NA==NA[1]NA36Below,wewillmeetthenotionof“class”,whichisimportantforsomeofthemoresophisticatedlanguagefeaturesofR.59 WARNING:ThisischieflyforthosewhomaymovebetweenRandS-PLUS.Inimportantrespects,R’sbehaviourwithmissingvaluesismoreintuitivethanthatofS-PLUS.ThusinRy[x>2]<-x[x>2]givestheresultthatthenaïveusermightexpect,i.e.replaceelementsofywithcorrespondingelementsofxwhereverx>2.Whereverx>2givestheresultNA,noactionistaken.InR,anyNAinx>2yieldsavalueofNAfory[x>2]ontheleftoftheequation,andavalueofNAforx[x>2]ontherightoftheequation.InS-PLUS,theresultontherightisthesame,i.e.anNA.However,ontheleft,elementsthathaveasubscriptNAdropout.Thevectoronthelefttowhichvalueswillbeassignedhas,asaresult,fewerelementsthanthevectorontheright.ThusthefollowinghastheeffectinRthatthenaïveusermightexpect,butnotinS-PLUS:x<-c(1,6,2,NA,10)y<-c(1,4,2,3,0)y[x>2]<-x[x>2]yInS-PLUSitisessentialtospecify,intheexamplejustconsidered:y[!is.na(x)&x>2]<-x[!is.na(x)&x>2]HereisafurtherexampleofR’sbehaviour:>x<-c(1,6,2,NA,10)>x>2[1]FALSETRUEFALSENATRUE>x[x>3]<-c(21,22)#Now,explaintheresultthatfollowsWarningmessage:numberofitemstoreplaceisnotamultipleofreplacementlength>x[1]1212NA21Thesafeway,inbothS-PLUSandR,istouse!is.na(x)tolimittheselection,ononeorbothsidesasnecessary,tothoseelementsofxthatarenotNAs.Wewillhavemoretosayonmissingvaluesinthesectionondataframesthatnowfollows.7.3DataframesTheconceptofadataframeisfundamentaltotheuseofmostoftheRmodellingandgraphicsfunctions.Adataframeisageneralisationofamatrix,inwhichdifferentcolumnsmayhavedifferentmodes.Allelementsofanycolumnmusthoweverhavethesamemode,i.e.allnumericorallfactor,orallcharacter.Dataframeswhereallcolumnsholdnumericdatahavesome,butnotall,ofthepropertiesofmatrices.Thereareimportantdifferencesthatarisebecausedataframesareimplementedaslists.Toturnadataframeofnumericdataintoamatrixofnumericdata,useas.matrix().Listsarediscussedbelow,insection7.6.7.3.1ExtractionofComponentPartsofDataframesConsiderthedataframebarleythataccompaniesthelatticepackage:>names(barley)[1]"yield""variety""year""site">levels(barley$site)[1]"GrandRapids""Duluth""UniversityFarm""Morris"[5]"Crookston""Waseca"Wewillextractthedatafor1932,attheDuluthsite.>Duluth1932<-barley[barley$year=="1932"&barley$site=="Duluth",60 +c("variety","yield")]varietyyield66Manchuria22.5666772Glabron25.8666778Svansota22.2333384Velvet22.4666790Trebi30.6000096No.45722.70000102No.46222.50000108Peatland31.36667114No.47527.36667120WisconsinNo.3829.33333Thefirstcolumnholdstherowlabels,whichinthiscasearethenumbersoftherowsthathavebeenextracted.Inplaceofc("variety","yield")wecouldhavewritten,moresimply,c(2,4).7.3.2DataSetsthatAccompanyRPackagesTypeindata()togetalistofdatasets(mostlydataframes)associatedwithallpackagesthatareinthecurrentsearchpath.Togetinformationonthedatasetsthatareincludedinthedatasetspackage,specifydata(package="datasets")andsimilarlyforanyotherpackage.InversionsofRpreviousto2.0.0,itisusuallynecessarytospecificallybringanyofthesedataframesintotheworkingdirectory.(Ensurethoughthattherelevantpackageisattached.)Thustobringinthedatasetairquality(datasetspackage),typedata(airquality)ThedefaultWindowsdistributionincludesmanycommonlyrequiredpackages.Otherpackagesmustbeexplicitlyinstalled.Forremainingsectionsofthesenotes,theMASSpackage,whichcomeswiththedefaultdistributution,willbeusedfromtimetotime.Thebasepackage,andseveralotherpackages,areautomaticallyattachedatthebeginningofthesession.Toattachanyotherinstalledpackage,usethelibrary()command.7.4DataEntryThefunctionread.table()offersareadymeanstoreadarectangulararrayintoanRdataframe.Supposethatthefileprimates.datcontains:"Potarmonkey"10115Gorilla207406Human621320"Rhesusmonkey"6.8179Chimp52.2440Thenprimates<-read.table("a:/primates.txt")willcreatethedataframeprimates,fromafileonthea:drive.Thetextstringsinthefirstcolumnwillbecomethefirstcolumninthedataframe.Supposethatprimatesisadataframewiththreecolumns–speciesname,bodyweight,andbrainweight.Youcangivethecolumnsnamesbytypingin:names(primates)<-c(“Species”,"Bodywt","Brainwt")Herethenarethecontentsofthedataframe.>primatesSpeciesBodywtBrainwt1Potarmonkey10.011561 2Gorilla207.04063Human62.013204Rhesusmonkey6.81795Chimp52.2440Specifyheader=TRUEifthereisaninitialhowofheaderinformation.Ifthenumberofheadersisonelessthanthenumberofcolumnsofdata,thenthefirstcolumnwillbeused,providingentriesareunique,forrowlabels.7.4.1IdiosyncrasiesThefunctionread.table()isstraightforwardforreadinginrectangulararraysofdatathatareentirelynumeric.When,asintheaboveexample,oneofthecolumnscontainstextstrings,thecolumnisbydefault37storedasafactorwithasmanydifferentlevelsasthereareuniquetextstrings.ProblemsmayarisewhensmallmistakesinthedatacauseRtointerpretacolumnofsupposedlynumericdataascharacterstrings,whichareautomaticallyturnedintofactors.ForexampletheremaybeanO(oh)somewherewherethereshouldbea0(zero),oranel(l)wherethereshouldbeaone(1).Ifyouuseanymissingvaluesymbolsotherthanthedefault(NA),youneedtomakethisexplicitseesection7.3.2below.Otherwiseanyappearanceofsuchsymbolsas*,period(.)andblank(inacasewheretheseparatorissomethingotherthanaspace)willcausetowholecolumntobetreatedascharacterdata.Userswhofindthisdefaultbehaviourofread.table()confusingmaywishtousetheparametersetting38as.is=TRUE.Ifthecolumnislaterrequiredforuseasafactorinamodelorgraphicsformula,itmaybenecessarytomakeitintoafactoratthattime.Somefunctionsdothisconversionautomatically.7.4.2Missingvalueswhenusingread.table()Thefunctionread.table()expectsmissingvaluestobecodedasNA,unlessyousetna.stringstorecogniseothercharactersasmissingvalueindicators.IfyouhaveatextfilethathasbeenoutputfromSAS,youwillprobablywanttosetna.strings=c(".").Theremaybemultiplemissingvalueindicators,e.g.na.strings=c(“NA”,".",”*”,"").The""willensurethatemptycellsareenteredasNAs.7.4.3Separatorswhenusingread.table()39Withdatafromspreadsheets,itissometimesnecessarytousetab(“t”)orcommaastheseparator.Thedefaultseparatoriswhitespace.Tosettabastheseparator,specifysep="t".7.5FactorsandOrderedFactorsWediscussedfactorsinsection2.6.4.Theyprovideaneconomicalwaytostorevectorsofcharacterstringsinwhichtherearemanymultipleoccurrencesofthesamestrings.Morecrucially,theyhaveacentralroleintheincorporationofqualitativeeffectsintomodelandgraphicsformulae.Factorshaveadualidentity.Theyarestoredasintegervectors,witheachofthevaluesinterpretedaccordingto40theinformationthatisinthetableoflevels.37Storageofcolumnsofcharacterstringsasfactorsisefficientwhenasmallnumberofdistinctstringsareeachrepeatedalargenumberoftimes.38Specifyingas.is=Tpreventscolumnsof(intendedorunintended)characterstringsfrombeingconvertedintofactors.39OnewaytogetmixedtextandnumericdataacrossfromExcelistosavetheworksheetina.csvtextfilewithcommaastheseparator.Ifforexamplefilenameismyfile.csvandisondrivea:,useread.table("a:/myfile.csv",sep=",")toreadthedataintoR.Thiscopeswithanyspaceswhichmayappearintextstrings.[Butwatchthatnoneofthecellentriesincludecommas.]40Factorsarevectorswhichhavemodenumericandclass“factor”.Theyhaveanattributelevelsthatholdsthelevelnames.62 Thedataframeislandcitiesthataccompaniesthesenotesholdsthepopulationsofthe19islandnationcitieswitha1995urbancentrepopulationof1.4millionormore.Therownamesarethecitynames,thefirstcolumn(country)hasthenameofthecountry,andthesecondcolumn(population)hastheurbancentrepopulation,inmillions.HereisatablethatgivesthenumberoftimeseachcountryoccursAustraliaCubaIndonesiaJapanPhilippinesTaiwanUnitedKingdom3146212[Thereare19citiesinall.]Printingthecontentsofthecolumnwiththenamecountrygivesthenames,nottheintegervalues.Asinmostoperationswithfactors,Rdoesthetranslationinvisibly.Therearethoughannoyingexceptionsthatcanmaketheuseoffactorstricky.Tobesureofgettingthecountrynames,specifyas.character(islandcities$country)Togettheintegervalues,specifyunclass(islandcities$country)Bydefault,Rsortsthelevelnamesinalphabeticalorder.Ifweformatablethathasthenumberoftimesthateachcountryappears,thisistheorderthatisused:>table(islandcities$country)AustraliaCubaIndonesiaJapanPhilippinesTaiwanUnitedKingdom3146212Thisorderofthelevelnamesispurelyaconvenience.Wemightprefercountriestoappearinorderoflatitude,fromNorthtoSouth.Wecanchangetheorderofthelevelnamestoreflectthisdesiredorder:>lev<-levels(islandcities$country)>lev[c(7,4,6,2,5,3,1)][1]"UnitedKingdom""Japan""Taiwan""Cuba"[5]"Philippines""Indonesia""Australia">country<-factor(islandcities$country,levels=lev[c(7,4,6,2,5,3,1)])>table(country)UnitedKingdomJapanTaiwanCubaPhilippinesIndonesiaAustralia2611243Inorderedfactors,i.e.factorswithorderedlevels,thereareinequalitiesthatrelatefactorlevels.Factorshavethepotentialtocauseafewsurprises,sobecareful!Herearetwopointstonote:1.Whenavectorofcharacterstringsbecomesacolumnofadataframe,Rbydefaultturnsitintoafactor.EnclosethevectorofcharacterstringsinthewrapperfunctionI()ifitistoremaincharacter.2.Therearesomecontextsinwhichfactorsbecomenumericvectors.Tobesureofgettingthevectoroftextstrings,specifye.g.as.character(country).3.Toextractthenumericlevels1,2,3,…,specifyas.numeric(country).7.6OrderedFactorsActually,itistheirlevelsthatareordered.Tocreateanorderedfactor,ortoturnafactorintoanorderedfactor,usethefunctionordered().Thelevelsofanorderedfactorareassumedtospecifypositionsonanordinalscale.Try>stress.level<-rep(c("low","medium","high"),2)>ordf.stress<-ordered(stress.level,levels=c("low","medium","high"))>ordf.stress[1]lowmediumhighlowmediumhighLevels:lowordf.stress<"medium"[1]TRUEFALSEFALSETRUEFALSEFALSE>ordf.stress>="medium"[1]FALSETRUETRUEFALSETRUETRUE63 Laterwewillmeetthenotionofinheritance.Orderedfactorsinherittheattributesoffactors,andhaveafurtherorderingattribute.Whenyouaskfortheclassofanobject,yougetdetailsbothoftheclassoftheobject,andofanyclassesfromwhichitinherits.Thus:>class(ordf.stress)[1]"ordered""factor"7.7ListsListsmakeitpossibletocollectanarbitrarysetofRobjectstogetherunderasinglename.Youmightforexamplecollecttogethervectorsofseveraldifferentmodesandlengths,scalars,matricesormoregeneralarrays,functions,etc.Listscanbe,andoftenare,arag-tagofdifferentobjects.WewilluseforillustrationthelistobjectthatRcreatesasoutputfromanlmcalculation.Forexample,considerthelinearmodel(lm)objectelastic.lm(c.f.sections1.1.4and2.1.4)createdthus:elastic.lm<-lm(distance~stretch,data=elasticband)Itisreadilyverifiedthatelastic.lmconsistsofavarietyofdifferentkindsofobjects,storedasalist.Youcangetthenamesoftheseobjectsbytypingin>names(elastic.lm)[1]"coefficients""residuals""effects""rank"[5]"fitted.values""assign""qr""df.residual"[9]"xlevels""call""terms""model"Thefirstlistelementis:>elastic.lm$coefficients(Intercept)stretch-63.5714294.553571Alternativewaystoextractthisfirstlistelementare:elastic.lm[["coefficients"]]elastic.lm[[1]]Wecanalternativelyaskforthesublistwhoseonlyelementisthevectorelastic.lm$coefficients.Forthis,specifyelastic.lm[“coefficients”]orelastic.lm[1].Thereisasubtledifferenceintheresultthatisprintedout.Theinformationisprecededby$coefficients,meaning“listelementwithnamecoefficients”.>elastic.lm[1]$coefficients(Intercept)stretch-63.5714294.553571Thesecondlistelementisavectoroflength7>options(digits=3)>elastic.lm$residuals12345672.107-0.32118.0001.893-27.78613.321-7.214Thetenthlistelementdocumentsthefunctioncall:>elastic.lm$calllm(formula=distance~stretch,data=elasticband)>mode(elastic.lm$call)[1]"call"64 *7.8MatricesandArraysInthesenotestheuseofmatricesandarrayswillbequitelimited.Foralmosteverythingwedohere,dataframeshavemoregeneralrelevance,andachievewhatwerequire.Matricesarelikelytobeimportantforthoseuserswhowishtoimplementnewregressionandmultivariatemethods.Allelementsofamatrixhavethesamemode,i.e.allnumeric,orallcharacter.Thusamatrixisamorerestrictedstructurethanadataframe.Onereasonfornumericmatricesisthattheyallowavarietyofmathematicaloperationsthatarenotavailablefordataframes.Anotherreasonisthatmatrixgeneralisestoarray,whichmayhavemorethantwodimensions.Notethatmatricesarestoredcolumnwise.Thusconsider>xx<-matrix(1:6,ncol=3)#Equivalently,entermatrix(1:6,nrow=2)>xx[,1][,2][,3][1,]135[2,]246Ifxxisanymatrix,theassignmentx<-as.vector(xx)placescolumnsofxx,inorder,intothevectorx.Intheexampleabove,wegetbacktheelements1,2,...,6.Matriceshavetheattribute“dimension”.Thus>dim(xx)[1]23Infactamatrixisavector(numericorcharacter)whosedimensionattributehaslength2.Nowset>x34<-matrix(1:12,ncol=4)>x34[,1][,2][,3][,4][1,]14710[2,]25811[3,]36912Hereareexamplesoftheextractionofcolumnsorrowsorsubmatricesx34[2:3,c(1,4)]#Extractrows2&3&columns1&4x34[2,]#Extractthesecondrowx34[-2,]#Extractallrowsexceptthesecondx34[-2,-3]#Extractthematrixobtainedbyomittingrow2&column3Thedimnames()functionassignsand/orextractsmatrixrowandcolumnnames.Thedimnames()functiongivesalist,inwhichthefirstlistelementisthevectorofrownames,andthesecondlistelementisthevectorofcolumnnames.Thisgeneralisesintheobviouswayforusewitharrays,whichwenowdiscuss.7.8.1ArraysThegeneralisationfromamatrix(2dimensions)toallow>2dimensionsgivesanarray.Amatrixisa2-dimensionalarray.Consideranumericvectoroflength24.Sothatwecaneasilykeeptrackoftheelements,wewillmakethem1,2,..,24.Thusx<-1:24Thendim(x)<-c(4,6)turnsthisintoa4x6matrix.>x[,1][,2][,3][,4][,5][,6]65 [1,]159131721[2,]2610141822[3,]3711151923[4,]4812162024Nowtry>dim(x)<-c(3,4,2)>x,,1[,1][,2][,3][,4][1,]14710[2,]25811[3,]36912,,2[,1][,2][,3][,4][1,]13161922[2,]14172023[3,]151821247.8.2ConversionofNumericDataframesintoMatricesTherearevariousmanipulationsthatareavailableformatrices,butnotfordataframes.Useas.matrix()tohandleanyconversionthatmaybenecessary.7.9Exercises1.Generatethenumbers101,102,…,112,andstoretheresultinthevectorx.2.Generatefourrepeatsofthesequenceofnumbers(4,6,3).3.Generatethesequenceconsistingofeight4s,thenseven6s,andfinallynine3s.4.Createavectorconsistingofone1,thentwo2’s,three3’s,etc.,andendingwithnine9’s.5.Determine,foreachofthecolumnsofthedataframeairquality(datasetspackage),themedian,mean,upperandlowerquartiles,andrange.6.Foreachofthefollowingcalculations,whatyouwouldexpect?Checktoseeifyouwereright!a)answer<-c(2,7,1,5,12,3,4)for(jin2:length(answer)){answer[j]<-max(answer[j],answer[j-1])}b)answer<-c(2,7,1,5,12,3,4)for(jin2:length(answer)){answer[j]<-sum(answer[j],answer[j-1])}7.Inthebuilt-indataframeairquality(a)extracttheroworrowsforwhichOzonehasitsmaximumvalue;and(b)extractthevectorofvaluesofWindforvaluesofOzonethatareabovetheupperquartile.8.RefertotheEurasiansnowdatathatisgiveninExercise1.6.Findthemeanofthesnowcover(a)fortheodd-numberedyearsand(b)fortheeven-numberedyears.9.DeterminewhichcolumnsofthedataframeCars93(MASSpackage)arefactors.Foreachofthesefactorcolumns,printoutthelevelsvector.Whichoftheseareorderedfactors?10.Usesummary()togetinformationaboutdatainthedataframesairquality,attitude(bothinthedatasetspackage),andcpus(MASSpackage).Writebriefnotes,foreachofthesedatasets,onwhatyouhavebeenabletolearn.11.Fromthedataframemtcars(MASSpackage)extractadataframemtcars6thatholdsonlytheinformationforcarswith6cylinders.66 12.FromthedataframeCars93(MASSpackage)extractadataframewhichholdsonlyinformationforsmallandsportycars.13.Storethenumbersobtainedinexercise2,inorder,inthecolumnsofa3x4matrix.14.Storethenumbersobtainedinexercise3,inorder,inthecolumnsofa6by4matrix.67 8.UsefulFunctions8.1ConfidenceIntervalsandTestsUsethehelptogetcompleteinformation.Below,Inotetwoofthesimplerfunctions.8.1.1Thet-testandassociatedconfidenceintervalUset.test().Thisallowsbothaone-sampleandatwo-sampletest.8.1.2Chi-Squaretestsfortwo-waytablesUsechisq.test()foratestfornoassociationbetweenrowsandcolumnsintheoutputfromtable().Alternatively,theargumentmaybeamatrix.Thistestthatcountsenterindependentlyintothecellsofatable.Forexample,thetestisinvalidifthereisclusteringinthedata.8.2MatchingandOrdering>match(,)##Foreachelementof,returnsthe##positionofthefirstoccurrencein>order()##Returnsthevectorofsubscriptsgiving##theorderinwhichelementsmustbetaken##sothatwillbesorted.>rank()##Returnstheranksofthesuccessiveelements.Numericvectorswillbesortedinnumericalorder.Charactervectorswillbesortedinalphanumericorder.Theoperator%in%canbehighlyusefulinpickingoutsubsetsofdata.Forexample:>x<-rep(1:5,rep(3,5))>x[1]111222333444555>two4<-x%in%c(2,4)>two4[1]FALSEFALSEFALSETRUETRUETRUEFALSEFALSEFALSETRUE[11]TRUETRUEFALSEFALSEFALSE>#Nowpickoutthe2sandthe4s>x[two4][1]2224448.3StringFunctionssubstring(,,)nchar()##Returnsvectorofnumberofcharactersineachelement.*8.3.1OperationswithVectorsofTextStrings–AFurtherExampleWewillworkwiththecolumnMakeinthedatasetCars93fromtheMASSpackage.library(MASS)#ifneededdata(Cars93)#ifneededToextractthefirstpartofthename,uptothefirstspace,specifycar.brandnames<-substring(Cars93$Make,1,nblank-1)>car.brandnames[1:5][1]"Acura""Acura""Audi""Audi""BMW"68 Tofindthepositionatwhichthefirstspaceappears,wemightdothefollowing:nblank<-sapply(Cars93$Make,function(x){n<-nchar(x);a<-substring(x,1:n,1:n);m<-match("",a,nomatch=1);m})8.4ApplicationofaFunctiontotheColumnsofanArrayorDataFrameapply(,,)lapply(,)##N.B.Adataframeisalist.Outputisalist.sapply(,)##Aslapply(),butsimplify(e.g.toavector##ormatrix),ifpossible.8.4.1apply()Thefunctionapply()canbeusedondataframesaswellasmatrices.Hereisanexample:>apply(airquality,2,mean)#Allelementsmustbenumeric!OzoneSolar.RWindTempMonthDayNANA9.9677.886.9915.80>apply(airquality,2,mean,na.rm=T)OzoneSolar.RWindTempMonthDay42.13185.939.9677.886.9915.80Theuseofapply(airquality,1,mean)willgivemeansforeachrow.Thesearenot,forthesedata,usefulinformation!8.4.2sapply()Thefunctionsapply()canbeusefulforgettinginformationaboutthecolumnsofadataframe.Hereweuseittocountthatnumberofmissingvaluesineachcolumnofthebuilt-indataframeairquality.>sapply(airquality,function(x)sum(is.na(x)))OzoneSolar.RWindTempMonthDay3770000Hereareseveralfurtherexamplesthatusethedataframemothsthataccompaniesthesenotes:>sapply(moths,is.factor)#DeterminewhichcolumnsarefactorsmetersAPhabitatFALSEFALSEFALSETRUE>#Howmanylevelsdoeseachfactorhave?>sapply(moths,function(x)if(!is.factor(x))return(0)elselength(levels(x)))metersAPhabitat0008*8.5aggregate()andtapply()Theargumentsareineachcaseavariable,alistoffactors,andafunctionthatoperatesonavectortoreturnasinglevalue.Foreachcombinationoffactorlevels,thefunctionisappliedtocorrespondingvaluesofthevariable.Thefunctionaggregate()returnsadataframe.Forexample:>str(cabbages)`data.frame':60obs.of4variables:$Cult:Factorw/2levels"c39","c52":1111111111...$Date:Factorw/3levels"d16","d20","d21":1111111111...$HeadWt:num2.52.23.14.32.54.33.84.31.73.1...$VitC:int51554542535050525649...>attach(cabbages)>aggregate(HeadWt,by=list(Cult=Cult,Date=Date),FUN=mean)69 CultDatex1c39d163.182c52d162.263c39d202.804c52d203.115c39d212.746c52d211.47Thesyntaxfortapply()issimilar,exceptthatthenameofthesecondargumentisINDEXratherthanby.Theoutputisanarraywithasmanydimensionsastherearefactors.Wheretherearenodatavaluesforaparticularcombinationoffactorlevels,NAisreturned.*8.7MergingDataFramesThedataframeCars93(MASSpackage)holdsextensiveinformationondatafrom93carsonsaleintheUSAin1993.Oneofthevariables,storedasafactor,isType.IhavecreatedadataframeCars93.summary,inwhichtherownamesarethedistinctvaluesofType,whilealatercolumnholdstwocharacterabbreviationsofeachofthecartypes,suitableforuseinplotting.>Cars93.summaryMin.passengersMax.passengersNo.of.carsabbrevCompact4616CLarge6611LMidsize4622MSmall4521SmSporty2414SpVan789VWeproceedthustoaddacolumnthathastheabbreviationstothedataframe.Herehoweverourdemandsaresimple,andwecanproceedthus:new.Cars93<-merge(x=Cars93,y=Cars93.summary[,4,drop=F],by.x="Type",by.y="row.names")Thiscreatesadataframethathastheabbreviationsintheadditionalcolumnwithname“abbrev”.IftherehadbeenrowswithmissingvaluesofType,thesewouldhavebeenomittedfromthenewdataframe.ThiscanbeavoidedbymakingsurethatTypehasNAasoneofitslevels,inbothdataframes.8.8DatesSinceversion1.9.0,thedatepackagehasbeensupersededbyfunctionsforworkingwithdatesthatareinRbase.Seehelp(Dates),help(as.Date)andhelp(format.Date)fordetailedinformation.Useas.Date()toconverttextstringsintodates.Thedefaultisthattheyearcomesfirst,thenthemonth,andthenthedayofthemonth,thus:>#ElectricityBillingDates>dd<-as.Date(c("2003/08/24","2003/11/23","2004/02/22","2004/05/>diff(dd)Timedifferencesof91,91,91daysUseformat()tosetorchangethewaythatadateisformatted.Thefollowingareaselectionofthesymbolsused:%d:day,asnumber%a:abbreviatedweekdayname(%A:unabbreviated)%m:month(00-12)%b:monthabbreviatedname(%B:unabbreviated)%y:finaltwodigitsofyear(%Y:allfourdigits)Thedefaultformatis"%Y-%m-%d".70 Thefunctionas.Date()willtakeavectorofcharacterstringsthathasanappropriateformat,andconvertitintoadatesobject.Bydefault,datesarestoredusingJanuary11970asorigin.Thisbecomesapparentwhenas.integer()isusedtoconvertadateintoanintegervalue.Hereareexamples:>as.Date("1/1/1960",format="%d/%m/%Y")[1]"1960-01-01">as.Date("1:12:1960",format="%d:%m:%Y")[1]"1960-12-01">as.Date("1960-12-1")-as.Date("1960-1-1")as.Date("1960-12-1")-as.Date("1960-1-1")>as.Date("31/12/1960","%d/%m/%Y")[1]"1960-12-31">as.integer(as.Date("1/1/1970","%d/%m/%Y")[1]0>as.integer(as.Date("1/1/2000","%d/%m/%Y"))[1]10957Thefunctionformat()allowscontroloftheformattingofdates.Seehelp(format.Date).>dec1<-as.Date("2004-12-1")>format(dec1,format="%b%d%Y")[1]"Dec012004">format(dec1,format="%a%b%d%Y")[1]"WedDec012004"8.9Exercisesa)ForthedataframeCars93,gettheinformationprovidedbysummary()foreachlevelofType.(Usesplit().)b)Determinethenumberofcars,inthedataframeCars93,foreachOriginandType.c)Removeallthedataframesandotherobjectsthatyouhaveaddedtotheworkingdirectory.[Ifyouhaveavectorthatholdsthenamesoftheobjectsthatwereinthedirectorywhenyoustarted,thefunctionadditions()willgivethenamesofobjectsthathavebeenadded.]d)Determinethenumberofdays,accordingtoR,betweenthefollowingdates:a)January1intheyear1700,andJanuary1intheyear1800b)January1intheyear1998,andJanuary1intheyear200071 9.WritingFunctionsandotherCodeWehavealreadymetseveralfunctions.HereisafunctiontoconvertFahrenheittoCelsius:>fahrenheit2celsius<-function(fahrenheit=32:40)(fahrenheit-32)*5/9>#Nowinvokethefunction>fahrenheit2celsius(c(40,50,60))[1]4.44444410.00000015.555556Thefunctionreturnsthevalue(fahrenheit-32)*5/9.Moregenerally,afunctionreturnsthevalueofthelaststatementofthefunction.Unlesstheresultfromthefunctionisassignedtoaname,theresultisprinted.Hereisafunctionthatprintsoutthemeanandstandarddeviationofasetofnumbers:>mean.and.sd<-function(x=1:10){+av<-mean(x)+sd<-sqrt(var(x))+c(mean=av,SD=sd)+}>>#Nowinvokethefunction>mean.and.sd()meanSD5.5000003.027650>mean.and.sd(hills$climb)meanSD1815.3141619.1519.1SyntaxandSemanticsAfunctioniscreatedusinganassignment.Ontherighthandside,theparametersappearwithinroundbrackets.Youcan,ifyouwish,giveadefault.Intheexampleabovethedefaultwasx=1:10,sothatuserscanrunthefunctionwithoutspecifyingaparameter,justtoseewhatitdoes.Followingtheclosing“)”thefunctionbodyappears.Exceptwherethefunctionbodyconsistsofjustonestatement,thisisenclosedbetweencurlybraces({}).Thereturnvalueusuallyappearsonthefinallineofthefunctionbody.Intheexampleabove,thiswasthevectorconsistingofthetwonamedelementsmeanandsd.9.1.1AFunctionthatgivesDataFrameDetailsFirstwewilldefineafunctionthatacceptsavectorxasitsonlyargument.Itwillallowustodeterminewhetherxisafactor,andifafactor,howmanylevelsithas.Thebuilt-infunctionis.factor()willreturnTifxisafactor,andotherwiseF.Thefollowingfunctionfaclev()usesis.factor()totestwhetherxisafactor.Itprintsout0ifxisnotafactor,andotherwisethenumberoflevelsofx.>faclev<-function(x)if(!is.factor(x))return(0)elselength(levels(x))Earlier,weencounteredthefunctionsapply()thatcanbeusedtorepeatacalculationonallcolumnsofadataframe.[Moregenerally,thefirstargumentofsapply()maybealist.]Toapplyfaclev()toallcolumnsofthedataframemothswecanspecify>sapply(moths,faclev)Wecanalternativelygivethedefinitionoffaclevdirectlyasthesecondargumentofsapply,thus>sapply(moths,function(x)if(!is.factor(x))return(0)elselength(levels(x)))Finally,wemaywanttodosimilarcalculationsonanumberofdifferentdataframes.Sowecreateafunctioncheck.df()thatencapsulatesthecalculations.Hereisthedefinitionofcheck.df().check.df<-function(df=moths)72 sapply(df,function(x)if(!is.factor(x))return(0)elselength(levels(x)))9.1.2CompareWorkingDirectoryDataSetswithaReferenceSetAtthebeginningofanewsession,wemightstorethenamesoftheobjectsintheworkingdirectoryinthevectordsetnames,thus:dsetnames<-objects()Nowsupposethatwehaveafunctionadditions(),definedthus:additions<-function(objnames=dsetnames){newnames<-objects(pos=1)existing<-as.logical(match(newnames,objnames,nomatch=0))newnames[!existing]}Atsomelaterpointinthesession,wecanenteradditions(dsetnames)togetthenamesofobjectsthathavebeenaddedsincethestartofthesession.9.2IssuesfortheWritingandUseofFunctionsTherecanbemanyfunctions.Choosetheirnamescarefully,sothattheyaremeaningful.Choosemeaningfulnamesforarguments,evenifthismeansthattheyarelongerthanonewouldlike.Rememberthattheycanbeabbreviatedinactualuse.Asfaraspossible,makecodeself-documenting.UsemeaningfulnamesforRobjects.Ensurethatthenamesusedreflectthehierarchiesoffiles,datastructuresandcode.Rallowstheuseofnamesforelementsofvectorsandlists,andforrowsandcolumnsofarraysanddataframes.Considertheuseofnamesratherthannumberswhenyoupulloutindividualelements,columnsetc.Thusdead.tot[,"dead"]ismoremeaningfulandsaferthandead.tot[,2].Settingsthatmayneedtochangeinlateruseofthefunctionshouldappearasdefaultsettingsforparameters.Uselists,wherethisseemsappropriate,togrouptogetherparametersthatbelongtogetherconceptually.Whereappropriate,provideademonstrationmodeforfunctions.Suchamodewillprintoutsummaryinformationonthedataand/orontheresultsofmanipulationspriortoanalysis,withappropriatelabelling.Thecodeneededtoimplementthisfeaturehastheside-effectofshowingbyexamplewhatthefunctiondoes,andmaybeusefulfordebugging.Breakfunctionsupintoasmallnumberofsub-functionsor“primitives”.Re-useexistingfunctionswhereverpossible.Writeanynew“primitives”sothattheycanbere-used.Thishelpsensurethatfunctionscontainwell-testedandwell-understoodcomponents.Watchther-helpelectronicmaillist(section13.3)forusefulfunctionsforroutinetasks.Whereverpossible,giveparameterssensibledefaults.Oftenagoodstrategyistouseasdefaultsparametersthatwillserveforademonstrationrunofthefunction.NULLisausefuldefaultwheretheparametermostlyisnotrequired,butwheretheparameterifitappearsmaybeanyoneofseveraltypesofdatastructure.Thetestif(!is.null())thendetermineswhetheroneneedstoinvestigatethatparameterfurther.Structurecomputationssothatitiseasytoretracethem.Forthisreasonsubstantialchunksofcodeshouldbeincorporatedintofunctionssoonerratherthanlater.Structurecodetoavoidmultipleentryofinformation.9.3FunctionsasaidstoDataManagementWheredata,labellingetcmustbepulledtogetherfromanumberofsources,andespeciallywhereyoumaywanttoretraceyourstepssomemonthslater,takethesamecareoverstructuringdataasoverstructuringcode.Thus73 ifthereisafactorialstructuretothedatafiles,choosefilenamesthatreflectit.Youcanthengeneratethefilenamesautomatically,usingpaste()togluetheseparateportionsofthenametogether.Listsareausefulmechanismforgroupingtogetheralldataandlabellinginformationthatonemaywishtobringtogetherinasinglesetofcomputations.Useasthenameofthelistauniqueandmeaningfulidentificationcode.Considerwhetheryoushouldincludeobjectsaslistitems,orwhetheridentificationbynameispreferable.Bearinmind,also,theuseofswitch(),withtheidentificationcodeusedtodeterminewhatswitch()shouldpickout,topulloutspecificinformationanddatathatisrequiredforaparticularrun.Concentrateinonefunctionthetaskofpullingtogetherdataandlabellinginformation,perhapswithsomesubsequentmanipulation,fromanumberofseparatefiles.Thisstructuresthecode,andmakesthefunctionasourceofdocumentationforthedata.Useuser-defineddataframeattributestodocumentyourdata.Forexample,giventhedataframeelasticcontainingtheamountofstretchandresultingdistanceofmovementofarubberband,onemightspecifyattributes(elasticband)$title<-“Extentofstretchofband,andResultingDistance”9.3.1GraphsUsegraphsfreelytoshedlightbothoncomputationsandondata.OneofR’sbigplusesisitstightintegrationofcomputationandgraphics.9.4ASimulationExampleWewouldliketoknowhowwellsuchastudentmightdobyrandomguessing,onamultiplechoicetestconsistingof100questionseachwithfivealternatives.Wecangetanideabyusingsimulation.EachquestioncorrespondstoanindependentBernoullitrialwithprobabilityofsuccessequalto0.2.Wecansimulatethecorrectnessofthestudentforeachquestionbygeneratinganindependentuniformrandomnumber.Ifthisnumberislessthan.2,wesaythatthestudentguessedcorrectly;otherwise,wesaythatthestudentguessedincorrectly.Thiswillwork,becausetheprobabilitythatauniformrandomvariableislessthan.2isexactly.2,whiletheprobabilitythatauniformrandomvariableexceeds.2isexactly.8,whichisthesameastheprobabilitythatthestudentguessesincorrectly.Thus,theuniformrandomnumbergeneratorissimulatingthestudent.Rcandothisasfollows:guesses<-runif(100)correct.answers<-1*(guesses<.2)correct.answersThemultiplicationby1causes(guesses<.2),whichiscalculatedasTRUEorFALSE,tobecoercedto1(TRUE)or0(FALSE).Thevectorcorrect.answersthuscontainstheresultsofthestudent'sguesses.A1isrecordedeachtimethestudentcorrectlyguessestheanswer,whilea0isrecordedeachtimethestudentiswrong.OnecanthuswriteanRfunctionthatsimulatesastudentguessingataTrue-Falsetestconsistingofsomearbitrarynumberofquestions.Weleavethisasanexercise.9.4.1PoissonRandomNumbersOnecanthinkofthePoissondistributionasthedistributionofthetotalforoccurrencesofrareevents.Forexample,theoccurrenceofanaccidentatanintersectiononanyonedayshouldbearareevent.ThetotalnumberofaccidentsoverthecourseofayearmaywellfollowadistributionthatisclosetoPoisson.[HoweverthetotalnumberofpeopleinjuredisunlikelytofollowaPoissondistribution.Why?]WecangeneratePoissonrandomnumbersusingrpois().Itissimilartotherbinomfunction,butthereisonlyoneparameter–themean.SupposeforexampletrafficaccidentsoccuratanintersectionwithaPoissondistributionthathasameanrateof3.7peryear.Tosimulatetheannualnumberofaccidentsfora10-yearperiod,wecanspecifyrpois(10,3.7).WepursuethePoissondistributioninanexercisebelow.74 9.5Exercises1.Usetheroundfunctiontogetherwithrunif()togenerate100randomintegersbetween0and99.Nowlookupthehelpforsample(),anduseitforthesamepurpose.2.Writeafunctionthatwilltakeasitsargumentsalistofresponsevariables,alistoffactors,adataframe,andafunctionsuchasmeanormedian.Itwillreturnadataframeinwhicheachvalueforeachcombinationoffactorlevelsissummarisedinasinglestatistic,forexamplethemeanorthemedian.3.Thesupplieddataframemilkhascolumnsfourandone.Seventeenpeopleratedthesweetnessofeachoftwosamplesofamilkproductonacontinuousscalefrom1to7,onesamplewithfourunitsofadditiveandtheotherwithoneunitofadditive.Hereisafunctionthatplots,foreachpatient,thefourresultagainsttheoneresult,butinsistingonthesamerangeforthexandyaxes.plot.one<-function(){xyrange<-range(milk)#Calculatestherangeofallvalues#inthedataframepar(pin=c(6.75,6.75))#Setplottingarea=6.75in.by6.75in.plot(four,one,data=milk,xlim=xyrange,ylim=xyrange,pch=16)abline(0,1)#Linewherefour=one}Rewritethisfunctionsothat,giventhenameofadataframeandofanytwoofitscolumns,itwillplotthesecondnamedcolumnagainstthefirstnamedcolumn,showingalsotheliney=x.4.Writeafunctionthatprints,withtheirrowandcolumnlabels,onlythoseelementsofacorrelationmatrixforwhichabs(correlation)>=0.9.5.Writeyourownwrapperfunctionforone-wayanalysisofvariancethatprovidesasidebysideboxplotofthedistributionofvaluesbygroups.Ifnoresponsevariableisspecified,thefunctionwillgeneraterandomnormaldata(nodifferencebetweengroups)andprovidetheanalysisofvarianceandboxplotinformationforthat.6.Writeafunctionthataddsatextstringcontainingdocumentationinformationasanattributetoadataframe.7.Writeafunctionthatcomputesamovingaverageoforder2ofthevaluesinagivenvector.Applytheabovefunctiontothedata(inthedatasethuronthataccompaniesthesenotes)forthelevelsofLakeHuron.Repeatforamovingaverageoforder3.8.Findawayofcomputingthemovingaveragesinexercise3thatdoesnotinvolvetheuseofaforloop.9.Createafunctiontocomputetheaverage,varianceandstandarddeviationof1000randomlygenerateduniformrandomnumbers,on[0,1].(Compareyourresultswiththetheoreticalresults:theexpectedvalueofauniformrandomvariableon[0,1]is0.5,andthevarianceofsucharandomvariableis0.0833.)10.Writeafunctionthatgenerates100independentobservationsonauniformlydistributedrandomvariableontheinterval[3.7,5.8].Findthemean,varianceandstandarddeviationofsuchauniformrandomvariable.Nowmodifythefunctionsothatyoucanspecifyanarbitraryinterval.11.Lookupthehelpforthesample()function.Useittogenerate50randomintegersbetween0and99,sampledwithoutreplacement.(Thismeansthatwedonotallowanynumbertobesampledasecondtime.)Now,generate50randomintegersbetween0and9,withreplacement.12.WriteanRfunctionthatsimulatesastudentguessingataTrue-Falsetestconsistingof40questions.Findthemeanandvarianceofthestudent'sanswers.Comparewiththetheoreticalvaluesof.5and.25.13.WriteanRfunctionthatsimulatesastudentguessingatamultiplechoicetestconsistingof40questions,wherethereischanceof1in5ofgettingtherightanswertoeachquestion.Findthemeanandvarianceofthestudent'sanswers.Comparewiththetheoreticalvaluesof.2and.16.14.WriteanRfunctionthatsimulatesthenumberofworkinglightbulbsoutof500,whereeachbulbhasaprobability.99ofworking.Usingsimulation,estimatetheexpectedvalueandvarianceoftherandomvariableX,whichis1ifthelightbulbworksand0ifthelightbulbdoesnotwork.Whatarethetheoreticalvalues?15.Writeafunctionthatdoesanarbitrarynumbernofrepeatedsimulationsofthenumberofaccidentsinayear,plottingtheresultinasuitableway.AssumethatthenumberofaccidentsinayearfollowsaPoissondistribution.Runthefunctionassuminganaveragerateof2.8accidentsperyear.75 16.Writeafunctionthatsimulatestherepeatedcalculationofthecoefficientofvariation(=theratioofthemeantothestandarddeviation),forindependentrandomsamplesfromanormaldistribution.17.Writeafunctionthat,foranysample,calculatesthemedianoftheabsolutevaluesofthedeviationsfromthesamplemedian.*18.Generaterandomsamplesfromnormal,exponential,t(2d.f.),andt(1d.f.),thus:a)xn<-rnorm(100)b)xe<-rexp(100)c)xt2<-rt(100,df=2)d)xt2<-rt(100,df=1)Applythefunctionfromexercise17toeachsample.Comparewiththestandarddeviationineachcase.*19.Thevectorxconsistsofthefrequencies5,3,1,4,6Thefirstelementisthenumberofoccurrencesoflevel1,thesecondisthenumberofoccurrencesoflevel2,andsoon.Writeafunctionthattakesanysuchvectorxasitsinput,andoutputsthevectoroffactorlevels,here111112223...[You’llneedtheinformationthatisprovidedbycumsum(x).Formavectorinwhich1’sappearwheneverthefactorlevelisincremented,andisotherwisezero....]*20.Writeafunctionthatcalculatestheminimumofaquadratic,andthevalueofthefunctionattheminimum.*21.A“betweentimes”correlationmatrix,hasbeencalculatedfromdataonheightsoftreesattimes1,2,3,4,...Writeafunctionthatcalculatestheaverageofthecorrelationsforanygivenlag.*22.Givendataontreesattimes1,2,3,4,...,writeafunctionthatcalculatesthematrixof“average”relativegrowthratesovertheseveralintervals.Applyyourfunctiontothedataframeratsthataccompaniesthesenotes.1dwdlogw[Therelativegrowthratemaybedefinedas.Henceitsisreasonabletocalculatethewdtdtlogwlogw21averageovertheintervalfromt1tot2as.]tt2176 77 *10.GLM,andGeneralNon-linearModelsGLMmodelsareGeneralizedLinearModels.Theyextendthemultipleregressionmodel.TheGAM(GeneralizedAdditiveModel)modelisafurtherextension.10.1ATaxonomyofExtensionstotheLinearModelRallowsavarietyofextensionstothemultiplelinearregressionmodel.Inthischapterwedescribethealternativefunctionalforms.41Thebasicmodelformulationis:Observedvalue=ModelPrediction+StatisticalErrorOftenitisassumedthatthestatisticalerrorvalues(valuesofinthediscussionbelow)areindependentlyandidenticallydistributedasNormal.GeneralizedLinearModels,andtheotherextensionswedescribe,allowavarietyofnon-normaldistributions.Inthediscussionofthissection,ourfocusisontheformofthemodelprediction,andweleaveuntillatersectionsthediscussionofdifferentpossibilitiesforthe“error”distribution.Multipleregressionmodely=+1x1+2x2+...+pxp+Uselm()tofitmultipleregressionmodels.Thevariousothermodelswedescribeare,inessence,generalizationsofthismodel.GeneralizedLinearModel(e.g.logitmodel)y=g(a+b1x1)+Hereg(.)isselectedfromoneofasmallnumberofoptions.Forlogitmodels,y,wherelog()abx111Hereisanexpectedproportion,andlog()logit()islog(odds).1Wecanturnthismodelaround,andwriteexp(abx)11yg(abx)111exp(abx)11Hereg(.)undoesthelogittransformation.Wecanaddmoreexplanatoryvariables:a+b1x1+...+bpxp.Useglm()tofitgeneralizedlinearmodels.AdditiveModely(x)(x)....(x)1122pp41Thismaybegeneralizedinvariousways.Modelswhichhavethisformmaybenestedwithinothermodelswhichhavethisbasicform.Thustheremaybe`predictions’and`errors’atdifferentlevelswithinthetotalmodel.78 Additivemodelsareageneralizationoflmmodels.In1dimensiony(x)11Someofz11(x1),z22(x2),...,zpp(xp)maybesmoothingfunctions,whileothersmaybetheusuallinearmodelterms.Theconstanttermgetsabsorbedintooneormoreofthes.GeneralizedAdditiveModelyg((x)(x)....(x))1122ppGeneralizedAdditiveModelsareageneralisationofGeneralizedLinearModels.Forexample,g(.)maybethefunctionthatundoesthelogittransformation,asinalogisticregressionmodel.Someofz11(x1),z22(x2),...,zpp(xp)maybesmoothingfunctions,whileothersmaybetheusuallinearmodelterms.Wecantransformtogetthemodelyg(zz...z)12pNoticethatevenifp=1,wemaystillwanttoretainboth(.)andg(.),i.e.1yg((x))11Thereasonisthatg(.)isaspecificfunction,suchastheinverseofthelogitfunction.Thefunction(.)does1anyfurthernecessarysmoothing,incaseg(.)isnotquitetherighttransformation.Onewantsg(.)todoasmuchofpossibleofthetaskoftransformation,with(.)givingthetransformationanynecessaryadditional1flourishes.Thefittingofspline(bs()orns())termsinalinearmodelorageneralizedlinearmodelcanbeagoodalternativetotheuseofafullgeneralizedadditivemodel.10.2LogisticRegressionWewillusealogisticregressionmodelasastartingpointfordiscussingGeneralizedLinearModels.Withproportionsthatrangefromlessthan0.1to0.99,itisnotreasonabletoexpectthattheexpectedproportionwillbealinearfunctionofx.Somesuchtransformation(`link’function)asthelogitisrequired.Agoodwaytothinkaboutlogitmodelsisthattheyworkonalog(odds)scale.Ifpisaprobability(e.g.thathorseAwillwintherace),thenthecorrespondingoddsarep/(1-p),andplog(odds)=log()=log(p)-log(1-p)1ppThelinearmodelpredicts,notp,butlog().Figure25showsthelogittransformation.1p79 ds)dO6g(.99lo40.e2i.,).75n00otir2-.1po0or4P-(it61-log.0000.00.20.40.60.81.0ProportionFigure25:Thelogitorlog(odds)transformation.Shownhereisaplotoflog(odds)versusproportion.Noticehowtherangeisstretchedoutatbothends.Thelogitorlog(odds)functionturnsexpectedproportionsintovaluesthatmayrangefrom-to+.Itisnotsatisfactorytousealinearmodeltopredictproportions.Thevaluesfromthelinearmodelmaywelllieoutsidetherangefrom0to1.Itishoweverinordertousealinearmodeltopredictlogit(proportion).Thelogitfunctionisanexampleofalinkfunction.Therearevariousotherlinkfunctionsthatwecanusewithproportions.Oneofthecommonestisthecomplementarylog-logfunction.10.2.1AnestheticDepthExampleThirtypatientsweregivenananestheticagentthatwasmaintainedatapre-determined[alveolar]concentration42for15minutesbeforemakinganincision.Itwasthennotedwhetherthepatientmoved,i.e.jerkedortwisted.Theinterestisinestimatinghowtheprobabilityofjerkingortwistingvarieswithincreasingconcentrationoftheanestheticagent.Theresponseisbesttakenasnomove,forreasonsthatwillemergelater.Thereisasmallnumberofconcentrations;sowebeginbytabulatingproportionthathavethenomoveoutcomeagainstconcentration.AlveolarConcentrationNomove0.811.21.41.62.506422001114442Total756642_____________________________________________Table1:Patientsmoving(0)andnotmoving(1),foreachofsixdifferentalveolarconcentrations.Figure25thendisplaysaplotoftheseproportions.42IamgratefultoJohnErickson(AnesthesiaandCriticalCare,UniversityofChicago)andtoAlanWelsh(CentreforMathematics&itsApplications,AustralianNationalUniversity)forallowingmeuseofthesedata.80 01.4280.ion666t0.opor4rP0.20.500.1.01.52.02.5ConcentrationFigure25:Plot,versusconcentration,ofproportionofpatientsnotmoving.Thehorizontallineistheestimateoftheproportionofmovesonewouldexpectiftheconcentrationhadnoeffect.Wefittwomodels,thelogitmodelandthecomplementarylog-logmodel.Wecanfitthemodelseitherdirectlytothe0/1data,ortotheproportionsinTable1.Tounderstandtheoutput,youneedtoknowabout“deviances”.Adeviancehasaroleverysimilartoasumofsquaresinregression.Thuswehave:RegressionLogisticregressiondegreesoffreedomdegreesoffreedomsumofsquaresdeviancemeansumofsquaresmeandeviance(dividebyd.f.)(dividebyd.f.)WeprefermodelswithasmallWeprefermodelswithasmallmeanmeanresidualsumofsquares.deviance.Ifindividualsrespondindependently,withthesameprobability,thenwehaveBernoullitrials.Justificationforassumingthesameprobabilitywillarisefromthewayinwhichindividualsaresampled.Whileindividualswillcertainlybedifferentintheirresponsethenotionisthat,eachtimeanewindividualistaken,theyaredrawnatrandomfromsomelargerpopulation.HereistheRcode:>anaes.logit<-glm(nomove~conc,family=binomial(link=logit),+data=anesthetic)Theoutputsummaryis:>summary(anaes.logit)Call:glm(formula=nomove~conc,family=binomial(link=logit),data=anesthetic)DevianceResiduals:Min1QMedian3QMax-1.77-0.7440.03410.6872.07Coefficients:ValueStd.Errortvalue(Intercept)-6.472.42-2.68conc5.572.042.7281 (DispersionParameterforBinomialfamilytakentobe1)NullDeviance:41.5on29degreesoffreedomResidualDeviance:27.8on28degreesoffreedomNumberofFisherScoringIterations:5CorrelationofCoefficients:(Intercept)conc-0.981Fig.26isagraphicalsummaryoftheresults:9.90n.8o0tirop.4or0P.101.000.00.51.01.52.02.5ConcentrationFigure26:Plot,versusconcentration,oflog(odds)[=logit(proportion)]ofpatientsnotmoving.Thelineistheestimateoftheproportionofmoves,basedonthefittedlogitmodel.Withsuchasmallsamplesizeitisimpossibletodomuchthatisusefultochecktheadequacyofthemodel.Tryalsoplot(anaes.logit).10.3glmmodels(GeneralizedLinearRegressionModelling)Intheabovewehadanaes.logit<-glm(nomove~conc,family=binomial(link=logit),data=anesthetic)Thefamilyparameterspecifiesthedistributionforthedependentvariable.Thereisanoptionalargumentthatallowsustospecifythelinkfunction.Belowwegivefurtherexamples.10.3.2DataintheformofcountsDatathatareintheformofcountscanoftenbeanalysedquiteeffectivelyassumingthepoissonfamily.Thelinkthatiscommonlyusedhereislog.Theloglinktransformsfrompositivenumberstonumbersintherange-to+thatalinearmodelmaypredict.10.3.3ThegaussianfamilyIfnofamilyisspecified,thenthefamilyistakentobegaussian.Thedefaultlinkisthentheidentity,asforanlmmodel.Thiswayofformulatinganlmtypemodeldoeshoweverhavetheadvantagethatoneisnotrestrictedtotheidentitylink.data(airquality)82 air.glm<-glm(Ozone^(1/3)~Solar.R+Wind+Temp,data=airquality)#Assumesgaussianfamily,i.e.normalerrorsmodelsummary(air.glm)10.4ModelsthatIncludeSmoothSplineTermsThesemakeitpossibletofitsplineandothersmoothtransformationsofexplanatoryvariables.Onecanrequesta`smooth’b-splineorn-splinetransformationofacolumnoftheXmatrix.Inplaceofxonespecifiesbs(x)orns(x).Onecancontrolthesmoothnessofthecurve,butoftenthedefaultworksquitewell.Youneedtoinstallthesplinespackage.Rdoesnotatpresenthaveafacilityforplotsthatshowthecontributionofeachtermtothemodel.10.4.1DewpointDataThedatasetdewpoint43hascolumnsmintemp,maxtempanddewpoint.Thedewpointvaluesareaverages,foreachcombinationofmintempandmaxtemp,ofmonthlydatafromanumberofdifferenttimesandlocations.Wefitthemodel:dewpoint=meanofdewpoint+smooth(mintemp)+smooth(maxtemp)Takingoutthemeanisacomputationalconvenience.Alsoitprovidesamorehelpfulformofoutput.Herearedetailsofthecalculations:dewpoint.lm<-lm(dewpoint~bs(mintemp)+bs(maxtemp),data=dewpoint)options(digits=3)summary(dewpoint.lm)10.5SurvivalAnalysisForexampletimesatwhichsubjectswereeitherlosttothestudyordied(“failed”)mayberecordedforindividualsineachofseveraltreatmentgroups.Engineeringorbusinessfailurescanbemodelledusingthissamemethodology.TheRsurvivalpackagehasstateoftheartabilitiesforsurvivalanalysis.10.6Non-linearModelsYoucanusenls()(non-linearleastsquares)toobtainaleastsquaresfittoanon-linearfunction.10.7ModelSummariesTypeinmethods(summary)togetalistofthesummarymethodsthatareavailable.Youmaywanttomixandmatch,e.g.summary.lm()onanaovorglmobject.Theoutputmaynotbewhatyoumightexpect.Sobecareful!10.8FurtherElaborationsGeneralisedLinearModelsweredevelopedinthe1970s.Theyunifiedahugerangeofdiversemethodology.Theyhavenowbecomeastock-in-tradeofstatisticalanalysts.Theirpracticalimplementationbuiltonthepowerfulcomputationalabilitiesthat,bythe1970s,hadbeendevelopedforhandlinglinearmodelcalculations.Practicaldataanalysisdemandsfurtherelaborations.Animportantelaborationistotheincorporationofmorethanonetermintheerrorstructure.TheRnlmepackageimplementssuchextensions,bothforlinearmodelsandforawideclassofnonlinearmodels.Eachsuchnewdevelopmentbuildsonthetheoreticalandcomputationaltoolsthathavearisenfromearlierdevelopments.Excitingnewanalysistoolswillcontinuetoappearforalongtimeyet.Thisisfortunate.Most43IamgratefultoDrEdwardLinacre,VisitingFellow,GeographyDepartment,AustralianNationalUniversity,formakingthesedataavailable.83 professionalusersofRwillregularlyencounterdatawherethemethodologythatthedataideallydemandsisnotyetavailable.10.9Exercises1.FitaPoissonregressionmodeltothedatainthedataframemothsthatAccompaniesthesenotes.Allowdifferentinterceptsfordifferenthabitats.Uselog(meters)asacovariate.10.10ReferencesDobson,A.J.1983.AnIntroductiontoStatisticalModelling.ChapmanandHall,London.Hastie,T.J.andTibshirani,R.J.1990.GeneralizedAdditiveModels.ChapmanandHall,London.MaindonaldJHandBraunWJ2003.DataAnalysisandGraphicsUsingR–AnExample-BasedApproach.CambridgeUniversityPress.ndMcCullagh,P.andNelder,J.A.,2edn.,1989.GeneralizedLinearModels.ChapmanandHall.Venables,W.N.andRipley,B.D.,2ndedn1997.ModernAppliedStatisticswithS-Plus.Springer,NewYork.84 85 *11.Multi-levelModels,RepeatedMeasuresandTimeSeries11.1Multi-LevelModels,IncludingRepeatedMeasuresModelsModelshavebothafixedeffectsstructureandanerrorstructure.Forexample,inaninter-laboratorycomparisontheremaybevariationbetweenlaboratories,betweenobserverswithinlaboratories,andbetweenmultipledeterminationsmadebythesameobserverondifferentsamples.Ifwetreatlaboratoriesandobserversasrandom,theonlyfixedeffectisthemean.Thefunctionslme()andnlme(),fromthePinheiroandBatesnlmepackage,handlemodelsinwhicharepeatedmeasureserrorstructureissuperimposedonalinear(lme)ornon-linear(nlme)model.Version3oflmeisbroadlycomparabletoProcMixedinthewidelyusedSASstatisticalpackage.Thefunctionlmehasassociatedwithithighlyusefulabilitiesfordiagnosticcheckingandforvariousinsightfulplots.Thereisastronglinkbetweenawideclassofrepeatedmeasuresmodelsandtimeseriesmodels.Inthetimeseriescontextthereisusuallyjustonerealisationoftheseries,whichmayhoweverbeobservedatalargenumberoftimepoints.Intherepeatedmeasurescontexttheremaybealargenumberofrealisationsofaseriesthatistypicallyquiteshort.11.1.1TheKiwifruitShadingData,AgainReferbacktosection5.8.2fordetailsofthesedata.Thefixedeffectsareblockandtreatment(shade).Therandomeffectsareblock(thoughmakingblockarandomeffectisoptional),plotwithinblock,andunitswithineachblock/plotcombination.Hereistheanalysis:>library(nlme)Loadingrequiredpackage:nls>kiwishade$plot<-factor(paste(kiwishade$block,kiwishade$shade,sep="."))>kiwishade.lme<-lme(yield~shade,random=~1|block/plot,data=kiwishade)>summary(kiwishade.lme)Linearmixed-effectsmodelfitbyREMLData:kiwishadeAICBIClogLik265.9663278.4556-125.9831Randomeffects:Formula:~1|block(Intercept)StdDev:2.019373Formula:~1|plot%in%block(Intercept)ResidualStdDev:1.4786393.490378Fixedeffects:yield~shadeValueStd.ErrorDFt-valuep-value(Intercept)100.202501.7616213656.88086<.0001shadeAug2Dec3.030831.86762961.622820.1558shadeDec2Feb-10.281671.8676296-5.505200.0015shadeFeb2May-7.428331.8676296-3.977410.0073Correlation:(Intr)shdA2DshdD2FshadeAug2Dec-0.5386 shadeDec2Feb-0.530.50shadeFeb2May-0.530.500.50StandardizedWithin-GroupResiduals:MinQ1MedQ3Max-2.4153887-0.5981415-0.06899480.78045971.5890938NumberofObservations:48NumberofGroups:blockplot%in%block312>anova(kiwishade.lme)numDFdenDFF-valuep-value(Intercept)1365190.552<.0001shade3622.2110.0012Thiswasabalanceddesign,whichiswhysection5.8.2coulduseaov()forananalysis.Wecangetanoutputsummarythatishelpfulforshowinghowtheerrormeansquaresmatchupwithstandarddeviationinformationgivenabovethus:>intervals(kiwishade.lme)Approximate95%confidenceintervalsFixedeffects:lowerest.upper(Intercept)96.62977100.202500103.775232shadeAug2Dec-1.539093.0308337.600757shadeDec2Feb-14.85159-10.281667-5.711743shadeFeb2May-11.99826-7.428333-2.858410RandomEffects:Level:blocklowerest.uppersd((Intercept))0.54730142.0193737.45086Level:plotlowerest.uppersd((Intercept))0.37025551.4786395.905037Within-groupstandarderror:lowerest.upper2.7706783.4903784.397024Weareinterestedinthethreeestimates.Bysquaringthestandarddeviationsandconvertingthemtovarianceswegettheinformationinthefollowingtable:VariancecomponentNotes2block2.019=4.076Threeblocks2plot1.479=2.1864plotsperblock2residual(withingroup)3.490=12.1804vines(subplots)perplotTheaboveallowsustoputtogethertheinformationforananalysisofvariancetable.Wehave:87 VarianceMeansquareforanovatabled.f.componentblock4.07612.180+42.186+164.0762=86.14(3-1)plot2.18612.180+42.1866=20.92(3-1)(2-1)residual(withingroup)12.18012.1834(4-1)Nowfindseewherethesesamepiecesofinformationappearedintheanalysisofvariancetableofsection5.8.2:>kiwishade.aov<-aov(yield~block+shade+Error(block:shade),data=kiwishade)>summary(kiwishade.aov)Error:block:shadeDfSumSqMeanSqFvaluePr(>F)block2172.3586.174.11760.074879shade31394.51464.8422.21120.001194Residuals6125.5720.93Error:WithinDfSumSqMeanSqFvaluePr(>F)Residuals36438.5812.1811.1.2TheTintingofCarWindowsInsection4.1weencountereddatafromanexperimentthataimedtomodeltheeffectsofthetintingofcar44windowsonvisualperformance.Theauthorsaremainlyinterestedineffectsonsidewindowvision,andhenceinvisualrecognitiontasksthatwouldbeperformedwhenlookingthroughsidewindows.Dataareinthedataframetinting.Inthisdataframe,csoa(criticalstimulusonsetasynchrony,i.e.thetimeinmillisecondsrequiredtorecogniseanalphanumerictarget),it(inspectiontime,i.e.thetimerequiredforasimplediscriminationtask)andagearevariables,whiletint(3levels)andtarget(2levels)areorderedfactors.Thevariablesexiscoded1formalesand2forfemales,whilethevariableagegpiscoded1foryoungpeople(allintheirearly20s)and2forolderparticipants(allintheearly70s).Wehavetwolevelsofvariation–withinindividuals(whowereeachtestedoneachcombinationoftintandtarget),andbetweenindividuals.Soweneedtospecifyid(identifyingtheindividual)asarandomeffect.Plotssuchasweexaminedinsection4.1makeitclearthat,togetvariancesthatareapproximatelyhomogeneous,weneedtoworkwithlog(csoa)andlog(it).Hereweexaminetheanalysisforlog(it).Westartwithamodelthatislikelytobemorecomplexthanweneed(ithasallpossibleinteractions):itstar.lme<-lme(log(it)~tint*target*agegp*sex,random=~1|id,data=tinting,method="ML")Areasonableguessisthatfirstorderinteractionsmaybeallweneed,i.e.it2.lme<-lme(log(it)~(tint+target+agegp+sex)^2,random=~1|id,data=tinting,method="ML")Finally,thereistheverysimplemodel,allowingonlyformaineffects:it1.lme<-lme(log(it)~(tint+target+agegp+sex),44Datarelatetothepaper:Burns,N.R.,Nettlebeck,T.,White,M.andWillson,J.1999.Effectsofcarwindowtintingonvisualperformance:acomparisonofelderlyandyoungdrivers.Ergonomics42:428-443.88 random=~1|id,data=tinting,method="ML")Notethatwehavefittedallthesemodelsbymaximumlikelihood.Thisissothatwecandotheequivalentofananalysisofvariancecomparison.Hereiswhatweget:>anova(itstar.lme,it2.lme,it1.lme)ModeldfAICBIClogLikTestL.Ratiop-valueitstar.lme1268.14618791.4503621.926906it2.lme217-3.74288350.7252318.8714411vs26.110930.7288it1.lme381.13817126.770227.4309152vs322.881050.0065Themodelthatlimitsattentiontofirstorderinteractionsisadequate.Wewillneedtoexaminethefirstorderinteractionsindividually.Forthiswere-fitthemodelusedforit2.lme,butnowwithmethod="REML".it2.reml<-update(it2.lme,method="REML")Wenowexaminetheestimatedeffects:>options(digits=3)>summary(it2.reml)$tTableValueStd.ErrorDFt-valuep-value(Intercept)6.052310.75891457.9754.17e-13tint.L0.226580.08901452.5471.19e-02tint.Q0.171260.09331451.8366.84e-02targethicon-0.240120.1010145-2.3781.87e-02agegp-1.134490.516722-2.1963.90e-02sex-0.745420.516722-1.4431.63e-01tint.L.targethicon-0.091930.0461145-1.9964.78e-02tint.Q.targethicon-0.007220.0482145-0.1508.81e-01tint.L.agegp-0.130750.0492145-2.6588.74e-03tint.Q.agegp-0.069720.0520145-1.3411.82e-01tint.L.sex0.097940.04921451.9914.83e-02tint.Q.sex-0.005420.0520145-0.1049.17e-01targethicon.agegp0.138870.05841452.3761.88e-02targethicon.sex-0.077850.0584145-1.3321.85e-01agegp.sex0.331640.3261221.0173.20e-01Becausetintisanorderedfactor,polynomialcontrastsareused.11.1.3TheMichelsonSpeedofLightDataHereisanexample,usingtheMichelsonspeedoflightdatafromtheVenablesandRipleyMASSpackage.ThemodelallowsthedeterminationtovarylinearlywithRun(from1to20),withtheslopevaryingrandomlybetweenthefiveexperimentsof20runseach.Weassumeanautoregressivedependencestructureoforder1.Weallowthevariancetochangefromoneexperimenttoanother.Maximumlikelihoodtestssuggestthatoneneedsatleastthiscomplexityinthevarianceanddependencestructuretorepresentthedataaccurately.AmodelthathasneitherfixednorrandomRuneffectsseemsallthatisjustifiedstatistically.Totestthis,oneneedstofitmodelswithandwithouttheseeffects,settingmethod=”ML”ineachcase,andcomparethelikelihoods.(Ileavethisasanexercise!)Forpurposesofdoingthistest,afirstorderautoregressivemodelwouldprobablybeadequate.Amodelthatignoresthesequentialdependenceentirelydoesgivemisleadingresults.>library(MASS)#ifneeded>data(michelson)#ifneeded>michelson$Run<-as.numeric(michelson$Run)#EnsureRunisavariable89 >mich.lme1<-lme(fixed=Speed~Run,data=michelson,random=~Run|Expt,correlation=corAR1(form=~1|Expt),weights=varIdent(form=~1|Expt))>summary(mich.lme1)Linearmixed-effectsmodelfitbyREMLData:michelsonAICBIClogLik11131142-546Randomeffects:Formula:~Run|ExptStructure:Generalpositive-definiteStdDevCorr(Intercept)46.49(Intr)Run3.62-1Residual121.29CorrelationStructure:AR(1)Formula:~1|ExptParameterestimate(s):Phi0.527Variancefunction:Structure:DifferentstandarddeviationsperstratumFormula:~1|ExptParameterestimates:123451.0000.3400.6460.5430.501Fixedeffects:Speed~RunValueStd.ErrorDFt-valuep-value(Intercept)86830.519428.46<.0001Run-22.4294-0.880.381Correlation:(Intr)Run-0.934StandardizedWithin-GroupResiduals:MinQ1MedQ3Max-2.912-0.6060.1090.7401.810NumberofObservations:100NumberofGroups:511.2TimeSeriesModelsTheRstatspackagehasanumberoffunctionsformanipulatingandplottingtimeseries,andforcalculatingtheautocorrelationfunction.Thereare(atleast)twotypesofmethod–timedomainmethodsandfrequencydomainmethods.Inthetimedomainmodelsmaybeconventional“shortmemory”modelswheretheautocorrelationfunctiondecaysquiterapidlytozero,ortherelativelyrecentlydeveloped“longmemory”timeseriesmodelswheretheautocorrelationfunctiondecaysveryslowlyasobservationsmoveapartintime.Acharacteristicof“longmemory”modelsis90 thatthereisvariationatalltemporalscales.Thusinastudyofwindspeedsitmaybepossibletocharacterisewindydays,windyweeks,windymonths,windyyears,windydecades,andperhapsevenwindycenturies.Rdoesnotyethavefunctionsforfittingthemorerecentlydevelopedlongmemorymodels.Thefunctionstl()decomposesatimesseriesintoatrendandseasonalcomponents,etc.Thefunctionsar()(for“autoregressive”models)andassociatedfunctions,andarima0()(“autoregressiveintegratedmovingaveragemodels”)fitstandardtypesoftimedomainshortmemorymodels.Notealsothefunctiongls()inthenlmepackage,whichcanfitrelativelycomplexmodelsthatmayhaveautoregressive,arimaandvariousothertypesofdependencestructure.Thefunctionspectrum()andrelatedfunctionsisdesignedforfrequencydomainor“spectral”analysis.11.3Exercises1.Usethefunctionacf()toplottheautocorrelationfunctionoflakelevelsinsuccessiveyearsinthedatasethuron.Dotheplotsbothwithtype=”correlation”andwithtype=”partial”.11.4ReferencesChambers,J.M.andHastie,T.J.1992.StatisticalModelsinS.WadsworthandBrooksColeAdvancedBooksandSoftware,PacificGroveCA.Diggle,Liang&Zeger1996.AnalysisofLongitudinalData.ClarendonPress,Oxford.Everitt,B.S.andDunn,G.1992.AppliedMultivariateDataAnalysis.Arnold,London.Hand,D.J.&Crowder,M.J.1996.Practicallongitudinaldataanalysis.ChapmanandHall,London.Little,R.C.,Milliken,G.A.,Stroup,W.W.andWolfinger,R.D.(1996).SASSystemsforMixedModels.SASInstituteInc.,Cary,NewCarolina.MaindonaldJHandBraunWJ2003.DataAnalysisandGraphicsUsingR–AnExample-BasedApproach.CambridgeUniversityPress.Pinheiro,J.C.andBates,D.M.2000.MixedeffectsmodelsinSandS-PLUS.Springer,NewYork.Venables,W.N.andRipley,B.D.,2ndedn1997.ModernAppliedStatisticswithS-Plus.Springer,NewYork.91 *12.AdvancedProgrammingTopics12.1.MethodsRisanobject-orientedlanguage.Objectsmayhavea“class”.Forfunctionssuchasprint(),summary(),etc.,theclassoftheobjectdetermineswhatactionwillbetaken.Thusinresponsetoprint(x),Rdeterminestheclassattributeofx,ifoneexists.Ifforexampletheclassattributeis“factor”,thenthefunctionwhichfinallyhandlestheprintingisprint.factor().Thefunctionprint.default()isusedtoprintobjectsthathavenotbeenassignedaclass.Moregenerally,theclassattributeofanobjectmaybeavectorofstrings.Ifthereare“ancestor”classes–parent,grandparent,...,thesearespecifiedinorderinsubsequentelementsoftheclassvector.Forexample,orderedfactorshavetheclass“ordered”,whichinheritsfromtheclass“factor”.Thus:>fac<-ordered(1:3)>class(fac)[1]"ordered""factor"Herefachastheclass“ordered”,whichinheritsfromtheparentclass“factor”.Thefunctionprint.ordered(),whichisthefunctionthatiscalledwhenyouinvokeprint()withanorderedfactor,couldberewrittentousethefactthat“ordered”inheritsfrom“factor”,thus:>print.orderedfunction(x,quote=FALSE){if(length(x)<=0)cat("ordered(0) ")elseNextMethod(“print”)cat("Levels:",paste(levels(x),collapse="<")," ")invisible(x)}Thesystemversionofprint.ordered()doesnotuseprint.factor().Thefunctionprint.glm()doesnotcallprint.lm(),eventhoughglmobjectsinheritfromlmobjects.Themechanismisavaialbleforuseifrequired.12.2ExtractingArgumentstoFunctionsHow,insideafunction,canoneextractthevalueassignedtoaparameterwhenthefunctionwascalled?Belowthereisafunctionextract.arg().Whenitiscalledasextract.arg(a=xx),wewantittoreturn“xx”.Whenitiscalledasextract.arg(a=xy),wewantittoreturn“xy”.Hereishowitisdone.extract.arg<-function(a){s<-substitute(a)as.character(s)}>extract.arg(a=xy)[1]“xy”Iftheargumentisafunction,wemaywanttogetattheargumentstothefunction.Hereishowonecandoitdeparse.args<-function(a){s<-substitute(a)92 if(mode(s)=="call"){#thefirstelementofa'call'isthefunctioncalled#sowedon'tdeparsethat,justthearguments.print(paste(“Thefunctionis:“,s[1],”()”,collapse=””))lapply(s[-1],function(x)paste(deparse(x),collapse=" "))}elsestop("argumentisnotafunctioncall")}Forexample:>deparse.args(list(x+y,foo(bar)))[1]"Thefunctionis:list()"[[1]][1]"x+y"[[2]][1]"foo(bar)"12.3ParsingandEvaluationofExpressionsWhenRencountersanexpressionsuchasmean(x+y)orcbind(x,y),therearetwosteps:1.Thetextstringisparsedandturnedintoanexpression,i.e.thesyntaxischeckedanditisturnedintocodethattheRcomputingenginecanmoreimmediatelyevaluate.2.Theexpressionisevaluated.Upontypinginexpression(mean(x+y))theoutputistheunevaluatedexpressionexpression(mean(x+y)).Settingmy.exp<-expression(mean(x+y))storesthisunevaluatedexpressioninmy.exp.Theactualcontentsofmy.exparealittledifferentfromwhatisprintedout.Rgivesyouasmuchinformationasitthinkshelpful.Notethatexpression(mean(x+y))isdifferentfromexpression(“mean(x+y)”),asisobviouswhentheexpressionisevaluated.Atextstringisatextstringisatextstring,unlessoneexplicitlychangesitintoanexpressionorpartofanexpression.Let’sseehowthisworksinpractice>x<-101:110>y<-21:30>my.exp<-expression(mean(x+y))>my.txt<-expression("mean(x+y)")>eval(my.exp)[1]131>eval(my.txt)[1]"mean(x+y)"Whatifwealreadyhave“mean(x+y)”storedinatextstring,andwanttoturnitintoanexpression?Theansweristousethefunctionparse(),butindicatethattheparameteristextratherthanafilename.Thus>parse(text="mean(x+y)")expression(mean(x+y))Westoretheexpressioninmy.exp2,andthenevaluateit93 >my.exp2<-parse(text="mean(x+y)")>eval(my.exp2)[1]131Hereisafunctionthatcreatesanewdataframefromanarbitrarysetofcolumnsofanexistingdataframe.Onceinthefunction,weattachthedataframesothatwecanleaveoffthenameofthedataframe,anduseonlythecolumnnamesmake.new.df<-function(old.df=austpop,colnames=c("NSW","ACT")){attach(old.df)on.exit(detach(old.df))argtxt<-paste(colnames,collapse=",")exprtxt<-paste("data.frame(",argtxt,")",sep="")expr<-parse(text=exprtxt)df<-eval(expr)names(df)<-colnamesdf}Toverifythatthefunctiondoeswhatitshould,typein>make.new.df()NSWACT119043....Thefunctiondo.call()makesitpossibletosupplythefunctionnameandtheargumentlistinseparatetextstrings.Whendo.callisuseditisonlynecessarytouseparse()ingeneratingtheargumentlist.Forexamplemake.new.df<-function(old.df=austpop,colnames=c("NSW","ACT")){attach(old.df)on.exit(detach(old.df))argtxt<-paste(colnames,collapse=",")listexpr<-parse(text=paste("list(",argtxt,")",sep=""))df<-do.call(“data.frame”,eval(listexpr))names(df)<-colnamesdf}12.4PlottingamathematicalexpressionThefollowing,givenwithoutexplanation,illustratessomeofthepossibilities.Itneedsbettererrorcheckingthanithasatpresent:plotcurve<-function(equation="y=sqrt(1/(1+x^2))",...){leftright<-strsplit(equation,split="=")[[1]]left<-leftright[1]#Theparttotheleftofthe"="right<-leftright[2]#Theparttotherightofthe"="expr<-parse(text=right)xname<-all.vars(expr)94 if(length(xname)>1)stop(paste("Therearemultiplevariables,i.e.",paste(xname,collapse="&"),"ontherightoftheequation"))if(length(list(...))==0)assign(xname,1:10)else{nam<-names(list(...))if(nam!=xname)stop("Clashofvariablenames")assign("x",list(...)[[1]])assign(xname,x)}y<-eval(expr)yexpr<-parse(text=left)[[1]]xexpr<-parse(text=xname)[[1]]plot(x,y,ylab=yexpr,xlab=xexpr,type="n")lines(spline(x,y))mainexpr<-parse(text=paste(left,"==",right))title(main=mainexpr)}Tryplotcurve()plotcurve("ang=asin(sqrt(p))",p=(1:49)/50)12.4SearchingRfunctionsforaspecifiedtoken.Atokenisasyntacticentity;forexamplefunctionnamesaretokens.Forexample,wesearchallfunctionsintheworkingdirectory.Thepurposeofusingunlist()inthecodebelowistochangemyfuncfromalistintoavectorofcharacterstrings.mygrep<-function(str){##AssignthenamesofallobjectsincurrentR##workingdirectorytothestringvectortempobj##tempobj<-ls(envir=sys.frame(-1))objstring<-character(0)for(iintempobj){myfunc<-get(i)if(is.function(myfunc))if(length(grep(str,deparse(myfunc))))objstring<-c(objstring,i)}return(objstring)}mygrep(“for”)#Findallfunctionsthatincludeaforloop95 13.RResources13.1RPackagesforWindowsTogetinformationonRpackages,goto:http://cran.r-project.orgTheAustralianlink(accessibleonlytousersinAustralia)is:http://mirror.aarnet.edu.au/pub/CRAN/ForWindows95etcbinaries,lookinhttp://mirror.aarnet.edu.au/pub/CRAN/windows/windows-9x/Lookinthedirectorycontribforpackages.Newpackagesarebeingaddedallthetime.SoitpaystochecktheCRANsitefromtimetotime.Also,watchforannouncementsontheelectronicmailinglistsr-helpandr-announce.13.2LiteraturewrittenbyexpertusersMuchliteraturethathasbeenwrittenforS-PLUSishighlyrelevantforR.Alzola,C.andHarrell,F.1997.AnIntroductiontoSandtheHmiscandDesignPackages.[AvailablefromCRANsites.Theexamplesarelargelymedical.]Burns,P.J.AGuidefortheUnwillingSUser.[AvailablefromCRANsites][Thestyleisleisurely.Howeverthisassumessomepriorknowledgeofcomputinglanguageterms.ItmaysuituserswithsomeinitialknowledgeofR.]Chambers,J.M.1998.ProgrammingwithData.AGuidetotheSLanguage.Springer-Verlag,NewYork.[Thisisabookforspecialists.]Chambers,J.M.andHastie,T.J.1992.StatisticalModelsinS.WadsworthandBrooksColeAdvancedBooksandSoftware,PacificGroveCA.[ThisisthebasicreferenceonRandS-PLUSmodelformulaeandmodels.]Dalgaard,P.2002.IntroductoryStatisticswithR.Springer,NewYork.[ThisisanR-basedintroductorytext,withabiostatisticalemphasis.]Fox,J.2002.AnRandS-PLUSCompaniontoAppliedRegression.SageBooks.MaindonaldJHandBraunWJ2003.DataAnalysisandGraphicsUsingR–AnExample-BasedApproach.CambridgeUniversityPress.[Thisisanintermediateleveltext.]Krause,A.andOlsen,M.1997.TheBasicsofSandS-PLUS.Springer1997.[Thisisanintroductorybook,ataboutthesamelevelasSpector.]Spector,P.1994.AnIntroductiontoSandS-PLUS.DuxburyPress.[Thisisareadableandcompactbeginner’sguidetotheSlanguage.]Venables,W.N.,Smith,D.M.andtheRDevelopmentCoreTeam.AnIntroductiontoR.NotesonR:AProgrammingEnvironmentforDataAnalysisandGraphics.[AcurrentversionisavailablefromCRANsites.ThisisderivedfromanoriginalsetofnoteswrittenbyBillVenablesandDaveSmithfortheSandS-PLUSenvironments.Venables,W.N.andRipley,B.D.,4thedn2002.ModernAppliedStatisticswithS.Springer,NY.[ThishasbecomeatextbookfortheuseofS-PLUSandRforappliedstatisticalanalysis.Itassumesafairlevelofstatisticalsophistication.Explanationiscareful,butoftenterse.Togetherwiththe‘Complements’itgivesbriefintroductionstoextensivepackagesoffunctionsthathavebeenwrittenoradaptedbyRipley,Venables,andanumberofotherstatisticians.Supplementarymaterial(`Complements’)isavailablefromhttp://www.stats.ox.ac.uk/pub/MASS4/.]Venables,W.N.andRipley,B.D.2000.SProgramming.Springer2000.ThisisaterseandcarefulintroductiontothedialectsoftheSlanguage,includingR.RDevelopmentCoreTeam.AnIntroductiontoR.[AvailablefromCRANsites]AvarietyoffurtherdocumentsareavailablefromCRANsites.96 14.Appendix114.1DataSetsReferredtointheseNotesDatasetsincludedintheimagefileusingR.RDataCars93.summaryaisanestheticaustpopdewpointdolphinselasticbandfloridahillshuronislandcitieskiwishadeleafshapemilkmothsoddbooksoringspossumprimatesrainforestseedratestintingDataSetsfromPackagedatasetsairqualityattitudecarsIslandsLakeHuronDataSetsfromPackagelatticebarleyDataSetsfromPackageMASSAids2AnimalsCars93PlantGrowthRubbercementcpusfglmichelsonmtcarspainterspressureships14.2AnswerstoSelectedExercisesSection1.61.plot(distance~stretch,data=elasticband)2.(ii),(iii),(iv)plot(snow.cover~year,data=snow)hist(snow$snow.cover)hist(log(snow$snow.cover))Section2.71.Thevalueofansweris(a)12,(b)22,(c)600.2.prod(c(10,3:5))3(i)bigsum<-0;for(iin1:100){bigsum<-bigsum+i};bigsum3(ii)sum(1:100)4(i)bigprod<-1;for(iin1:50){bigprod<-bigprod*i};bigprod4(ii)prod(1:50)5.radius<-3:20;volume<-4*pi*radius^3/3sphere.data<-data.frame(radius=radius,volume=volume)6.sapply(tinting,is.factor)sapply(tinting[,4:6],levels)97 sapply(tinting[,4:6],is.ordered)Section3.71.plot(Animals$body,Animals$brain,pch=1,xlab="Bodyweight(kg)",ylab="Brainweight(g)")2.plot(log(Animals$body),log(Animals$brain),pch=1,xlab="Bodyweight(kg)",ylab="Brainweight(g)",axes=F)brainaxis<-10^seq(-1,4)bodyaxis<-10^seq(-2,4)axis(1,at=log(bodyaxis),lab=bodyaxis)axis(2,at=log(brainaxis),lab=brainaxis)box()identify(log(Animals$body),log(Animals$brain),labels=row.names(Animals))(Seeproblem4.)3.par(mfrow=c(1,2)),etc.Section7.91.x<-seq(101,112)orx<-101:1122.rep(c(4,6,3),4)3.c(rep(4,8),rep(6,7),rep(3,9))orrep(c(4,6,3),c(8,7,9))4.rep(seq(1,9),seq(1,9))orrep(1:9,1:9)5.Usesummary(airquality)togetthisinformation.6(a)2775121246(b)2986171577.airquality[airquality$Ozone==max(airquality$Ozone),]airquality$Wind[airquality$Ozone>quantile(airquality$Ozone,.75)]8.mean(snow$snow.cover[seq(2,10,2)])mean(snow$snow.cover[seq(1,9,2)])9.summary(airquality);summary(attitude);summary(cpus)Commentonrangesofvalues,whetherdistributionsseemskew,etc.10.mtcars6<-mtcars[mtcars$cyl==6,]11.Cars93[Cars93$Type==”Small”|Cars93$Type==”Sporty”,]12.mat34<-matrix(rep(c(4,6,3),4),nrow=3,ncol=4)13.mat64<-matrix(c(rep(4,8),rep(6,7),rep(3,9)),nrow=6,ncol=4)Additionalsolutionsmaybeincludedinlaterversionsofthisdocument.98

当前文档最多预览五页,下载文档查看全文

此文档下载收益归作者所有

当前文档最多预览五页,下载文档查看全文
温馨提示:
1. 部分包含数学公式或PPT动画的文件,查看预览时可能会显示错乱或异常,文件下载后无此问题,请放心下载。
2. 本文档由用户上传,版权归属用户,天天文库负责整理代发布。如果您对本文档版权有争议请及时联系客服。
3. 下载前请仔细阅读文档内容,确认文档内容符合您的需求后进行下载,若出现内容与标题不符可向本站投诉处理。
4. 下载文档时可能由于网络波动等原因无法下载或下载错误,付费完成后未能成功下载的用户请联系客服处理。
大家都在看
近期热门
关闭