Analysis of Phylogenetics And Evolution With R.pdf

Use R!
Series Editors: Robert Gentleman, Kurt Hornik, Giovanni Parmigiani

Use R!
Paradis: Analysis of Phylogenetics and Evolution with R
Pfaff: Analysis of Integrated and Cointegrated Time Series with R

Emmanuel Paradis
Analysis of Phylogenetics and Evolution with R

Emmanuel Paradis
Institut de Recherche pour le Développement
UR175 Caviar
GAMET-BP 5095
361 rue Jean François Breton
F-34196 Montpellier cédex 5
France
Emmanuel.Paradis@mpl.ird.fr

Series Editors:

Robert Gentleman
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
Seattle, Washington, 98109-1024
USA

Kurt Hornik
Department für Statistik und Mathematik
Wirtschaftsuniversität Wien
Augasse 2-6
A-1090 Wien
Austria

Giovanni Parmigiani
The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins University
550 North Broadway
Baltimore, MD, 21205-2011
USA

Library of Congress Control Number: 2006923823
ISBN 0-387-32914-5
ISBN 978-0387-32914-7

Printed on acid-free paper.

© 2006 Springer Science+Business Media, LLC
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed in the United States of America. (MVY)
9 8 7 6 5 4 3 2 1
springer.com

to Laure
Preface

As a result, the inference of phylogenies often seems divorced from any connection to other methods of analysis of scientific data.
Felsenstein

Once calculation became easy, the statistician's energies could be devoted to understanding his or her dataset.
Venables & Ripley

The study of the evolution of life on Earth stands as one of the most complex fields in science. It involves observations from very different sources, and has implications far beyond the domain of basic science. It is concerned with processes occurring on very long time spans, and we now know that it is also important for our daily lives as shown by the rapid evolution of many pathogens.

As a field ecologist, for a long time I was remotely interested in phylogenetics and other approaches to evolution. Most of the work I accomplished during my doctoral studies involved field studies of small mammals and estimation of demographic parameters. Things changed in 1996 when my interest was attracted by the question of the effect of demographic parameters on bird diversification. This was a new issue for me, so I searched for relevant data analysis methods, but I failed to find exactly what I needed. I started to conduct my own research on this problem to propose some, at least partial, solutions. This work made me realize that this kind of research critically depends on the available software, and it was clear to me that what was offered to phylogeneticists at this time was inappropriate.

I first read about R in 1998 while I was working in England: I first tried it on my computer in early 1999 after I got a position in France. I quickly thought that R seemed to be the computing system that is needed for developing phylogenetic methods: versatile, flexible, powerful, with great graphical possibilities, and free.
When I first presented the idea to develop programs written in R for phylogenetic analyses in 2001, the reactions from my colleagues were mixed with enthusiasm and scepticism. The perspective of creating a single environment for phylogenetic analysis was clearly exciting, but some concerns were expressed about the computing performance of R which, it was argued, could not match those of traditional phylogenetic programs. Another criticism was that biologists would be discouraged from using a program with a command-line interface.

The first version of the R package ape was eventually released in August 2002. The reactions from some colleagues showed me that related projects were undertaken elsewhere. The progress accomplished has been much more than I expected, and the perspectives are far reaching.

Writing a book on phylogenetics with R is an opportunity to bring together pieces of information from various sources, programs, and packages, as well as discussing a few ideas. I realize that the scope of the book is large, and the treatment may seem superficial in some places, but it was important to treat the present topics in a concise manner. It was not possible to explore all the potentialities now offered by R and its packages written for phylogenetic analysis. Similarly, I tried to explain the underlying concepts of the methods, sometimes illustrated with R codes, but I meant to keep it short as well.

I must first thank the "R community" of developers and users from whom I learned much about R through numerous exchanges on the Internet: this definitely helped me to find my way and envision the development of ape. Julien Claude has shared the venture of developing programs in R and contributing to ape since he was a doctoral student. A great thank you to those who contributed some codes to ape: Korbinian Strimmer, Gangolf Jobb, Rainer Opgen-Rhein, Julien Dutheil, Yvonnick Noël, and Ben Bolker. I must emphasize that all these authors should have full credit for their contributions. I am grateful to Olivier François and Michael Blum for showing me the possibilities of their package apTreeshape.

Several colleagues kindly read some parts of the manuscript: Lounès Chikki, Julien Claude, Jean Lobry, Jean-François Renno, Christophe Thébaud, Fabienne Thomarat, and several colleagues who chose to remain anonymous. Thanks to all of them! Special thanks to Susan Holmes for encouragement and some critical comments. Thank you to Elizabeth Purdom and Julien Dutheil for discussions about ape and R programming.

I am sincerely thankful to John Kimmel at Springer for the opportunity to write this book, and for managing all practical aspects of this project. Finally, many thanks to Diane Sahadeo for handling my manuscript to make it an actual book.

Jakarta, April 2006
Emmanuel Paradis

Contents

1 Introduction
  1.1 Strategic Considerations
  1.2 Notations
  1.3 Preparing the Computer
    1.3.1 Installations
    1.3.2 Configurations
2 First Steps in R for Phylogeneticists
  2.1 The Command Line Interface
  2.2 The Data Structures
    2.2.1 Vector
    2.2.2 Factor
    2.2.3 Matrix
    2.2.4 Data Frame
    2.2.5 List
  2.3 The Help System
  2.4 Creating Graphics
  2.5 Saving and Restoring R Data
  2.6 Using R Functions
  2.7 Repeating Commands
    2.7.1 Loops
    2.7.2 Apply-Like Functions
  2.8 Exercises
3 Phylogenetic Data in R
  3.1 Phylogenetic Data as R Objects
    3.1.1 The Class "phylo" (ape)
    3.1.2 The Class "phylog" (ade4)
    3.1.3 The Class "matching" (ape)
    3.1.4 The Class "treeshape" (apTreeshape)
  3.2 Reading Phylogenetic Data
    3.2.1 Phylogenies
    3.2.2 Reading Internet Tree Databases
    3.2.3 Molecular Sequences
  3.3 Writing Data
  3.4 Manipulating Data
    3.4.1 Basic Tree Manipulation
    3.4.2 Rooted Versus Unrooted Trees
    3.4.3 Dichotomous Versus Multichotomous Trees
    3.4.4 Summarizing and Comparing Trees
    3.4.5 Converting Objects
    3.4.6 Manipulating DNA Data
  3.5 Generating Random Trees
  3.6 Case Studies
    3.6.1 Sylvia Warblers
    3.6.2 Phylogeny of the Felidae
    3.6.3 Snake Venom Proteome
    3.6.4 Mammalian Mitochondrial Genomes
    3.6.5 Butterfly DNA Barcodes
  3.7 Exercises
4 Plotting Phylogenies
  4.1 Simple Tree Drawing
    4.1.1 Annotating Trees
    4.1.2 Showing Clades
  4.2 Combining Plots
  4.3 Large Phylogenies
  4.4 Perspectives
  4.5 Exercises
5 Phylogeny Estimation
  5.1 Distance Methods
    5.1.1 Calculating Distances
    5.1.2 Simple Clustering and UPGMA
    5.1.3 Neighbor-Joining
  5.2 Maximum Likelihood Methods
    5.2.1 Substitution Models: A Primer
    5.2.2 Estimation with Molecular Sequences
    5.2.3 Finding the Maximum Likelihood Tree
    5.2.4 DNA Mining with PHYML
  5.3 Bootstrap Methods and Distances Between Trees
    5.3.1 Resampling Phylogenetic Data
    5.3.2 Bipartitions and Computing Bootstrap Values
    5.3.3 Distances Between Trees
    5.3.4 Consensus Trees
  5.4 Molecular Dating
  5.5 Case Studies
    5.5.1 Sylvia Warblers
    5.5.2 Phylogeny of the Felidae
    5.5.3 Butterfly DNA Barcodes
  5.6 Perspectives
  5.7 Exercises
6 Analysis of Macroevolution with Phylogenies
  6.1 Phylogenetic Comparative Methods
    6.1.1 Phylogenetically Independent Contrasts
    6.1.2 Phylogenetic Autoregression
    6.1.3 Autocorrelative Models
    6.1.4 Multivariate Decomposition
    6.1.5 Generalized Least Squares
    6.1.6 Generalized Estimating Equations
    6.1.7 Mixed Models and Variance Partitioning
    6.1.8 The Ornstein–Uhlenbeck Model
    6.1.9 Perspectives
  6.2 Estimating Ancestral Characters
    6.2.1 Continuous Characters
    6.2.2 Discrete Characters
  6.3 Analysis of Diversification
    6.3.1 Graphical Methods
    6.3.2 Birth–Death Models
    6.3.3 Survival Models
    6.3.4 Goodness-of-Fit Tests
    6.3.5 Tree Shape and Indices of Diversification
  6.4 Perspectives
  6.5 Case Studies
    6.5.1 Sylvia Warblers
    6.5.2 Phylogeny of the Felidae
  6.6 Exercises
7 Developing and Implementing Phylogenetic Methods in R
  7.1 Features of R
    7.1.1 Object-Orientation
    7.1.2 Variable Definition and Scope
    7.1.3 How R Works
  7.2 Writing Functions in R
  7.3 Interfacing R with Other Languages
    7.3.1 Simple Interfaces
    7.3.2 Complex Interfaces
  7.4 Writing R Packages
    7.4.1 A Minimalist Package
    7.4.2 The Documentation System
  7.5 Performance Issues and Strategies
References
Index
1 Introduction

Phylogenetics is the science of the evolutionary relationships among species. Recently, the term has come to include broader issues such as estimating rates of evolution, dating divergence among species, reconstructing ancestral characters, or quantifying adaptation, all these using phylogenies as frameworks.

Computers seem to have been used by phylogeneticists as soon as they were available in research departments [28]. Since then, progress has been obvious in two parallel directions: biological databases, particularly for molecular sequences, have increased in quantity at an exponential rate and, at the same time, computing power has grown at an expanding pace. These concurrent escalations have resulted in the challenge of analyzing larger and larger data sets using more and more complex methods.

The current complexity of phylogenetic analyses implies some strategic choices. This chapter explains the advantages of R as a system for phylogenetic analyses.

1.1 Strategic Considerations

How data are stored, handled, and analyzed with computers is a critical issue. This is a strategic choice because it conditions what can subsequently be done with more or less efficiency. R is a language and environment for statistical and graphical analyses [74]. It is flexible, powerful, and can be interfaced with several systems and languages. R has many attractive features: we concentrate on four of them that are critical for phylogenetic analyses.

Integration

Phylogenetics covers a wide area of related issues. Analyzing phylogenetic data often implies doing different analyses such as tree estimation, dating divergence times, and estimating speciation rates. The implementation of these methods in R enhances their integration under a single user interface. It should be pointed out that although the development of phylogenetic methods in R is relatively recent, a remarkable range of methods is already available.

Integration is not new among phylogenetic analysis programs and the most widely used ones cover a wide range of methods. However, this feature, combined with those detailed below, has a particular importance not observed in these programs.

A less obvious aspect of integration is the possibility of using different languages and systems from the same user interface. This is called intersystems interfaces and has been particularly developed in R [49]. The most commonly used interfaces in R are with programs written in C, C++, or Fortran, but there exist interfaces with Perl, Python, and Java (see http://www.omegahat.org/). The gain from these interfaces is enormous: developers can use the languages or systems they prefer to implement their new methods, and users do not have to learn a new interface to access the latest methodological developments.

Interactivity

Interactivity is critical in the analysis of large data sets with a great variety of methods. Exploratory analyses are crucial for assessing data heterogeneity. Selecting an appropriate model for estimation often requires fitting several models interactively. Examination of model output is also often very useful (e.g., plot of regression diagnostics).

In phylogenetic analyses, the usual computer program strategy follows a "black box" model where some data, stored in files, are read by a specific program, some computations are made, and the results are written into a file on the disk. What happens in the program cannot be accessed by the user. Several program executions can be combined using a scripting language, but such programming tasks are generally limited.

R does not follow this model. In R, the data are read from files and stored in active memory: they can be manipulated, plotted, analyzed, or written into files. The results of analyses are treated exactly in the same way as data. In R's jargon, the data in memory are called objects. Considering data as objects makes good sense in phylogenetics because this allows us to manipulate different kinds of data (trees, phenotypical data, geographical data) simultaneously and interactively.

Programmability

Data analyses are almost always made of a series of more or less simple tasks. These analyses need to be repeated for many reasons. The most usual situation is that new data have been collected and previous analyses need to be updated. It is thus very useful to automate such analyses, particularly if they are composed of a long series of smaller analyses.

R is a flexible and powerful language that can be used for simple tasks as well as for combining a series of analyses. The programmability of R can be used at a more complex level to develop new methods (Chapter 7). R is an interpreted language, meaning that there is no need to develop a full program to perform an analysis. A simple command may need a single line.

Programmability is important in the context of scientific repeatability. Writing programs that perform data analyses (often called scripts) ensures better readability, and improves repeatability by others [49]. In this context, there exist some sophisticated solutions, such as Sweave (in the package utils) which mixes data analysis commands with R and text processing with LaTeX [91] (see also ?Sweave in R).

Evolvability

Phylogenetic methods have considerably evolved for several decades, and this is likely to go on in the future. An efficient data analysis system needs to evolve with respect to the new methodological developments.

Programs written in R are easy to maintain because programming in this language is very simple. Bugs are much easier to find and fix than in a compiled language inasmuch as there is no need to manage memory allocation (one of the main time-consuming tasks of programmers). R's syntax and function definitions ensure compatibility through time in most cases. For instance, consider a function called foo which has a single argument x. Thus the user will call this function with something such as:

foo(x = mydata)

If, for any reason, foo changes to include other options that have default values, say y = TRUE and z = FALSE, then the above command will still work with the new version of foo.

In addition, the internal structure and functionalities of R evolve with respect to technological developments. Thus using R as a computing environment eases tracking novelties in this area.

R has other strengths as a computing environment. It is scalable: it can run on a variety of computers with a range of hardware, and can be adapted for the analysis of large data sets. On the lower bound, R can run with as few as 16 Mb of RAM (for instance, R can be run under Linux with 32 Mb of RAM), whereas on the upper bound R can be compiled and run on 64-bit computers and thus use more than 4 Gb of RAM. Furthermore, there are packages to run R on multiprocessor computers.

R has very good computing performance: most of its operations are vectorized, meaning that as little time as possible is spent on the evaluation of commands. The graphical environment of R is flexible and powerful, giving many possibilities for graphical analyses (Chapter 4).

R is an environment suitable both for basic users (e.g., biologists) and for developers. This considerably enhances the transfer of new methodological developments. R can run on most current operating systems: all commands are fully compatible across systems (they are portable in computer jargon).

Finally, R is distributed under the terms of the GNU General Public License, meaning that its distribution is free, it can be freely modified, and it can be redistributed under certain conditions (see the file "RHOME/COPYING" for details). There have been numerous discussions, particularly on the Internet, about the advantages and inconveniences of free software. The crucial points are not that R is free to download and install (this is true for much industrial software), but that it can be modified by the user, and its development is open to contributions (for obvious practical reasons, a limited number of persons, namely, the members of the R Core Team, can modify the original sources). Although it is hard to assess, it is reasonable to assume that such an open model of software development is more efficient, but not always more attractive to all users, than a proprietary model (see [49] for some views on this issue). All computer programs presented in this book are freely distributed.

1.2 Notations

Commands typed in R are printed with a fixed-spaced font, usually on separate lines. The same font is used for the names of objects in R (functions, data, options). Names of packages are printed with a sans-serif font. When necessary, a command is preceded by the symbol >, which is the usual prompt in R, to distinguish what is typed by the user from what is printed (or returned) by R. For instance:

> x <- 1
> x
[1] 1

In the R language, # specifies a comment: everything after this character is ignored until the next line. This is sometimes used in the printed commands:

mean(x) # get the mean of x

When an output from R is too long, it is cut after "....". For instance, if we look at the content of the function plot:

> plot
function (x, y, ...)
{
    if (is.null(attr(x, "class")) && is.function(x)) {
....

Names of files are within 'single quotes'. Their contents are indicated within a frame:

x y
1 3.5
2 6.9

1.3 Preparing the Computer

R is a modular system: a base installation is composed of a few dozen packages for reading/writing data, classical data analysis methods, and computational statistical utilities. Several hundred contributed packages add many specialized methods (a complete list of R's packages with descriptions can be found at http://cran.r-project.org/src/contrib/PACKAGES.html). Note that in R's terminology, a package is a set of files that perform some specific tasks within R, and that include the related documentation and any needed files. An R package requires R to run.

R can be installed on a wide range of operating systems: sources and precompiled versions, as well as the installation instructions, can be found at the Comprehensive R Archive Network (CRAN):

http://cran.r-project.org/

1.3.1 Installations

Phylogenetic analyses in R use, of course, the default R packages, but also a few specialized ones that need to be installed by the user. Table 1.1 lists the packages that are discussed in this book.

Table 1.1. R packages used for phylogenetic analyses. The packages marked with (d) are installed by default with R. The "Requires" column indicates the nondefault R packages that are needed

Name          Title                                          Requires
base (d)      R base package                                 -
stats (d)     R stats package                                -
graphics (d)  R graphics package                             -
nlme (d)      Mixed-effects models                           -
lattice (d)   Lattice graphics                               -
ape           Analyses of phylogenetics and evolution        gee
apTreeshape   Analyses of phylogenetic tree shape            ape
ade4          Analysis of environmental data                 -
seqinr        Exploratory analyses of molecular sequences    -
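
Before installing anything, it can help to see which of the packages in Table 1.1 are already present on the computer. The following is a minimal sketch and not part of the original text: the helper name is_installed is hypothetical, and the check relies on system.file returning an empty string when a package cannot be found.

```r
# Packages listed in Table 1.1 (names taken from the table).
pkgs <- c("base", "stats", "graphics", "nlme", "lattice",
          "ape", "apTreeshape", "ade4", "seqinr")

# Hypothetical helper: system.file() returns "" when a package
# cannot be found in the library paths.
is_installed <- function(p) nzchar(system.file(package = p))

# One logical value per package, named after the package.
status <- vapply(pkgs, is_installed, logical(1))
print(status)

# Names of the packages that would still need install.packages().
to_install <- names(status)[!status]
print(to_install)
```

The vector to_install could then be passed to install.packages, provided those packages are still distributed on CRAN.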
The installation of R packages depends on the way R was installed, but usually the following command in R will work provided the computer is connected to the Internet:

install.packages("ape")

and the same for all needed packages. Once the packages are installed, they are available for use after being loaded in memory, which is usually done by the user:

> library(ape)
Loading required package: gee
Loading required package: nlme
Loading required package: lattice
> library(ade4)
> library(seqinr)

ape is dedicated to phylogenetic and evolutionary analyses, thus we concentrate a large part of our attention on this package. apTreeshape deals with the analysis of tree shape and has several functions to query tree databases through the Internet. ade4 is dedicated to the analysis of environmental data, but it has several functionalities that complement ape. seqinr (sequences in R) is a package for reading and handling molecular sequences (protein and DNA). It has some functions for graphical and exploratory analyses of this kind of data.

Most R packages include a few data sets to illustrate how the functions can be used. These data are loaded in memory with the function data. We of course use them in our examples.

In addition to these add-on packages, it is useful to have the computer connected to the Internet because some functions connect to remote databases (e.g., ape and seqinr can read DNA sequences in GenBank).

Other programs may be required in some applications. PHYML is called by ape with its function phymltest; it is available at

http://atgc.lirmm.fr/phyml/

A multiple sequence alignment program is also very useful because this operation is not really feasible in R. ClustalX [153] is widely used and available for most operating systems. There are also several interfaces to the Clustal computing engine, such as the Web-interface ClustalWWW [17]. ClustalX is available at

http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/

Note the existence of the R package dna by Jim Lindsey which includes a version of ClustalW (the computing engine of ClustalX). It is available at

http://popgen.unimaas.nl/~jlindsey/rcode.html
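
The function data, mentioned earlier in this section for loading the data sets shipped with packages, can be tried without any add-on package. The sketch below is an illustration rather than an example from the book: it loads the women data set from the datasets package distributed with R, a choice made here only for demonstration.

```r
# Load a data set shipped with a default package into memory.
data(women, package = "datasets")

# Inspect its structure: a data frame with two numeric columns.
str(women)

# The object can now be used like any other object in memory.
mean(women$height)
```

The same call pattern, with the package argument changed, would apply to the data sets distributed with ape, ade4, or seqinr once those packages are installed.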
We use ClustalX because sequence alignments can be graphically visualized.

In addition to these required programs, a few others are useful when using R. Emacs is a flexible text editor that runs under most operating systems. It can be used to edit R programs. Installing the ESS (Emacs Speaks Statistics) package allows syntax highlighting, and other facilities such as running R within Emacs. Emacs and ESS can be downloaded at, respectively

http://www.gnu.org/software/emacs/emacs.html
http://ess.r-project.org/

Ghostscript and GSview are two programs to view and convert files in PostScript and PDF formats: they can be used to view the figures made with R (Chapter 4). They can be downloaded at

http://www.cs.wisc.edu/~ghost/

Finally, a Web browser is useful to view the R help pages in HTML format.

1.3.2 Configurations

Once all packages and software are installed, the computer is ready. There is no special need for the location of data files: they are accessed in the usual way by R. When R is started, a working directory is set. Under UNIX-like systems, this is usually the directory where R has been launched. Under Windows, this is the directory where the executable is located, or if R is started from a short-cut, the directory specified in the "Start-in" field of this short-cut (this can be modified by the user by editing the properties of the short-cut, usually by right-clicking on its icon; a standard installation under Windows puts a short-cut of R on the Desktop). On all systems, the working directory is displayed in R with the function getwd() (get working directory); it can be modified with setwd:

> setwd("/home/paradis/phylo/data") # Linux
> setwd("D:/data/phylo/data") # Windows

Note the use of the forward slashes (/), even under Windows, because the backslashes (\) have a specific meaning in character strings. If a file is not in the working directory, it can be accessed by adding the full path in the file argument, for instance, when reading a tree (see Section 3.2):

> tr <- read.tree("/home/paradis/phylo/data/treeb1.txt")

The same comment applies when writing into a file: the file is written in the current working directory unless a path is given in the file argument exactly in the same way as above.

Emacs and ESS need slightly more configuration if the user wants to run R within Emacs. This is essentially system dependent; the critical step is to tell Emacs where to find R's executable. ESS is distributed with several documentation files detailing the installation and configuration for the different operating systems.

2 First Steps in R for Phylogeneticists

It is clear that some experience with R greatly helps in handling the materials presented in this book. The goal of this chapter is to give the first steps for new users of R. It is focused on the topics required for the present book, and does not cover all introductory concepts and notions about R.

A generally deterring fact for new users of R is that it is almost impossible to figure out what to do if the user has no notion of languages, commands, or R itself. A learning step must be taken and this obviously has a cost. Progressing in the use of R involves successive learning steps.

Of course, there are benefits to taking these steps. R has spread through the field of computational statistics, and there is now a wide range of packages for many numerical, analytical, and graphical methods. The fields of application of R include analysis of DNA microarray data (http://www.bioconductor.org/), genetics (quantitative trait loci, population analyses, etc.), morphometrics, ecological analyses, drawing maps including the use of geographic information systems (GIS) data, and interacting with a variety of other programs such as SQL databases. Thus learning R for a specific task is very likely to be rewarding very rapidly.

If you do not know R, do not have knowledge of computer languages, and do not want to read introductory documents on R (or cannot), then you should read this chapter carefully. Introductory documents can be found at http://cran.r-project.org/manuals.html and http://cran.r-project.org/other-docs.html. If you already have an idea of computer programming but not R, reading this chapter should be easy and will point to the particularities of R.

2.1 The Command Line Interface

The user can interact with R in several ways. The most interactive way is to use the command line interface (CLI). R can also be run in batch mode (i.e., noninteractive) from a system shell. There are several graphical user interfaces (GUIs), but they are restricted to traditional statistical methods (see the Rcmdr package), and so do not cover the wide range of methods available in R. Finally, there exist several Web servers to run R through the Internet. In this book, we concentrate on the CLI because it is interactive, versatile, and portable (i.e., the commands will run on all operating systems).

All actions are done on data stored in the active memory of the computer. These data are stored as objects. To characterize some data, and thus analyze them relevantly, it is often necessary to have additional information. For instance, consider a numeric variable taking the values 0 or 1: is it a count (i.e., a quantitative variable) or a code for a qualitative variable? In R the required information is provided by the attributes of the objects. We show some examples in the next section.

Commands in R are made of functions and/or operators (+, -, *, etc). A command returns an object that is either displayed on the screen (and not stored in memory), or stored in memory using the operator "assign" <-. The latter requires giving a name to the object. An object may be displayed by typing its name as a command:

> 2 + 7
[1] 9
> x <- 2 + 7
> x
[1] 9

R has a wide range of functions and operators to create regular and random sequences. There are also several functions to read data from files on the disk: the most useful for us are illustrated in Section 3.6.

The user does not see her data as in a spreadsheet editor because many objects with different structure can be stored and manipulated at the same time, and this cannot be represented as a spreadsheet. There are, of course, several functions to manage the objects in memory. ls displays a simple list of the objects currently in memory.

> ls()
character(0)
> n <- 5
> ls()
[1] "n"
> x <- "acgt"
> ls()
[1] "n" "x"

As we have seen above, typing the name of an object as a command displays its content. In order to display some attributes of the object, one can use the function str (structure):

> str(n)
 num 5
> str(x)
 chr "acgt"

This shows that n is a numeric object, and x is a character one. Both ls and str can be combined by using the function ls.str:

> ls.str()
n : num 5
x : chr "acgt"

To delete an object in memory, the function rm must be used:

> ls()
[1] "n" "x"
> rm(n)
> ls()
[1] "x"

One function and one operator are good to learn very early because they are used very often in R: c concatenates several elements to produce a single one, and : returns a regular series where two successive elements differ by one. Here are some examples:

> x <- c(2, 6.6, 9.6)
> x
[1] 2.0 6.6 9.6
> y <- 2.2:5.2
> y
[1] 2.2 3.2 4.2 5.2
> c(x, y)
[1] 2.0 6.6 9.6 2.2 3.2 4.2 5.2
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10

2.2 The Data Structures

We show here how data are stored in R, and how to manipulate them.

2.2.1 Vector

Vectors are the basic data structures in R. A vector is a series of elements that are all of the same type. A vector has two attributes: the mode, which characterizes the type of data, and the length, which is the number of elements. There are four modes: numeric, logical (TRUE or FALSE), character, and complex. The last mode is seldom used and is not discussed here.

When a vector is created or modified, there is no need to specify its mode and length: this is dealt with by R. It is possible to check these attributes with the functions of the same names:

> x <- 1:5
> mode(x)
[1] "numeric"
> length(x)
[1] 5

Logical vectors are created by typing "FALSE" or "TRUE":

> y <- c(FALSE, TRUE)
> y
[1] FALSE TRUE
> mode(y)
[1] "logical"
> length(y)
[1] 2

In most cases, a logical vector results from a logical operation, such as the comparison of two values or two objects:

> 1 > 0
[1] TRUE

A vector of mode character is a series of character strings (and not of single characters):

> z <- c("order", "family", "genus", "species")
> mode(z)
[1] "character"
> length(z)
[1] 4
> z
[1] "order" "family" "genus" "species"

We have just seen how to create vectors by typing them on the CLI, but it is clear that in the vast majority of cases they will be created by reading data from files.

As already mentioned, a function returns an object that is itself characterized by its mode. From the examples just above, it can be seen that mode returns a vector of mode character, whereas length returns one of mode numeric. The same applies to the functions introduced above, and in particular ls which returns a vector of mode character.

R has a powerful and flexible mechanism to manipulate vectors (and other objects as well): the indexing system. There are three kinds of indexing: numeric, logical, and with names.

The numeric indexing works by giving the indices of the elements that must be selected. Of course, this can be given as a numeric vector:

> z[1:2]
[1] "order" "family"
> i <- c(1, 3)
> z[i]
[1] "order" "genus"

This can be used to repeat a given element:

> z[c(1, 1, 1)]
[1] "order" "order" "order"
> z[c(1, 1, 1, 4)]
[1] "order" "order" "order" "species"

If the indices are negative, then the corresponding values are removed:

> z[-1]
[1] "family" "genus" "species"
> j <- -c(1, 4)
> z[j]
[1] "family" "genus"

Positive and negative indices cannot be mixed. If a positive index is out of range, then a missing value (NA, for not available) is returned, but if the index is negative, an error occurs:

> z[5]
[1] NA
> z[-5]
Error: subscript out of bounds

The indices may be used to extract some data, but also to change them:

> x[c(1, 4)] <- 10
> x
[1] 10 2 3 10 5

The logical indexing works differently than the numeric one. Logical values are given as indices: the elements with an index TRUE are selected, and those with FALSE are removed. If the number of logical indices is shorter than the vector, then the indices are repeated as many times as necessary; for instance, the two commands below are strictly equivalent:

> z[c(TRUE, FALSE)]
[1] "order" "genus"
> z[c(TRUE, FALSE, TRUE, FALSE)]
[1] "order" "genus"
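
Numeric and logical indexing combine naturally when working with sequence-like data. The following sketch is illustrative only: the vector dna and its contents are made up, and the operator %in% (a membership test) has not been introduced in the text.

```r
# A made-up vector of nucleotides (mode character).
dna <- c("a", "c", "g", "t", "g", "c")

# Numeric indexing: the first three bases, then all but the last one.
dna[1:3]
dna[-length(dna)]

# Logical indexing: is_gc is TRUE where the base is g or c.
is_gc <- dna %in% c("g", "c")
dna[is_gc]

# Logical values count as 0/1 in arithmetic, so the mean of the
# logical vector is the proportion of g and c in the sequence.
gc_content <- mean(is_gc)
gc_content
```

Keeping the logical vector in its own object (is_gc here) makes it reusable, both for selecting elements and for computing summaries.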
As with numeric indexing, the logical indices can be given as a logical vector. The logical indexing is a powerful and simple way to select some data from a vector: for instance, if we want to select the values greater than or equal to five in x:

> x[x >= 5]
[1] 10 10  5

The indexing system with names leads us to introduce a new concept: a vector may have an attribute called names that is a vector of mode character of the same length, and serves as labels. It is created or extracted with the function names. An example could be:

> x <- 4:1
> names(x) <- z
> x
  order  family   genus species
      4       3       2       1
> names(x)
[1] "order"   "family"  "genus"   "species"

These names can then be used to select some elements of a vector:

> x[c("order", "genus")]
order genus
    4     2

In some situations it is useful to delete the names of a vector; this is done by giving them the value NULL:

> names(x) <- NULL
> x
[1] 4 3 2 1

2.2.2 Factor

A factor is a data structure derived from a vector, but it is not the same strictly speaking. It can be of mode numeric or character, and has an attribute "levels" which is a vector of mode character and specifies the possible values the factor can take. If a factor is created with the function factor, then the levels are defined with all values present in the data:

> f <- c("Male", "Male", "Male")
> f
[1] "Male" "Male" "Male"
> f <- factor(f)
> f
[1] Male Male Male
Levels: Male
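To see what distinguishes a factor from the character vector it was built from, one can inspect its "levels" attribute and its numerical coding. This is a minimal sketch using the base R functions levels, as.integer, and is.factor; the object name codes is chosen here for illustration only.

```r
# The factor from the example above
f <- factor(c("Male", "Male", "Male"))

# The "levels" attribute: a character vector of the possible values
levels(f)

# The numerical coding: each element is stored as the index of its level
codes <- as.integer(f)
codes

# A factor is no longer a plain character vector
is.factor(f)
is.character(f)
```

The numerical coding is what R uses when a factor is given where numbers are expected, which is why the levels matter as much as the observed values.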
To specify that other levels exist although they are not observed in the present data, the option levels can be used:

> ff <- factor(f, levels = c("Male", "Female"))
> ff
[1] Male Male Male
Levels: Male Female

This is a crucial point when analyzing this kind of data, for instance, if we compute the frequencies in each category with the function table:

> table(f)
f
Male
   3
> table(ff)
ff
  Male Female
     3      0

Factors can be indexed and have names exactly in the same way as vectors. When data are read from a file on the disk with the function read.table, the default, in versions of R before 4.0.0, is to treat all character strings as factors (see Section 3.6 for examples). This can be avoided by using the option as.is = TRUE.

2.2.3 Matrix

A matrix can be seen as a vector arranged in a tabular way. It is actually a vector with an additional attribute called dim (dimensions) which is itself a numeric vector with length 2, and defines the numbers of rows and columns of the matrix.

There are two basic ways to create a matrix: either by using the function matrix with the appropriate options nrow and ncol, or by setting the attribute dim of a vector:

> matrix(1:9, 3, 3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> x <- 1:9
> dim(x) <- c(3, 3)
> x
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
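That a matrix is a vector carrying a dim attribute can be checked directly. The sketch below uses only base R; the names m and v are introduced here for illustration and do not come from the text.

```r
# Build the same 3 x 3 matrix by setting the dim attribute
m <- 1:9
dim(m) <- c(3, 3)

is.matrix(m)      # TRUE: the dim attribute is what makes it a matrix
attributes(m)     # the only attribute is dim
length(m)         # still 9, as for the underlying vector

# A single index treats the matrix as a vector filled column by column
m[8]

# Removing the dim attribute gives the plain vector back
v <- m
dim(v) <- NULL
is.matrix(v)      # FALSE
```

This also explains why the values 1 to 9 run down the columns in the printed matrices above: the elements are stored in vector order and laid out column by column.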
The numeric and logical indexing systems work in exactly the same way as for vectors. Because a matrix has two dimensions, it can be indexed with two integers separated by a comma:

> x[3, 2]
[1] 6

If one wants to extract only a row or a column, then the appropriate index must be omitted (without forgetting the comma):

> x[3, ] # extract the 3rd row
[1] 3 6 9
> x[, 2] # extract the 2nd column
[1] 4 5 6

In contrast to vectors, a subscript out of range results in an error. Matrices do not have names in the same way as vectors, but have rownames, colnames, or both:

> rownames(x) <- c("A", "B", "C")
> colnames(x) <- c("v1", "v2", "v3")
> x
  v1 v2 v3
A  1  4  7
B  2  5  8
C  3  6  9

Selection of rows and/or columns follows in nearly the same ways as seen before:

> x[, "v1"]
A B C
1 2 3
> x["A", ]
v1 v2 v3
 1  4  7
> x[c("A", "C"), ]
  v1 v2 v3
A  1  4  7
C  3  6  9

2.2.4 Data Frame

A data frame is superficially similar to a matrix in the sense that it is a tabular representation of data. The distinction is that a data frame is a set of distinct vectors and/or factors all of the same length, but possibly of different modes. Data frames are the main way to represent data sets in R because this corresponds roughly to a spreadsheet data structure. This is the type of objects
returned by the function read.table (see Section 3.6 for examples). The other way to create data frames is with the function data.frame:

> DF <- data.frame(z, y = 0:3, 4:1)
> DF
        z y X4.1
1   order 0    4
2  family 1    3
3   genus 2    2
4 species 3    1
> rownames(DF)
[1] "1" "2" "3" "4"
> colnames(DF)
[1] "z"    "y"    "X4.1"

This example shows how colnames are created in different cases. By default, the rownames "1", "2", ... are given, but this can be changed with the option row.names of data.frame, or modified subsequently as seen above for matrices.

If one of the vectors is shorter, then it is recycled along the data frame but this must be an integer number of times:

> data.frame(1:4, 9:10)
  X1.4 X9.10
1    1     9
2    2    10
3    3     9
4    4    10
> data.frame(1:4, 9:11)
Error in data.frame(1:4, 9:11) : arguments imply differing number of rows: 4, 3

All we have seen about indexing, colnames, and rownames for matrices applies in exactly the same way to data frames, with the difference that colnames and rownames are mandatory for data frames. An additional feature of data frames is the possibility of extracting a column selectively with the operator $:

> DF$y
[1] 0 1 2 3

2.2.5 List

Lists are the most general data structure in R: they can contain any kind of objects, even lists. They can be seen as vectors where the elements can be any kind of object. They are built with the function list:
> L <- list(z = z, 1:2, DF)
> L
$z
[1] "order"   "family"  "genus"   "species"

[[2]]
[1] 1 2

[[3]]
        z y X4.1
1   order 0    4
2  family 1    3
3   genus 2    2
4 species 3    1

> length(L)
[1] 3
> names(L)
[1] "z" ""  ""

Most of the concepts we have seen on indexing vectors apply also to lists. Additionally, an element of a list may be extracted either with its index within double square brackets, or with the operator $:

> L[[1]]
[1] "order"   "family"  "genus"   "species"
> L$z
[1] "order"   "family"  "genus"   "species"

2.3 The Help System

Every function in R is documented through a system of help pages available in different formats:

• Simple text that can be displayed from the CLI;
• HTML that can be browsed with a Web browser (with hyperlinks between pages where available);
• PDF that constitutes the manual of the package.

The contents of these different documents are the same. Through the CLI a help page may be displayed with the function help or the operator ? (the latter does not work with special characters such as the operators):

help("ls")
?ls

By default, help only searches in the packages already loaded in memory. The option try.all.packages = TRUE allows us to search in all packages installed on the computer.

If one does not know the name of the function that is needed, a search with keywords is possible with the function help.search. This looks for a specified topic, given as a character string, in the help pages of all installed packages. For instance:

help.search("tree")

will display a list of the functions whose help pages mention “tree”. If some packages have been recently installed, it may be necessary to refresh the database used by help.search using the option rebuild = TRUE.

Another way to look for a function is to browse the help pages in HTML format. This can be launched from R with the command:

help.start()

This loads in the local Web browser a page with links to all the documentation installed on the computer, including general documents on R, an FAQ, links to Internet resources, and the list of the installed packages. This list has itself links to the list of all functions with their help pages.

2.4 Creating Graphics

The graphical functions in R need a special mention because they work somewhat differently from the others. A graphical function does not return an object (though there are a few exceptions), but sends its results
to a graphical device which is either a graphical window (by default) or a graphical file. The graphical formats depend on the operating systems, but mostly the following are available: encapsulated PostScript (EPS), PDF, JPEG, PNG, and bitmap (BMP). Additionally, xfig is possible under Linux, and EMF under Windows.

There are two ways to write graphics into a file. The most general and flexible way is to open the appropriate device explicitly, for instance, if we write into an EPS file:

postscript("plot.eps")

then all subsequent graphical commands will be written in the file ‘plot.eps’. The operation is terminated (i.e., the file is closed and written on the disk) with the command:

dev.off()

The function postscript has many options to set the EPS files. All the figures of this book have been produced with this function.

The second way is to copy the content of the window device into a file using the function dev.copy where the user must specify the target device. Two variants of this function are dev.print which prints into an EPS file, and dev.copy2eps which does the same by setting the page in portrait format.

2.5 Saving and Restoring R Data

R uses two basic formats to save data: ASCII (simple text) and XDR (external data representation; see footnote 3). They are both cross-platform. The ASCII format is used to save a single object (vector, matrix, or data frame) into a file. Two functions can be used: write (for vectors and matrices) and write.table (for data frames).

The XDR format can be used to save any kind and any number of objects. It is used with the function save, for instance, to save three objects:

save(x, y, z, file = "xyz.RData")

These data can then be restored with:

load("xyz.RData")

Footnote 3: http://www.faqs.org/rfcs/rfc1832.html.

2.6 Using R Functions

Now that we have seen a few instances of R function uses, we can draw some general conclusions on this point. To execute a function, the parentheses are always needed, even if there is no argument inside (typing the name of a function without parentheses prints its contents). The arguments are separated with commas. There are two ways to specify arguments to a function: by their positions or by their names (also called tagged arguments). For example, let us consider a hypothetical function with three arguments:

fcn(arg1, arg2, arg3)

fcn can be executed without using the names arg1, ..., if the corresponding objects are placed in the correct position, for instance, fcn(x, y, z). However, the position has no importance if the names of the arguments are used, for example, fcn(arg3 = z, arg2 = y, arg1 = x). Another feature of R's functions is the possibility of using default values (also called options), for instance, a function defined as:

fcn(arg1, arg2 = 5, arg3 = FALSE)

Both commands fcn(x) and fcn(x, 5, FALSE) will have exactly the same result. Of course, tagged arguments can be used to change only some options (e.g., fcn(x, arg3 = TRUE)).

Many functions in R act differently with respect to the type of object given as arguments: these are called generic functions. They act with respect to an optional object attribute: the class. The main generic functions in R are print, summary, and plot. In R's terminology, summary is a generic function, whereas the functions that are effectively used (e.g., summary.phylo, summary.default, summary.lm, etc.) are called methods.

In practice, the use of classes and generics is implicit, but we show in the next chapter that different ways to code a tree in R correspond to different classes. The advantage of the generic functions here is that the same command is used for the different classes (e.g., plot(tr) to draw a tree).

2.7 Repeating Commands

When it comes to repeating some analyses, several strategies can be used. The simplest way is to write the required commands in a file, and read them in R with the function source. It is usual to name such files with the extension ‘.R’. For instance, the file ‘mytreeplot.R’ could be:

tree1 <- read.tree("tree1.tre")
postscript("tree1.eps")
plot(tree1)
dev.off()

These commands will be executed by typing source("mytreeplot.R") in R.

2.7.1 Loops

As with any language, R has control and programming structures to execute a series of commands. The most commonly used one is the for statement (see footnote 4), whose general syntax is:

for (x in y)

where y is an object, and x successively takes the different values of y. It is not required to use these values inside the loop (e.g., for (i in 1:5) print("done")). A for loop may encompass more than one command, in which case it is necessary to group them within braces:

for (x in y) {
    ....
    ....
}

Footnote 4: The following words are reserved to the R language and cannot be used to name objects: for, in, if, else, while, next, break, repeat, function, NULL, NA, NaN, Inf, TRUE, and FALSE.

y may be a vector of any mode, a factor (in which case the numerical coding will be used), a matrix (treated as a vector), a data frame (x will be substituted by the different columns of y), or a list (x will be substituted by the different elements of y).

Two commands may be useful here: next stops the current iteration and moves to the next value of x, and break aborts the loop. They are usually combined with an if statement which takes a single logical value as argument, for example:

for (i in 1:10) {
    if (x[i] < 0) break
    ....
}

2.7.2 Apply-Like Functions

In many situations, there is an easier and more efficient alternative to the use of loops and control statements: the apply-like functions.

apply applies a function to all columns and/or rows of a matrix or a data frame. Its syntax is:

apply(X, MARGIN, FUN, ...)

where X is a matrix or a data frame; the second argument indicates whether to apply the function on the rows (1), the columns (2), or both (c(1, 2)); FUN is the function to be used; and ... any argument that may be needed for FUN.

lapply does the same as apply but on different elements of a list. Its syntax is:

lapply(x, FUN, ...)

This function returns a list.

sapply has nearly the same action as lapply but it returns its results in a friendlier form, as a vector or a matrix with rownames and colnames.

tapply acts on a vector and applies a function on subsets defined by an additional argument INDEX:

tapply(X, INDEX, FUN = NULL, ...)

Typically, INDEX defines groups, and the function FUN is applied to each group. By default, the indices of the groups defined by INDEX are returned.

Finally, replicate replicates a command a given number of times, returning the results as a vector, a matrix, or a list; for example,
> replicate(5, rnorm(1))
[1] -1.424699824  0.695066367  0.958153028  0.002594864
[5] -0.879007194
> replicate(4, rnorm(3))
           [,1]       [,2]       [,3]        [,4]
[1,]  0.7743082 -0.7689951 -0.4332675  1.58177859
[2,] -0.7495421 -0.5846179 -1.0581448  0.03818309
[3,]  0.1632760  0.8818927  0.6218508 -1.37648467

2.8 Exercises

1. Start R and print the current working directory. Suppose you want to read some data in three different files located in three different directories on your computer: find two ways to do this.

2. Create a matrix with three columns and 1000 rows where each column contains a random variable that follows a Poisson distribution with rates 1, 5, and 10, respectively (see ?Poisson for how to generate random Poisson values). Find two ways to compute the means of each column of this matrix.

3. Create a vector of 10 random normal values using the three following methods.
   (a) Create and concatenate successively the 10 random values with c.
   (b) Create a numeric vector of length 10 and change its values successively.
   (c) Use the most direct method.
   Compare the timings of these three methods (see ?system.time) and explain the differences. Repeat this exercise with 10,000 values.

4. Create the following file:

   Mus_musculus 10
   Homo_sapiens 70000
   Balaenoptera_musculus 120000000

   (a) Read this file with read.table using the default options. Look at the structure of the data frame and explain what happened. What option should have been used?
   (b) From this file, create a data structure with the numeric values that you could then index with the species names, for example,

       > x["Mus_musculus"]
       [1] 10

       Find two ways to do this, and explain the differences in the final result.

5. Create these two vectors (source: [5]):

   Archaea <- c("Crenarchaea", "Euryarchaea")
   Bacteria <- c("Cyanobacteria", "Spirochaetes", "Acidobacteria")

   (a) Create a list named TreeOfLife so that we can do TreeOfLife$Archaea to print the corresponding group.
   (b) Update TreeOfLife by adding the following vector:

       Eukaryotes <- c("Alveolates", "Cercozoa", "Plants", "Opisthokonts")

       It should appear at the same level as Archaea and Bacteria.
   (c) Update Archaea by adding "Actinobacteria".
   (d) Print all the lowest-level taxa.
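As a supplement to Section 2.7, the sketch below sets an explicit for loop beside the corresponding apply-like calls on a small matrix. It is only an illustration in base R; the object names (m, col_max_loop, col_max_apply, row_sum, squares, grp) are chosen here and do not come from the text.

```r
# A small matrix with 2 rows and 3 columns, filled column by column
m <- matrix(1:6, nrow = 2)

# Column maxima with an explicit loop
col_max_loop <- numeric(ncol(m))
for (j in 1:ncol(m)) {
    col_max_loop[j] <- max(m[, j])
}

# The same result with apply: MARGIN = 2 means "over the columns"
col_max_apply <- apply(m, 2, max)

# MARGIN = 1 applies the function over the rows
row_sum <- apply(m, 1, sum)

# sapply returns a vector when each call returns a single value
squares <- sapply(1:3, function(i) i^2)

# tapply applies the function within the groups defined by INDEX
grp <- tapply(c(1, 2, 3, 4), c("a", "a", "b", "b"), sum)
```

The apply-like versions avoid creating and filling the result vector by hand, which is usually where loop code becomes error-prone.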
3 Phylogenetic Data in R

This chapter details how phylogenetic data are handled in R. The issues discussed here will interest all users. Issues relative to implementation and programming are discussed in Chapter 7.

3.1 Phylogenetic Data as R Objects

One strength of R is the flexibility of its data structures. In most phylogenetic programs, the data structures are completely opaque to the user. This is because complex data structures in low-level languages (such as C or C++) need a lot of programming work. This is not the case in R where the list data structure provides an efficient and flexible way to build complex data structures using any kind of element. For a tree coded with a list, the critical advantage is that the user can easily access its components, and manipulate or analyze them with R's functions and operators.

As a simple example, consider a tree read in R with ape: this will be stored in R as an object of class "phylo". If this object is named tr, then its branch lengths will be accessed simply with tr$edge.length. Any subsequent analysis can be conducted with the usual R functions; as illustrations, the following commands will compute the mean, some summary statistics, plot a frequency histogram, and finally copy these branch lengths into an object named x.

mean(tr$edge.length)
summary(tr$edge.length)
hist(tr$edge.length)
x <- tr$edge.length

Trees can be coded in different ways in R, which reflects the choices of the authors who designed these different classes. The class of an object is the attribute that marks its particularities. Some functions treat objects differently with respect to their class (Section 2.6).
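The point about accessing components can be tried without reading any tree file. In the sketch below, tr is a hand-made list that merely stands in for a tree: it is an assumption made for illustration, not a valid "phylo" object, and its branch lengths and tip labels are invented.

```r
# A simplified stand-in for a tree stored as a list (not a real "phylo" object)
tr <- list(
    edge.length = c(1, 1, 2, 0.5),   # invented branch lengths
    tip.label = c("a", "b", "c")     # invented tip labels
)

# The components are reached with $, then handled like any numeric vector
mean(tr$edge.length)
summary(tr$edge.length)

# Copy the branch lengths and select the long branches by logical indexing
x <- tr$edge.length
x[x > 1]
```

With a tree actually read by ape, the same commands would apply unchanged to its edge.length component.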
ape uses a class called "phylo" to describe phylogenetic trees. The principle of its design is to store in different elements a description of its hierarchical structure, the names of the taxa, the branch lengths, and other information that may be necessary. The structure of an object of class "phylo" is detailed below. ape has another class called "matching", but its use is restricted to a few situations; it is described below.

ade4 uses another class called "phylog". It has the same goal as "phylo" but its design is radically different. An object of class "phylog" stores more information than the object of class "phylo" representing the same phylogeny, and thus it requires more memory. The class "phylog" is outlined below.

apTreeshape uses a class called "treeshape" that codes dichotomous trees with no branch lengths.

We may recall that in R, all actions are done on objects stored in the active memory of the computer. Consequently, the classes described above are not designed to be new tree file formats, but rather to handle and analyze phylogenetic data efficiently.

The package stats has two classes worth mentioning here: "hclust" and "dendrogram". These classes are designed to code hierarchical clusters, and thus contain less information than the two classes described above (they may be appropriate to code ultrametric trees). However, because objects of class "hclust" and "dendrogram" are produced by clustering analyses in R, it may be useful to convert them into objects of class "phylo"; some functions that do this are shown in Section 3.4.

3.1.1 The Class "phylo" (ape)

An object of class "phylo" is a list with the following components.

edge  a two-column matrix where each row represents a branch (or edge) of the tree; the nodes and the tips are symbolized with numbers; the nodes are represented with negative numbers (the root being "-1"), and the tips are represented with positive numbers. For each row, the first column gives the ancestor. This representation allows an easy manipulation of the tree.
edge.length  (optional) a numeric vector giving the lengths of the branches given by edge.
tip.label  a vector of mode character giving the names of the tips; the order of the names in this vector corresponds to the (positive) numbers in edge.
node.label  (optional) a vector of mode character giving the names of the nodes.
root.edge  (optional) a numeric value giving the length of the branch at the root if it exists.

A class in R can be easily extended to include other elements, provided the names already defined are not reused. For instance, a "phylo" object could include a numeric vector tip.date giving the dates of the tips if they are not all contemporary (e.g., for viruses); this will not change the way other elements are accessed or modified. Another potential extension is to code networks or reticulograms because this would require simply adding the appropriate rows in the matrix edge.

The class "phylo" is a somewhat minimalist representation of a phylogenetic tree. Other information that may be needed in some analyses (e.g., branching times, number of descendants for each node, etc.) must be computed by the functions that need them.

3.1.2 The Class "phylog" (ade4)

The class "phylog" takes a different approach from the "phylo" one: in addition to the basic structure of the tree, other information is stored. This has the advantage that some computations are faster, but the overhead is that more memory is needed to store a "phylog" object than a "phylo" one.

A "phylog" object is a list with 20 elements. The structure of the tree is stored in three of them: tre is a character string representing the tree in Newick format without the branch lengths, leaves is a named numeric vector where the values are the terminal branch lengths and the names are the tip labels, and nodes is similar to leaves but for the internal branches. The 17 other elements store various information which is needed by some functions in ade4 (the details may be found on the help pages of this package, e.g., ?phylog).

3.1.3 The Class "matching" (ape)

Matchings have been introduced by Diaconis and Holmes [25] as a representation of binary phylogenetic trees. The idea is to assign to each tip and node a positive number, and then to represent the topology as a series of pairs of these numbers that are siblings (the matchings). Interestingly, if some conventions are given, this results in a unique correspondence between a given tree and a given matching [25].

An object of class "matching" is a list with the following components.

matching  a three-column numeric matrix where the first two columns represent the sibling pairs (the matching), and the third one the corresponding ancestor.
edge.length  (optional) a numeric vector representing the branch lengths where the ith element is the length of the branch below the element numbered i in matching.
tip.label  (optional) a character vector giving the tip labels where the ith element is the label of the tip numbered i in matching.
node.label  (optional) a character vector giving the node labels in the same order as in matching (i.e., the ith element is the label of the node numbered i + n in matching, with n the number of tips).

An object of class "matching" is not a matching in Diaconis and Holmes's [25] sense because it includes extra information. The latter can be printed from the former, say x, with x$matching[, 1:2]. The class "matching" is used essentially in the estimation of phylogenies because this is an efficient representation for binary trees (Chapter 5).

3.1.4 The Class "treeshape" (apTreeshape)

The class "treeshape" is derived from the "hclust" one. An object of this class is a list with two elements:

merge  a two-column numeric matrix where each row represents a pairing: a negative number represents a tip, and a positive number represents a group of tips as identified by the line number of this matrix. For instance, a row with (-8, 1) means that the eighth tip is paired with the group of tips defined by the first row of this matrix.
names  (optional) a vector of mode character giving the names of the tips.

An object of class "treeshape" can be built with the function treeshape which takes as arguments these two elements.

3.2 Reading Phylogenetic Data

3.2.1 Phylogenies

Treelike data structures are very common in computer science, and there are many ways to store them in files. Fortunately, biologists, systematists, and phylogeneticists seem to agree on the use of a single data format for trees: the nested parentheses format, known as the Newick or New Hampshire format (see footnote 1). This format has many advantages: it is flexible, can be interpreted directly by humans (if not too long), has a close link with the hierarchical nature of evolutionary relationships, and can store large trees using little space on a computer disk. A common extension of the Newick format is the NEXUS format which can also include other data (usually matrices of species characters), and system commands such as calls to other programs [95].

ape has two functions to
read trees in Newick and NEXUS formats: read.tree and read.nexus. Both functions have a file name (given as a character string or a variable of mode character) as main argument:

tr <- read.tree("treefile.tre")
trx <- read.nexus("treefile.nex")

Footnote 1: http://evolution.genetics.washington.edu/phylip/newicktree.html.

These functions ignore all white spaces and new lines in the tree file. The file may contain several trees that are all read: the returned object is then of class c("multi.tree", "phylo"), and is a list of objects of class "phylo". If no file name is given, read.tree reads the tree in Newick format from the standard input, so that the user can type the parenthetic tree directly on the keyboard (the input is terminated by a blank line). For instance, if we just type tr <- read.tree(), R then prompts the user to enter the tree (this can be copied/pasted from a text file). Each line of text is numbered 1:, 2:, and so on.

> tr <- read.tree()
1: (a:1,b:1);
2:
> ls()
[1] "tr"

Alternatively, it is possible to store the Newick tree in a variable of mode character and then use the option text:

> a <- "(a:1,b:1);"
> tr <- read.tree(text = a)

Both read.tree and read.nexus create an object of class "phylo". Additionally, read.nexus keeps track of the original file in an attribute named origin.

ade4 has the function newick2phylog that creates an object of class "phylog" from a Newick tree stored in a character variable.

> b <- "((a:1,b:1):1,c:2);"
> trg <- newick2phylog(b)

The Newick tree can be read in a file using the function scan with the appropriate option.

> trh <- newick2phylog(scan("treefile.tre", what = ""))

Note that newick2phylog cannot read starlike trees, whereas read.tree cannot read trees only specified as a “skeleton” made of parentheses and commas. newick2phylog does accept such a skeleton:

> trg <- newick2phylog("((((,,),,(,)),),(,));")

In that case, however, arbitrary values of one are given to the branch lengths, and the tips are labeled “Ext1”, “Ext2”, ... (see footnote 2). Both functions can read a tree with no branch lengths such as "((a,b),c);".

Footnote 2: newick2phylog also gives arbitrary labels to the nodes if they are not in the Newick tree: “I1”, “I2”, and so on.
3.2.2 Reading Internet Tree Databases

apTreeshape can read trees from the PANDIT (footnote 3) and TreeBASE (footnote 4) Internet databases with the functions pandit and treebase, respectively. These functions require knowledge of the number of the tree in its respective database: the trees are then returned in R as objects of class "phylo" (the default) or "treeshape" if the option class = "treeshape" is used. These two functions are thus useful for reading trees for further analyses in R. As a simple example, we can read the second tree in Pandit, and plot it directly (Fig. 3.1):

plot(pandit(2), font = 1)

Fig. 3.1. The tree #2 in the Pandit database

Footnote 3: http://www.ebi.ac.uk/goldman-srv/pandit/.
Footnote 4: http://www.treebase.org/.

3.2.3 Molecular Sequences

DNA sequences can be read with the ape function read.dna which reads files in FASTA, interleaved, or sequential format (these formats are described in the help page of read.dna). The function read.GenBank can read sequences
in the GenBank databases via the Internet: its main argument is a vector of mode character giving the accession numbers of the nucleotide sequences. These accession numbers are used, by default, as names for the individual sequences. If the option species.names = TRUE is used, which is the default, then the species names (as read in the “ORGANISM” field in the GenBank data) are returned in an attribute called "species". These two functions return a list of vectors of single characters giving the nucleotide at each position of the sequence; thus, for instance, the twentieth nucleotide of the second sequence will be accessed with x[[2]][20]. All tricks of R's indexing systems (Section 2.2.1) can be used here.

seqinr has more flexibility than ape for reading molecular sequences (proteins and DNA). Two functions can read sequences stored in local files.

read.fasta reads sequences in FASTA format. It has two arguments: File to specify the name of the data file, and seqtype to specify the type of the sequence which is either "DNA" (the default) or "AA" (for proteins). As with read.dna, read.fasta returns a list of sequences but there are a few additional attributes including a class "SeqFastadna" or "SeqFastaAA" depending on the type of the sequence.

read.alignment reads aligned sequences. There are two arguments: File and format which can be "mase", "clustal", "phylip", "fasta", or "msf". If format = "phylip", the function detects whether the format is sequential or interleaved. The sequences are stored in a different way from read.fasta and read.dna: each sequence is stored as a single character string, whereas for the latter each sequence is a vector of strings made of single characters (each being a position in the sequence). The data returned by read.alignment are of class "alignment".

seqinr has an elaborate mechanism for retrieving sequences from molecular data banks. This works through the ACNUC repository (footnote 5). The data banks available are listed in R with the function choosebank used without argument (this works only if the computer is connected to the Internet):

> choosebank()
 [1] "genbank"    "embl"       "emblwgs"    "swissprot"
 [5] "ensembl"    "emglib"     "nrsub"      "nbrf"
 [9] "hobacnucl"  "hobacprot"  "hovernucl"  "hoverprot"
[13] "hogennucl"  "hogenprot"  "hoverclnu"  "hoverclpr"
[17] "HAMAPnucl"  "HAMAPprot"  "hoppsigen"  "nurebnucl"
[21] "nurebprot"  "taxobacgen" "greview"

These data banks are mirrored on the PBIL server in Lyon. The user selects one of these banks with the same function:

> s <- choosebank("genbank")

Footnote 5: http://pbil.univ-lyon1.fr.

It is then possible to query the bank for the available sequences. For instance, to get the list of the sequences of the bird genus Ramphocelus [59]:

> query(s$socket, "rampho", "sp=Ramphocelus@")
$socket:
                description                       class
"->pbil.univ-lyon1.fr:5558"                    "socket"
                       mode                        text
                       "a+"                      "text"
                     opened                    can read
                   "opened"                       "yes"
                  can write
                      "yes"

$banque: genbank

$call: query(socket = s$socket, listname = "rampho", query = "sp=Ramphocelus@")

$name:
[1] "rampho"

  list length      mode   content
1 $req     20 character sequences

This command returns an object named "rampho" which lists the sequences meeting the selection criteria (footnote 6). The special character "@" matches any string of characters (see ?query for the details of the syntax of this function). The result displayed by query shows that 20 sequences were found. "rampho" is a list with the accession numbers and the connection details (server name, port number, etc.) needed to retrieve the sequences effectively:

> rampho$req[[1]]
[1] "AF310048"
attr(,"class")
[1] "SeqAcnucWeb"
attr(,"socket")
                description                       class
"->pbil.univ-lyon1.fr:5558"                    "socket"
                       mode                        text
                       "a+"                      "text"
                     opened                    can read
                   "opened"                       "yes"
                  can write
                      "yes"

Footnote 6: The syntax is unusual in R where objects are usually created with the assignment operator <-.
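Before the sequences are extracted, it may help to see the two storage conventions described earlier in this section side by side: one character string per sequence, or one vector of single characters per sequence. The sketch below uses only base R functions (strsplit, paste, substr, nchar); the short sequence and the names s_string and s_vector are invented for illustration.

```r
# One sequence stored as a single character string
s_string <- "ggatcc"

# The same sequence stored as a vector of single characters
s_vector <- strsplit(s_string, "")[[1]]

# With a vector, a position is reached by ordinary indexing
s_vector[3]
length(s_vector)

# With a single string, substr and nchar play the same roles
substr(s_string, 3, 3)
nchar(s_string)

# Converting back from the vector to the single string
paste(s_vector, collapse = "")
```

The vector form makes R's indexing system directly usable on positions, which is the convention followed in the extraction example that follows.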
The sequences are then extracted from ACNUC with the generic function getSequence:

> x <- getSequence(rampho$req[[1]])
> length(x)
[1] 921
> x[1:20]
 [1] "g" "g" "a" "t" "c" "c" "t" "t" "a" "c" "t" "a" "g"
[14] "g" "c" "c" "t" "a" "t" "g"

As a comparison, the same sequence can be extracted with the ape function read.GenBank:

> y <- read.GenBank("AF310048")
> length(y[[1]])
[1] 921
> attr(y, "species")
[1] "Ramphocelus_carbo"
> identical(x, y[[1]])
[1] TRUE

3.3 Writing Data

We have seen that R works on data stored in the active memory of the computer. It is obviously necessary to be able to write data, for at least two reasons. The user may want at any time to save all the objects present in memory to prevent data loss from a computer crash, or in order to quit R and continue the analyses later. The other reason is that the user may want to analyze some data stored in R with other programs which in most cases need to read the data from files (unless there is a link between the software and R; see Chapter 7).

Any kind of data type in R can be saved in a binary file using the save function; the objects to be saved are simply listed as arguments separated by commas.

save(x, y, tr, file = "mydata.RData")

The “.RData” suffix is a convention and is associated with R on some operating systems (e.g., Windows). The binary files created this way are portable across platforms. The command save.image() (used without options) is a shortcut to save all objects in memory (the workspace in R's jargon) in a file called ‘.RData’. It is called by R when the user quits the system and chooses to save an image of the workspace.

ape has several functions that write trees and DNA sequences in formats suitable for other systems.

write.tree writes a tree in Newick format. It takes as main argument a "phylo" object. By default the Newick tree is returned as a character string, and thus can be used as a variable itself:
> tr <- read.tree(text = "(a:1,b:1);")
> write.tree(tr)
[1] "(a:1,b:1);"
> x <- write.tree(tr)
> x
[1] "(a:1,b:1);"

To save the tree in a file, one needs to use the option file:

> write.tree(tr, file = "treefile.tre")

The option append (FALSE by default) controls whether to delete any previous data in the file. For larger trees, the character string is split with line breaks. This behavior can be avoided with the option multi.line = FALSE.

One or several "phylo" objects can also be written in a NEXUS file using write.nexus. This function behaves similarly to write.tree in that it prints by default the tree on the console (but this cannot be reused as a variable).

> write.nexus(tr)
#NEXUS
[R-package APE, Mon Dec 20 11:18:23 2004]

BEGIN TAXA;
    DIMENSIONS NTAX = 2;
    TAXLABELS
        a
        b
    ;
END;
BEGIN TREES;
    TRANSLATE
        1 a,
        2 b
    ;
    TREE * UNTITLED = [&R] (1:1,2:1);
END;

The options of write.nexus are translate (default TRUE) which replaces the tip labels in the parenthetic representation with numbers, and original.data (default TRUE) to write the original data in the NEXUS file (in agreement with the NEXUS standard [95]).

If several trees are written, they must have the same tip labels, and must be given either as a series, or as a list:

> write.nexus(tr1, tr2, tr3, file = "treefile.nex")
> L <- list(tr1, tr2, tr3)
> write.nexus(L, file = "treefile.nex")
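Returning to save, introduced at the start of this section: a round trip through a binary file can be checked in a few lines. This is a sketch under stated assumptions: it writes to a temporary file created with tempfile rather than to a named file, and the objects x and z are small placeholders rather than phylogenetic data.

```r
# Two small objects to save (placeholders for real data)
x <- 1:5
z <- c("order", "family")

# Write them to a temporary binary file
f <- tempfile(fileext = ".RData")
save(x, z, file = f)

# Delete the objects from memory, then restore them from the file
rm(x, z)
load(f)
ls()

# Remove the temporary file
unlink(f)
```

load restores the objects under the names they had when saved, so it silently overwrites any object of the same name already in memory.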
DNA sequences are written into files with write.dna. Its option format can take the values "interleaved" (the default), "sequential", or "fasta". There are several options to customize the formatting of the output sequences (see ?write.dna for details).

3.4 Manipulating Data

Manipulating phylogenetic trees is difficult because of the complexity of such data structures. This may be one of the reasons why so few programs offer this possibility. Another reason may be that once a phylogeny has been obtained, sometimes after a long process of various analyses, the user is not willing to change it. There are good reasons for making such manipulations possible, though, for instance if some comparative analyses are to be done (see Section 6.1). It is also sometimes necessary to “arrange” a tree before plotting it, such as rotating a branch or dropping a tip. Other less trivial manipulations include extracting branch lengths, computing branching times or coalescent intervals, (un)rooting a tree, testing whether two trees are identical, and so on.

3.4.1 Basic Tree Manipulation

ape has several functions to manipulate "phylo" objects. They are listed below. In the examples, the trees are written as Newick strings for convenience; the results could also be visualized with plot instead of write.tree.

drop.tip removes one or several tips from a tree. The tips are specified either by their labels or their positions (indices) in the vector tip.label. By default, the terminal branches and the corresponding internal ones are removed. This has the effect of keeping the tree ultrametric in the case it was beforehand. This behavior can be altered by setting the option trim.internal = FALSE.

> tr <- read.tree(text = "((a:1,b:1):1,(c:1,d:1):1);")
> write.tree(drop.tip(tr, c("a", "b")))
[1] "(c:1,d:1);"
> write.tree(drop.tip(tr, 1:2)) # same as above
[1] "(c:1,d:1);"
> write.tree(drop.tip(tr, 1:2, trim.internal = FALSE))
[1] "(NA:1,(c:1,d:1):1);"

bind.tree is used to build a tree from two trees. The arguments are two "phylo" objects. By default, the second tree is bound on the root of the first one; a different node may be specified using the option node. If the second tree has a root.edge this will be used. Thus the binding of two binary (dichotomous) trees will result in a trichotomy or a tetrachotomy
(if there is no root edge) in the returned tree. This may be avoided by using the option branch instead of node. The syntax is nearly the same, the distinction being that the second tree is bound below the node given in branch. The further argument position specifies where on the branch the tree is to be bound.

> t1 <- read.tree(text = "(a:1,b:1):1;")
> t2 <- read.tree(text = "(c:1,d:1):1;")
> write.tree(bind.tree(t1, t2))
[1] "(a:1,b:1,(c:1,d:1):1):1;"
> write.tree(bind.tree(t1, t2, branch = -1, position = 1))
[1] "((a:1,b:1):1,(c:1,d:1):1):0;"

rotate rotates the internal branch below the most recent common ancestor of a monophyletic group given by the argument group. The resulting tree is exactly equivalent to the original one. This function is convenient when the order of the tips on a plot needs to be changed. On the other hand, the modification is not apparent when writing the tree in Newick format because the tips are written according to the numbers in the "phylo" object.

compute.brlen modifies or creates the branch lengths of a tree with respect to the second argument, method, which may be one of the following.

• A character string specifying the method to be used (e.g., "Grafen").
• An R function used to generate random branch lengths (e.g., runif).
• One or several numeric values (recycled if necessary).

For instance, if we want to set all branch lengths equal to one [48, 54]: tr <- compute.brlen(tr, 1). This is likely to be useful in comparative analyses when a phylogeny with no branch lengths is available.

3.4.2 Rooted Versus Unrooted Trees

The Newick parenthetic format can represent both rooted and unrooted trees. In the latter, all nodes have at least three connecting branches. Thus, in the Newick representation of an unrooted tree, it is necessary that the basal grouping has (at least) three sibling groups:

((...),(...),(...));

Such a tree read in R with read.tree would result in an object of class "phylo" whose root has three descendants. In this case, the root has no biological interpretation: it does not represent a common ancestor of all tips.

The function is.rooted tests whether an object of class "phylo" represents a rooted tree. It returns TRUE if either only two branches connect to the root, or if there is a root.edge element.

> ta <- read.tree(text = "(a,b,c);")
> tb <- read.tree(text = "(a,b,c):1;")
> tc <- read.tree(text = "((a,b),c);")
> is.rooted(ta)
[1] FALSE
> is.rooted(tb)
[1] TRUE
> is.rooted(tc)
[1] TRUE

The presence of a zero root.edge allows us to have a rooted tree with a basal trichotomy:

> td <- read.tree(text = "(a,b,c):0;")
> is.rooted(td)
[1] TRUE

Both objects ta and td are graphically similar; the difference between them is that the root of td can be interpreted biologically as a common ancestor of a, b, and c.

The function root reroots a tree given an outgroup, made of one or several tips and passed as the argument outgroup. If the tree is rooted, it is unrooted before being rerooted, so that if outgroup is already an outgroup, then the returned tree is not the same as the original one. The specified outgroup must be monophyletic, otherwise the operation fails and an error message is printed.

The function unroot transforms a rooted tree into its unrooted counterpart. If the tree is already unrooted, it is returned unchanged.

3.4.3 Dichotomous Versus Multichotomous Trees

The Newick format represents multichotomies by having more than two sibling groups:

(A,B,C);

This is represented explicitly in the class "phylo" by letting a node have several descendants in the element edge, for instance:

...
-2 1
-2 2
-2 3
...

where 1, 2, 3 would be the numbers of the tips A, B, C.

As shown in the next chapters, some methods deal only with dichotomous (i.e., binary) trees, thus it may be useful to resolve multichotomies into dichotomies with internal branches of length zero. On the other hand, when a dichotomous tree has internal branches of length zero it may be necessary to
collapse them into a multichotomy. These two operations may be performed with the functions multi2di and di2multi, respectively. They both take an object of class "phylo" as main argument; di2multi has a second argument tol that specifies the tolerance to consider branch lengths significantly greater than zero (10^-8 by default).

There are several ways to solve a multichotomy resulting in different topologies. The number of possibilities grows very fast with the number of branches, n, involved in the multichotomy: it is given by n!/2 (factorial(n)/2 in R). There are only three possibilities with n = 3, but there are 60 with n = 5, and 1,814,400 with n = 10. multi2di has a second argument, random, which specifies whether to solve the multichotomies in a random order (the default), or in a fixed, arbitrary order if random = FALSE. Repeating the use of multi2di on a tree with the default option will likely yield different topologies. Specifying random = FALSE may be preferred if the operation is repeated and it is necessary always to have the same topology.

apTreeshape has a different mechanism to solve multichotomies randomly. It is used when reading trees from databases (Section 3.2.2), or when converting trees of class "phylo" with multichotomies (Section 3.4.5). The functions pandit, treebase, and as.treeshape have the option model that can take the following values: "biased", "pda", or "yule". This specifies the model used to resolve the multichotomies. These models are explained in Section 3.5.

In a rooted dichotomous tree the number of internal nodes is equal to the number of tips minus one, whereas it is the number of tips minus two for an unrooted tree (because the root node has been removed). The function is.binary.tree tests whether a tree, either rooted or unrooted, is dichotomous, and returns a logical value.

3.4.4 Summarizing and Comparing Trees

There is a summary method for "phylo" objects. This function prints a brief summary of the tree including the numbers of nodes and tips.

is.ultrametric tests if a tree is ultrametric (all tips equally distant from the root), and returns a logical value. This is done taking the numerical precision of the computer into account.

balance returns, for a fully binary tree, the number of descendants of both sister-lineages from each node (see Section 6.3.5 for analyses of tree shape).

Once the branch lengths of a "phylo" object have been extracted as shown above, any computation can be done on them. There are special functions to perform some particular operations. branching.times returns, for an ultrametric tree, the distances from the nodes to the tips using its branch lengths. coalescent.intervals computes the coalescence times for an ultrametric tree and returns, in the form of a list, a summary of these computations with the number of lineages present at each interval, the lengths of the intervals, the total number of intervals, and the depth of the tree.

It is often necessary to compare two phylogenetic trees because there could be, for a given format, several representations of the same tree. This is the case with the Newick format, and also for the "phylo" class of objects. The generic function all.equal tests whether two objects are “approximately equal”. For instance, for simple numeric data the comparison is done considering the numerical precision of the computer. For "phylo" objects, only the labeled topologies are compared: if both representations are the same, TRUE is returned, otherwise a summary of the comparison is printed. Here is an example with a simple case of two representations for the same rooted tree:

> t1 <- read.tree(text = "((a:1,b:1):1,c:2);")
> t2 <- read.tree(text = "(c:2,(a:1,b:1):1);")
> all.equal(t1, t2)
[1] TRUE

If both trees have similar labeled topologies, their branch lengths can be compared with the same generic function:

> all.equal(t1$edge.length, t2$edge.length)
[1] "Mean relative difference: 0.6666667"
> all.equal(sort(t1$edge.length), sort(t2$edge.length))
[1] TRUE

Two objects of class "treeshape" can also be compared with all.equal.

Because all.equal does not always return a logical value, it should not be used in programming a conditional execution. The following should be used instead:

identical(all.equal(t1, t2), TRUE)

3.4.5 Converting Objects

We have seen that a tree may be coded in different ways in R that correspond to different classes of objects. It is obviously useful to be able to convert among these different classes because some operations can be done on some classes but not others. Table 3.1 gives details on how to convert among the six classes discussed here. The entries marked nd in this table indicate that the conversion cannot be done directly, and it must be done in two (or more) steps. For instance, to convert a "phylo" object into a "dendrogram" one, we will do:

as.dendrogram(as.hclust(x))

There is currently no way to convert a "dendrogram" object to another class. This class has been recently introduced in R and is still under development.

Table 3.1. Conversion among the different classes of tree objects in R (x is the object of the original class). nd means there is no direct way to do the conversion, and it must be done via another class

From        To           Command
phylo       phylog       newick2phylog(write.tree(x)) (a)
            matching     as.matching(x)
            treeshape    as.treeshape(x)
            hclust       as.hclust(x)
            dendrogram   nd
phylog      phylo        as.phylo(x)
            matching     nd
            treeshape    nd
            hclust       nd
            dendrogram   nd
matching    phylo        as.phylo(x)
            phylog       nd
            treeshape    nd
            hclust       nd
            dendrogram   nd
treeshape   phylo        as.phylo(x)
            phylog       nd
            matching     nd
            hclust       nd
            dendrogram   nd
hclust      phylo        as.phylo(x)
            phylog       hclust2phylog(x)
            matching     nd
            treeshape    nd
            dendrogram   as.dendrogram(x)

(a) It may be necessary to use the option multi.line = FALSE

3.4.6 Manipulating DNA Data

A DNA sequence read with read.dna is stored in R as a vector where each element is a single (lowercase) letter representing a nucleotide site. This allows an easy manipulation of DNA data with little programming overhead. Here are a few examples.

• Reads a sequence in FASTA format, stores it in x, and reverses it:

x <- read.dna("dnafile.fas", format = "fasta")
rev(x)

• Extracts the third codon positions (assuming that the reading frame of the sequence is correct):

x[seq(3, length(x), by = 3)]
• Creates a vector z of the same sequence but with nucleotides grouped by codon (assuming that the reading frame of the sequence is correct):

z <- character(length(x) %/% 3)
for (i in seq_along(z))
    z[i] <- paste(x[(3 * i - 2):(3 * i)], collapse = "")

A set of sequences can be stored as a list, a matrix, or a data frame. The last two kinds of structures are appropriate only for aligned sequences because all rows must have the same number of elements. To apply the above operations on such sets of sequences, one can use the functions apply or lapply. For instance, in the case of a matrix X, the following will reverse all rows (the reversed rows are returned as the columns of the result, which can be transposed with t if needed):

apply(X, 1, rev)

and for a list:

lapply(X, rev)

For more complex operations, one may first create a function that encloses all the needed commands, and then use the appropriate apply-like function:

foo <- function(x) {
    z <- character(length(x) %/% 3)
    for (i in seq_along(z))
        z[i] <- paste(x[(3 * i - 2):(3 * i)], collapse = "")
    z # needed to return the vector
}
lapply(X, foo)

seqinr has more sophisticated functions for manipulating molecular sequences. invers reverses a sequence in the same way as rev above. comp returns the complement of a DNA sequence:

> x <- scan(what = "")
1: a c g t g g t c a t
11:
Read 10 items
> x
 [1] "a" "c" "g" "t" "g" "g" "t" "c" "a" "t"
> comp(x)
 [1] "t" "g" "c" "a" "c" "c" "a" "g" "t" "a"

The functions c2s and s2c transform a vector of single characters into a string, and vice versa:
> c2s(x)
[1] "acgtggtcat"
> s2c(c2s(x))
 [1] "a" "c" "g" "t" "g" "g" "t" "c" "a" "t"

splitseq splits a sequence into portions with respect to two options: frame specifying how many sites to skip before starting to read the sequence (default is 0), and word giving the length of the portions (default is 3, i.e., a codon for a DNA sequence):

> splitseq(x)
[1] "acg" "tgg" "tca"
> splitseq(x, frame = 1)
[1] "cgt" "ggt" "cat"
> splitseq(x, word = 5)
[1] "acgtg" "gtcat"

translate translates a DNA sequence into an amino acid (AA) one. The option frame may be used as above. Two other options are sens, which can be "F" (forward, the default) or "R" (reverse) specifying the direction of the translation, and numcode which takes a numeric value specifying the genetic code to be used (by default the universal code is used):

> translate(x)
[1] "T" "W" "S"
> translate(x, frame = 1)
[1] "R" "G" "H"
> translate(x, frame = 2)
[1] "V" "V"
> translate(x, frame = 3)
[1] "W" "S"
> translate(x, frame = 4)
[1] "G" "H"

The functions aaa and a convert AA sequences from the one-letter coding to the three-letter one, and vice versa:

> aaa(translate(x))
[1] "Thr" "Trp" "Ser"
> a(aaa(translate(x)))
[1] "T" "W" "S"

ape has a few functions for summarizing information from a set of DNA sequences.

• base.freq computes the proportions of each of the four bases; the results are returned as a table with names "A", "C", "G", and "T".
• GC.content is based on the previous function, and computes the proportion of guanine and cytosine; a single numeric value is returned.
• seg.sites returns the indices of the segregating sites, that is, the sites that are polymorphic.

seqinr has several functions for summarizing molecular sequences. count computes the frequencies of all possible combinations of n nucleotides, where n is specified with the argument word (there is also an option frame used in the same way as above):

> count(x, word = 1)

a c g t
2 2 3 3
> count(x, word = 2)

aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
 0  1  0  1  1  0  1  0  0  0  1  2  0  1  1  0
> count(x, word = 3)

aaa aac aag aat aca acc acg act aga agc agg agt ata atc atg
  0   0   0   0   0   0   1   0   0   0   0   0   0   0   0
att caa cac cag cat cca ccc ccg cct cga cgc cgg cgt cta ctc
  0   0   0   0   1   0   0   0   0   0   0   0   1   0   0
ctg ctt gaa gac gag gat gca gcc gcg gct gga ggc ggg ggt gta
  0   0   0   0   0   0   0   0   0   0   0   0   0   1   0
gtc gtg gtt taa tac tag tat tca tcc tcg tct tga tgc tgg tgt
  1   1   0   0   0   0   0   1   0   0   0   0   0   1   0
tta ttc ttg ttt
  0   0   0   0

The three functions GC, GC2, and GC3 compute the proportion of guanine and cytosine over the whole sequence, over the second positions, and over the third ones, respectively:

> GC(x)
[1] 0.5
> GC2(x)
[1] 0.9999
> GC3(x)
[1] 0.6666

There are two summary methods for the classes "SeqFastaAA" and "SeqFastadna": they print a summary of the frequencies of the different amino acids or bases, and other information such as the lengths of the sequences. AAstat has the same effect as summary.SeqFastaAA, but additionally a graph is plotted of the position of the different categories of AAs. For instance, taking a protein sequence distributed with seqinr (Fig. 3.2):
Fig. 3.2. Plot of the distribution of amino acid categories (Acidic, Basic, Charged, Polar, Non.polar, Aromatic, Aliphatic, Small, Tiny) along the sequence of a protein; the horizontal axis gives the position of the residues along the sequence

> ss <- read.fasta(system.file("sequences/seqAA.fasta",
+                              package = "seqinr"),
+                  seqtype = "AA")
> AAstat(ss[[1]])
$Compo

* A C D E F G H I K L M N P Q R S T V W
18661868191429571091316763
Y
1
....

3.5 Generating Random Trees

ape has two functions to generate random trees under different assumptions. rtree generates a tree by random splitting; its interface is:

rtree(n, rooted = TRUE, tip.label = NULL, br = runif, ...)

where n specifies the number of tips. The tree is rooted by default. If tip.label is left NULL, the labels "t1", "t2", ..., are given to the tips. br specifies the function to generate random branch lengths: further arguments for this function are given in place of the “dot-dot-dot” (...). By default, a uniform distribution between 0 and 1 is used. Use br = NULL for a tree with no branch lengths.

rcoal generates a “coalescent” tree by random clustering of tips; its interface is:

rcoal(n, tip.label = NULL, br = rexp, ...)

where the options are similar to those of rtree. Note that the default for br is the exponential distribution: this is used to generate node heights (branch lengths are computed from these heights).

Both rtree and rcoal generate a single tree: they must be called repeatedly to generate a sample of random trees.

apTreeshape has the function rtreeshape that generates tree topologies under various models. Its interface is:

rtreeshape(n, tip.number, p = 0.3, model = "", FUN = "")

where n is the number of generated trees (unlike the above two functions), tip.number is the number of tips, p is a parameter used if model = "biased" (see below), model specifies the model to be used, and FUN gives a function to generate trees according to Aldous’s Markov branching model [2]. Either model or FUN must be specified, but not both. Note that the arguments are not recycled in R’s usual way: for instance, rtreeshape(2, c(5, 10), model = "yule") will generate four trees (two with five tips, and two with ten).

The three models that can be specified with the argument model are:

• The Yule model (model = "yule") where each species has the same probability of splitting in two species;
• The PDA (proportional to distinguishable arrangements) model (model = "pda") where each topology is equiprobable;
• The biased model (model = "biased") where a species with splitting probability r gives, if it splits, two daughter-species with splitting probability pr and (1 − p)r, respectively [83]. The value of p is given by the argument p.

In Aldous’s [2] model, the splitting probabilities are specified through a function denoted Qn(i) which gives the probability that a clade with n tips is made of two sibling groups with i and n − i tips, respectively. We specify these probabilities with the argument FUN, which takes an R function of the form Q(n, i). For instance, for a completely unbalanced tree we use the following:

Q <- function(n, i) if (i == 1) 1 else 0
rtreeshape(1, 10, FUN = Q)

which says that a clade of size n is certain to be made of two subclades with one and n − 1 tips, respectively. The probabilities given in FUN do not need to sum to one, so that it is easy to specify a given model. An interesting model may be to have splitting probabilities proportional to the size of the clade:

Q <- function(n, i) if (i > 0 && i
[...]

> x <- paste("AJ5345", 26:49, sep = "")
> x <- c("Z73494", x)
> x
 [1] "Z73494"   "AJ534526" "AJ534527" "AJ534528" "AJ534529"
 [6] "AJ534530" "AJ534531" "AJ534532" "AJ534533" "AJ534534"
[11] "AJ534535" "AJ534536" "AJ534537" "AJ534538" "AJ534539"
[16] "AJ534540" "AJ534541" "AJ534542" "AJ534543" "AJ534544"
[21] "AJ534545" "AJ534546" "AJ534547" "AJ534548" "AJ534549"

We then read the sequences. Of course, the computer must be connected to the Internet:

sylvia.seq <- read.GenBank(x)

We check that the data have been correctly downloaded by looking at the structure of the returned object:

> str(sylvia.seq)
List of 25
 $ Z73494  : chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534526: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534527: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534528: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534529: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534530: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534531: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534532: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534533: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534534: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534535: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534536: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534537: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534538: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534539: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534540: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534541: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534542: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534543: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534544: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534545: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534546: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534547: chr [1:1143] "a" "t" "g" "g" ...
 $ AJ534548: chr [1:1041] "g" "g" "a" "t" ...
 $ AJ534549: chr [1:1041] "g" "g" "a" "t" ...
 - attr(*, "species")= chr [1:25] "Sylvia_atricapilla_atricapilla" "Chamaea_fasciata" "Sylvia_nisoria" "Sylvia_layardi" ...

We have effectively a list with 25 sequences: 23 of them have 1143 nucleotides, and 2 have 1041. This necessitates an alignment operation with ClustalX. To do this we first write the data in a file in FASTA format:

write.dna(sylvia.seq, "sylviaseq.fas", format = "fasta")

The first three lines of the file ‘sylviaseq.fas’ are:

>Z73494
atggctctcaatcttcgaaaaaaccaccctatcctaaaagtcatcaacgacgccctaatc
gacctaccaacgccgtctaacatctctacttgatgaaacttcggctcactcctaggtctt
....

The alignment operation shows that there are 102 missing nucleotides in the last two sequences. The alignment made by ClustalX is saved in “Phylip” format which is actually the interleaved format of Phylip [38]. The data are read back into R using read.dna:

sylvia.seq.ali <- read.dna("sylviaseq.phy")

Note that we kept the original (unaligned) sequences from GenBank because they have the species names. To save some memory, we can keep these names in a separate vector whose names are the accession numbers (we show later the advantage of using this structure), and erase the original sequences:

> taxa.sylvia <- attr(sylvia.seq, "species")
> names(taxa.sylvia) <- names(sylvia.seq)
> rm(sylvia.seq)

We then see that two of these names have to be fixed:

> taxa.sylvia[c(1, 24)]
Z73494
"Sylvia_atricapilla_atricapilla"
AJ534548
"Illadopsis_abyssinica"

Böhning-Gaese et al. [10] wrote that Illadopsis abyssinica had a different generic status, but they considered it as belonging to Sylvia: we change this accordingly for consistency. We also remove the subspecies name of the first sequence, and print all the species names:

> taxa.sylvia[1] <- "Sylvia_atricapilla"
> taxa.sylvia[24] <- "Sylvia_abyssinica"
> taxa.sylvia
Z73494                  AJ534526
"Sylvia_atricapilla"    "Chamaea_fasciata"
AJ534527                AJ534528
"Sylvia_nisoria"        "Sylvia_layardi"
AJ534529                AJ534530
"Sylvia_subcaeruleum"   "Sylvia_boehmi"
AJ534531                AJ534532
"Sylvia_buryi"          "Sylvia_lugens"
AJ534533                AJ534534
"Sylvia_leucomelaena"   "Sylvia_hortensis"
AJ534535                AJ534536
"Sylvia_crassirostris"  "Sylvia_curruca"
AJ534537                AJ534538
"Sylvia_nana"           "Sylvia_communis"
AJ534539                AJ534540
"Sylvia_conspicillata"  "Sylvia_deserticola"
AJ534541                AJ534542
"Sylvia_balearica"      "Sylvia_undata"
AJ534543                AJ534544
"Sylvia_cantillans"     "Sylvia_melanocephala"
AJ534545                AJ534546
"Sylvia_mystacea"       "Sylvia_melanothorax"
AJ534547                AJ534548
"Sylvia_rueppelli"      "Sylvia_abyssinica"
AJ534549
"Sylvia_borin"

The ecological data are in a file ‘sylvia_data.txt’ whose first three lines are:

mig.dist mig.behav geo.range
Sylvia_abyssinica 0 resid trop
Sylvia_atricapilla 5000 short temptrop
....

We read these data simply with read.table, and check the returned object:

> sylvia.eco <- read.table("sylvia_data.txt")
> str(sylvia.eco)
'data.frame': 26 obs. of 3 variables:
 $ mig.dist : int 0 5000 7500 5900 5500 3400 2600 0 0 0 ...
 $ mig.behav: Factor w/ 3 levels "long","resid",..
 $ geo.range: Factor w/ 3 levels "temp","temptrop",..
....

Note that the species names are used as rownames in this data frame:

> rownames(sylvia.eco)
 [1] "Sylvia_abyssinica"    "Sylvia_atricapilla"
 [3] "Sylvia_borin"         "Sylvia_nisoria"
 [5] "Sylvia_curruca"       "Sylvia_hortensis"
 [7] "Sylvia_crassirostris" "Sylvia_leucomelaena"
 [9] "Sylvia_buryi"         "Sylvia_lugens"
[11] "Sylvia_layardi"       "Sylvia_subcaeruleum"
[13] "Sylvia_boehmi"        "Sylvia_nana"
[15] "Sylvia_deserti"       "Sylvia_communis"
[17] "Sylvia_conspicillata" "Sylvia_deserticola"
[19] "Sylvia_undata"        "Sylvia_sarda"
[21] "Sylvia_balearica"     "Sylvia_cantillans"
[23] "Sylvia_mystacea"      "Sylvia_melanocephala"
[25] "Sylvia_rueppelli"     "Sylvia_melanothorax"

The data are ready and can be saved in an R workspace before being analyzed:

save(sylvia.seq.ali, taxa.sylvia, sylvia.eco,
     file = "sylvia.RData")

3.6.2 Phylogeny of the Felidae

Johnson and O’Brien [75] studied the phylogenetic relationships of all extant species of felids (cats) using sequences from two mitochondrial genes: 16S rRNA and NADH-5. For simplicity, we use only the first set of sequences. The procedure of getting and preparing these data follows the same lines as with the Sylvia case. The accession numbers in GenBank range from AF006387 to AF006459 with only the odd numbers:

x <- paste("AF006", seq(387, 459, 2), sep = "")
felidseq16S <- read.GenBank(x)

The sequences are not of the same lengths (some insertions/deletions have been reported in [75]):

> table(unlist(lapply(felidseq16S, length)))

372 373 374 375 376
  3   9  15   9   1
> str(felidseq16S[1:5])
List of 5
 $ AF006387: chr [1:374] "t" "t" "t" "g" ...
 $ AF006389: chr [1:375] "t" "t" "t" "g" ...
 $ AF006391: chr [1:372] "c" "t" "t" "g" ...
 $ AF006393: chr [1:375] "t" "t" "t" "g" ...
 $ AF006395: chr [1:374] "t" "t" "t" "g" ...

We write the sequences in a file in FASTA format to align them with ClustalX:

write.dna(felidseq16S, "felidseq16S.fas", format = "fasta")
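The two preparation steps used above (building a series of accession numbers, then tabulating sequence lengths) can be tried without an Internet connection. In this sketch the object toy.seq is a made-up stand-in for the downloaded sequences; only the accession numbers come from the text.

```r
# Accession numbers AF006387, AF006389, ..., AF006459 (odd numbers only)
acc <- paste("AF006", seq(387, 459, by = 2), sep = "")

# Toy stand-in for the downloaded sequences: a list of character vectors
# of unequal lengths (made-up data, not the felid sequences)
toy.seq <- list(s1 = rep("a", 5), s2 = rep("c", 5), s3 = rep("g", 7))

# Number of sequences for each observed length
len.tab <- table(sapply(toy.seq, length))
```

Here sapply is used instead of unlist(lapply(...)); both give a vector of lengths that table can summarize.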
We also save the names of the species with the accession numbers:

taxa.felid <- attr(felidseq16S, "species")
names(taxa.felid) <- names(felidseq16S)

The aligned sequences are read back in R:

felidseq16Sali <- read.dna("felidseq16S.phy")

And we may check that they all have the same length:

> table(unlist(lapply(felidseq16Sali, length)))

382
 37

In addition to the sequence data, we use data on body mass (source [143]). The first three lines from the file ‘felid_bodymass.txt’ are:

Acinonyx_jubatus 50000
Caracal_caracal 13749.9
Catopuma_badia 2500
....

We read this file with read.table:

DF <- read.table("felid_bodymass.txt")

Because there is only one variable, it is simpler to keep it as a vector with names set as the species names:

> felid.body.mass <- DF$V2
> names(felid.body.mass) <- DF$V1
> felid.body.mass
Acinonyx_jubatus    Caracal_caracal
50000.00            13749.90
Catopuma_badia      Catopuma_temminckii
2500.00             11500.00
....

We save the aligned sequences, the species names, and the body mass data for further analyses:

save(felidseq16Sali, taxa.felid, felid.body.mass,
     file = "felid.RData")

3.6.3 Snake Venom Proteome

Fry [41] made an extensive analysis of the relationships among snake venom proteins and related nontoxic proteins. We limit ourselves to a single data set: the pseutarin C of the Eastern brown snake (Pseudonaja textilis) and the related mammalian coagulation factor V [41, Fig. 3B]. The goal of the present application is to get the protein sequence data.

The original data come from the SWISSPROT database of protein sequences. The 21 accession numbers and the corresponding species names are stored in a file called ‘venom_factorV.txt’ the first three lines of which are:

No species
Q9BQS7 Homo_sapiens
Q9Z0Z4 Mus_musculus
....

We read them with read.table setting as.is = TRUE to avoid these character strings being treated as factors, and header = TRUE to specify that the first line contains the names of the columns; we then display the first two rows of the data frame:

> venom.no <- read.table("venom_factorV.txt", as.is = TRUE,
+                        header = TRUE)
> venom.no[1:2, ]
      No      species
1 Q9BQS7 Homo_sapiens
2 Q9Z0Z4 Mus_musculus

We read these data with seqinr which we load in memory, and then we select the SWISSPROT database.

library(seqinr)
s <- choosebank("swissprot")

We can now query the database using the accession numbers. This is done with the “ac” keyword of the function query. For instance, if we want to retrieve the sequence of the Eastern brown snake (no. Q7SZN0), we do:

> query(s$socket, "venom", "ac=Q7SZN0")
$socket:
                description                       class
"->pbil.univ-lyon1.fr:5558"                    "socket"
                       mode                        text
                       "a+"                      "text"
                     opened                    can read
                   "opened"                       "yes"
                  can write
                      "yes"

$banque: swissprot
$call: query(socket = s$socket, listname = "venom", query = "ac=Q7SZN0")
$name:
[1] "venom"
  list length      mode   content
1 $req      1 character sequences

We then retrieve the sequence itself with getSequence (we print the first twenty amino acids to check):

> X <- getSequence(venom$req[[1]])
> X[1:20]
 [1] "M" "G" "R" "Y" "S" "V" "S" "P" "V" "P" "K" "C" "L" "L"
[15] "L" "M" "F" "L" "G" "W"

To retrieve several sequences at the same time with their accession numbers, we need to use the keyword “OU”; for instance, to get the first two sequences we are looking for we could do:

query(s$socket, "venom", "ac=Q9BQS7 OU ac=Q9Z0Z4")

We again use the function paste to put the 21 numbers together:

> paste("ac", venom.no$No, sep = "=")
 [1] "ac=Q9BQS7" "ac=Q9Z0Z4" "ac=Q7ZU12" "ac=Q61147"
 [5] "ac=P00450" "ac=Q804W6" "ac=Q804X3" "ac=Q7TN96"
 [9] "ac=Q06194" "ac=P00451" "ac=P12263" "ac=O62730"
[13] "ac=Q804W5" "ac=Q90X47" "ac=Q7SZN0" "ac=Q804X4"
[17] "ac=P12259" "ac=Q28107" "ac=Q9GLP1" "ac=Q7TPK2"
[21] "ac=O88783"

But this time we need to have all these numbers in a single character string. This is done with the option collapse. To see the result, let us do it with the first four numbers:

> paste("ac", venom.no$No[1:4], sep = "=", collapse = " OU ")
[1] "ac=Q9BQS7 OU ac=Q9Z0Z4 OU ac=Q7ZU12 OU ac=Q61147"

Because there is no need to print the whole string with the 21 numbers, we store it in an object called no4query, and use it as an argument to query:

> no4query <- paste("ac=", venom.no$No, sep = "",
+                   collapse = " OU ")
> query(s$socket, "venom", no4query)
$socket:
                description                       class
"->pbil.univ-lyon1.fr:5558"                    "socket"
                       mode                        text
                       "a+"                      "text"
                     opened                    can read
                   "opened"                       "yes"
                  can write
                      "yes"

$banque: swissprot
$call: query(socket = s$socket, listname = "venom", query = no4query)
$name:
[1] "venom"
  list length      mode   content
1 $req     21 character sequences

The last line of the printed output shows that the 21 sequences have been found in ACNUC. We are now ready to download them. We do it by applying the function getSequence to each element of the list venom$req:

venom.seq <- lapply(venom$req, getSequence)

We check the results by looking at the structure of venom.seq:

> str(venom.seq)
List of 21
 $ : chr [1:1065] "M" "K" "I" "L" ...
 $ : chr [1:1062] "M" "K" "F" "L" ...
 $ : chr [1:2211] "M" "F" "L" "A" ...
 $ : chr [1:2224] "M" "F" "P" "G" ...
 $ : chr [1:2258] "M" "F" "P" "A" ...
 $ : chr [1:2351] "M" "Q" "I" "E" ...
 $ : chr [1:2319] "M" "Q" "I" "A" ...
 $ : chr [1:2133] "M" "Q" "L" "E" ...
 $ : chr [1:1158] "M" "E" "S" "G" ...
 $ : chr [1:1157] "M" "K" "A" "G" ...
 $ : chr [1:2343] "M" "Q" "V" "E" ...
 $ : chr [1:2183] "M" "L" "L" "V" ...
 $ : chr [1:1460] "M" "G" "R" "Y" ...
 $ : chr [1:2258] "M" "R" "A" "A" ...
 $ : chr [1:2102] "M" "Q" "S" "S" ...
 $ : chr [1:1087] "M" "K" "G" "L" ...
 $ : chr [1:1802] "F" "S" "P" "T" ...
 $ : chr [1:1639] "M" "R" "T" "D" ...
 $ : chr [1:1377] "V" "W" "T" "L" ...
 $ : chr [1:745] "C" "F" "Q" "V" ...
 $ : chr [1:2119] "M" "K" "L" "R" ...

Because the retrieval operation lost the accession numbers, we assign them as names to the list venom.seq:

names(venom.seq) <- venom.no$No

3.6.4 Mammalian Mitochondrial Genomes

Gibson et al. [51] made a comprehensive analysis of the mitochondrial genomes of 69 species of mammals. They explored the variations in base composition in different regions of this genome. We limit ourselves to simpler analyses. The goal is to show how to read heterogeneous data in a big file, and manipulate and prepare them in R.

The original data come from the OGRe (Organellar Genome Retrieval system) database (http://ogre.mcmaster.ca/). All mammalian mtGenomes available in the database were downloaded in April 2005. This represents 109 species. The data were saved in a single file called ‘mammal_mtGenome.fasta’. The first six lines of this file show how the data are presented:

################################################
#OGRe sequences
################################################
#
#DASNOVMIT:_Dasypus novemcinctus_ (nine-banded armadillo)...
#TAMTETMIT:_Tamandua tetradactyla_ (southern tamandua):...
....

After the 109 species names and codes, the sequences are printed in FASTA format. For instance, the lines 116–118 are:

>DASNOVMIT(ATP6)
atgaacgaaaacctatttgcctcattcgctacccctaccataataggcct...
caagtattcttttccctacccctaaacggataattaccaaccgagtggta...
....

Thus the species codes used in the first part of the file are used for the sequence names together with the names of the genes in parentheses. Consequently we need to get the correspondence between these species codes and the species names. Thanks to the flexibility of read.table we do this relatively straightforwardly. If we examine the first lines from the file above, we notice that the command needed to read the species names and codes will need to:
• Skip the first four lines,
• Read only 109 lines,
• Use the underscore "_" as the character separating the two columns,
• Ignore what comes after the scientific name on each line.

The corresponding command is (we again use the as.is = TRUE option for the same reason):

mtgen.taxa <- read.table("mammal_mtGenome.fasta", skip = 4,
                         nrows = 109, sep = "_",
                         comment.char = "(", as.is = TRUE)

Note that we take advantage of the fact that the common names are within parentheses: this is done with the option comment.char (whose default value is "#"). We look at the first five rows:

> mtgen.taxa[1:5, ]
           V1                    V2 V3
1 #DASNOVMIT:  Dasypus novemcinctus NA
2 #TAMTETMIT: Tamandua tetradactyla NA
3 #ORYCUNMIT: Oryctolagus cuniculus NA
4 #OCHCOLMIT:    Ochotona collaris NA
5 #LEPEURMIT:       Lepus europaeus NA

There are a few undesirable side-effects to our command, but they are easily solved. The fact that we set sep = "_" resulted in the space after the second underscore being read as a variable. We can delete it with:

mtgen.taxa$V3 <- NULL

The first column containing the species codes has a few extra characters that we wish to remove. We can do this operation with gsub.

mtgen.taxa$V1 <- gsub("#", "", mtgen.taxa$V1)
mtgen.taxa$V1 <- gsub(":", "", mtgen.taxa$V1)

Finally we change the names of the columns and check the results:

> colnames(mtgen.taxa) <- c("code", "species")
> mtgen.taxa[1:5, ]
       code               species
1 DASNOVMIT  Dasypus novemcinctus
2 TAMTETMIT Tamandua tetradactyla
3 ORYCUNMIT Oryctolagus cuniculus
4 OCHCOLMIT     Ochotona collaris
5 LEPEURMIT       Lepus europaeus

After this small string manipulation, we can read the sequences with read.dna. This function also has an option skip that we use here. We then check the number of sequences read:
> mtgen <- read.dna("mammal_mtGenome.fasta",
+                   format = "fasta", skip = 115)
> length(mtgen)
[1] 4033

We also check the names of the first ten sequences:

> names(mtgen)[1:10]
 [1] "DASNOVMIT(ATP6)" "TAMTETMIT(ATP6)" "ORYCUNMIT(ATP6)"
 [4] "OCHCOLMIT(ATP6)" "LEPEURMIT(ATP6)" "OCHPRIMIT(ATP6)"
 [7] "BERBAIMIT(ATP6)" "BALMUSMIT(ATP6)" "PONBLAMIT(ATP6)"
[10] "CAPHIRMIT(ATP6)"

It would be interesting now to get only the name of the gene for each sequence in a separate vector. Again we can use gsub for this, but the command is slightly more complicated because we want to remove all characters outside the parentheses, and the latter as well. We use the fact that gsub can treat regular expressions. For instance, we can do this:

genes <- gsub("^[[:alnum:]]{1,}\\(", "", names(mtgen))

where "^[[:alnum:]]{1,}\\(" means “a character string starting with one or more alphanumeric character(s) and followed by a left parenthesis”. The parenthesis is escaped with a backslash because it has a special meaning in regular expressions, and the backslash itself must be doubled inside an R character string. We need to call gsub a second time to remove the trailing right parenthesis:

genes <- gsub("\\)$", "", genes)

Note in these two examples how the caret ^ and the dollar $ are used to specify that the characters we are looking for start or end the string, respectively (the syntax of regular expressions used by R is detailed in a help page: ?regexp). After this operation it appears that some values in genes indicate that the sequence is actually empty:

> unique(genes)[11]
[1] "ND4 Sequence does not exist"

To remove these missing sequences, we find them using grep:

> i <- grep("Sequence does not exist", names(mtgen))
> i
[1]  998 3371 3375

There are thus three missing sequences in the data set. We remove them with:

> mtgen <- mtgen[-i]

And we repeat the operation of extracting the sequence names:

genes <- gsub("^[[:alnum:]]{1,}\\(", "", names(mtgen))
genes <- gsub("\\)$", "", genes)
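The two calls to gsub above can also be written as a single substitution with a capture group. The following sketch uses two made-up names in the same "CODE(GENE)" layout (they are illustrative, not read from the data file) and checks that both approaches agree, including for a gene name that itself contains parentheses.

```r
# Hypothetical sequence names in the "CODE(GENE)" layout
nms <- c("DASNOVMIT(ATP6)", "TAMTETMIT(tRNA-Leu(CUN))")

# Two-step version, as in the text
g2 <- gsub("^[[:alnum:]]{1,}\\(", "", nms)
g2 <- gsub("\\)$", "", g2)

# One-step alternative: keep what lies between the first "(" and the
# final ")" by referring to the captured group with "\\1"
g1 <- sub("^[[:alnum:]]+\\((.*)\\)$", "\\1", nms)
```

The one-step form leaves a name unchanged when it does not follow the expected layout, which can make unexpected entries easier to spot.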
We can now look at how many sequences there are for each gene:

> table(genes)
genes
         ATP6          ATP8          COX1          COX2
          109           109           109           109
         COX3          CYTB           ND1           ND2
          109           109           109           109
          ND3           ND4          ND4L           ND5
          109           108           109           109
          ND6           RNL           RNS      tRNA-Ala
          109           109           109           109
     tRNA-Arg      tRNA-Asn      tRNA-Asp      tRNA-Cys
          109           109           109           109
     tRNA-Gln      tRNA-Glu      tRNA-Gly      tRNA-His
          109           109           109           109
     tRNA-Ile tRNA-Leu(CUN) tRNA-Leu(UUR)      tRNA-Lys
          109           109           109           109
     tRNA-Met      tRNA-Phe      tRNA-Pro tRNA-Ser(AGY)
          109           109           107           109
tRNA-Ser(UCN)      tRNA-Thr      tRNA-Trp      tRNA-Tyr
          109           109           109           109
     tRNA-Val
          109

We see that we miss one sequence of “ND4” and two of “tRNA-Pro” (this can be seen more clearly with sort(table(genes))).

We are now ready to do all sorts of analyses with this data set. We see how to analyze base frequencies at three levels of variation:

• Between species (all genes pooled);
• Between genes (all species pooled);
• Between sites for a single protein-coding gene (all species pooled).

To calculate the base frequencies for each species, we first create a matrix with 109 rows and 4 columns that will store the results:

BF.sp <- matrix(NA, nrow = 109, ncol = 4)

We set its rownames with the species names, and the colnames with the four base symbols:

rownames(BF.sp) <- mtgen.taxa$species
colnames(BF.sp) <- c("A", "C", "G", "T")

We put in each row of this matrix the frequency of each base. This involves:

1. Selecting only the sequences with the corresponding species code using grep;
2. Computing the base frequencies for the selected sequences with the function base.freq;
3. Repeating these two operations for all 109 species.

A simple approach is to use a for loop where a variable, say i, will vary from 1 to 109: this will be used as index for both BF.sp and mtgen.taxa$code. The commands are relatively straightforward and use some elements seen above. For clarity, we write two separate commands within the loop (the indices of the selected genes are stored in x):

for (i in 1:109) {
    x <- grep(mtgen.taxa$code[i], names(mtgen))
    BF.sp[i, ] <- base.freq(mtgen[x])
}

To visualize the results, we use the graphical function matplot which plots the columns of a matrix. We add the options type = "l" to have lines (the default is points), and col = 1 to avoid colors. We further add a legend (Fig. 3.4):

matplot(BF.sp, type = "l", col = 1, xlab = "Species",
        ylab = "Base frequency")
legend(0, 0.23, c("A", "C", "G", "T"), lty = 1:4, bty = "n")

Fig. 3.4. Plot of the base frequencies (A, C, G, T) of the mitochondrial genome of 109 species of mammals; the horizontal axis is the species index and the vertical axis is the base frequency

The second analysis (between genes for all species pooled) will follow the same lines as the previous one. The matrix used to store the results will have 37 rows, and its rownames will be the names of the genes.
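The per-species loop described above can be tried on toy data with base R only. In this sketch the sequences and species codes are made up, and toy.base.freq is a hypothetical stand-in written for the illustration; it is not the base.freq function of ape.

```r
# Made-up data: two species codes, two short "genes" each
toy.seq <- list("SPAMIT(G1)" = c("a", "c", "g", "t"),
                "SPBMIT(G1)" = c("a", "a", "a", "t"),
                "SPAMIT(G2)" = c("g", "g", "c", "c"),
                "SPBMIT(G2)" = c("t", "t", "a", "a"))
codes <- c("SPAMIT", "SPBMIT")

# Hypothetical stand-in for base.freq: pooled proportions of a, c, g, t
toy.base.freq <- function(seqs) {
    pooled <- factor(unlist(seqs), levels = c("a", "c", "g", "t"))
    as.vector(table(pooled)) / length(pooled)
}

# One row per species, one column per base
BF <- matrix(NA, nrow = length(codes), ncol = 4,
             dimnames = list(codes, c("A", "C", "G", "T")))
for (i in seq_along(codes)) {
    x <- grep(codes[i], names(toy.seq))   # sequences of species i
    BF[i, ] <- toy.base.freq(toy.seq[x])
}
```

Declaring the four bases as factor levels ensures that a base absent from a species still gets a frequency of zero, so that every row has four values.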
A subtlety here is the need to use the option fixed = TRUE in grep: the reason is that some gene names contain parentheses and these characters have a special meaning in regular expressions. The option used here forces grep to treat its first argument as a simple character string, and thus avoids this annoyance. The full set of commands is:

    BF.gene <- matrix(NA, nrow = 37, ncol = 4)
    rownames(BF.gene) <- unique(genes)
    colnames(BF.gene) <- c("A", "C", "G", "T")
    for (i in 1:37) {
        x <- grep(rownames(BF.gene)[i], names(mtgen), fixed = TRUE)
        BF.gene[i, ] <- base.freq(mtgen[x])
    }

We represent the results in a different way by using the function barplot which, by default, makes a stacked bar plot of the rows for each column: we thus need to transpose the matrix BF.gene first. Because some gene names are somewhat long, we modify the margins; we also use the options las = 2 to force the labels on the x-axis to be vertical, and legend = TRUE to add a legend (Fig. 3.5):

    par(mar = c(8, 3, 3, 2))
    barplot(t(BF.gene), las = 2, legend = TRUE)

Fig. 3.5. Plot of the base frequencies of the mitochondrial genome of 109 species of mammals for each gene
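The effect of fixed = TRUE can be checked on a small vector of names. The sketch below uses made-up sequence names (nms is hypothetical); with the default regular-expression matching, the parentheses are read as a grouping operator, so the pattern looks for "tRNA-LeuCUN" and finds nothing.

```r
gene <- "tRNA-Leu(CUN)"
nms <- c("sp1_tRNA-Leu(CUN)", "sp1_tRNA-Leu(UUR)", "sp1_CYTB")  # made-up names

# Default: the pattern is a regular expression, "(CUN)" is a group
hits.regex <- grep(gene, nms)

# fixed = TRUE: the pattern is matched as a plain character string
hits.fixed <- grep(gene, nms, fixed = TRUE)

hits.regex  # no match
hits.fixed  # the first name only
```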
For the third analysis (between sites for a single gene) we focus on the cytochrome b gene, whose code is CYTB. We first extract the sequences of this gene by taking the appropriate indices in the way seen above:

    cytb <- mtgen[grep("CYTB", names(mtgen))]

We now look at the length of each sequence using lapply and length, and summarize the results with table:

    > table(unlist(lapply(cytb, length)))

    1135 1137 1138 1139 1140 1141 1143 1144 1146 1149
       1    4    3    1   78   10    2    1    7    2

The majority of these sequences have 1140 sites and thus it is likely that they are properly aligned. Furthermore, a look at the first few nucleotides (which can be done with str(cytb)) suggests this is true for all 109 sequences. For simplicity we assume this to be correct, although a more rigorous check of the alignment, as done for the other cases above, is possible.

To extract the first, second, or third codon position we need to do the operation within each sequence. Thus we need a command such as (for the first position) cytb[[1]][c(TRUE, FALSE, FALSE)] repeated for each sequence in cytb. There are several solutions for this: we choose one where the extraction using logical indexing is included in a function that we apply to each element of cytb. This is done three times, moving the position of TRUE each time:

    cytb1 <- lapply(cytb, function(x) x[c(TRUE, FALSE, FALSE)])
    cytb2 <- lapply(cytb, function(x) x[c(FALSE, TRUE, FALSE)])
    cytb3 <- lapply(cytb, function(x) x[c(FALSE, FALSE, TRUE)])

cytb1, cytb2, and cytb3 are three lists containing the first, second, and third positions, respectively. We can now proceed in a similar way as done above:

    > BF.cytb <- matrix(NA, 3, 4)
    > rownames(BF.cytb) <- c("1st codon position",
    +     "2nd codon position", "3rd codon position")
    > colnames(BF.cytb) <- c("A", "C", "G", "T")
    > BF.cytb[1, ] <- base.freq(cytb1)
    > BF.cytb[2, ] <- base.freq(cytb2)
    > BF.cytb[3, ] <- base.freq(cytb3)
    > BF.cytb
                               A         C          G         T
    1st codon position 0.2902603 0.2628290 0.21503534 0.2318753
    2nd codon position 0.2018149 0.2491915 0.13636144 0.4126321
    3rd codon position 0.4043395 0.3778631 0.03600994 0.1817875
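The extraction of codon positions relies on the recycling of a logical index of length three along the whole sequence. A minimal sketch on a made-up nine-base sequence (s is hypothetical) shows what each index selects, and checks the result against an equivalent numeric index built with seq:

```r
s <- c("a", "t", "g", "c", "c", "a", "t", "a", "g")  # codons: atg cca tag

# The logical vector of length 3 is recycled along the 9 sites
pos1 <- s[c(TRUE, FALSE, FALSE)]  # sites 1, 4, 7
pos2 <- s[c(FALSE, TRUE, FALSE)]  # sites 2, 5, 8
pos3 <- s[c(FALSE, FALSE, TRUE)]  # sites 3, 6, 9

# Equivalent selection of the first positions with a numeric index
same <- identical(pos1, s[seq(1, length(s), by = 3)])
```

The logical form has the practical advantage that it does not need the sequence length, which is why it fits well inside lapply.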
We plot the results again using barplot but adding a few annotations to present the figure (Fig. 3.6):

    barplot(t(BF.cytb), main = "Cytochrome b",
            ylab = "Base frequency")
    text(0.7, BF.cytb[1, 1]/2, "A", cex = 2)
    text(0.7, BF.cytb[1, 1] + BF.cytb[1, 2]/2, "C", cex = 2)
    text(0.7, sum(BF.cytb[1, 1:2]) + BF.cytb[1, 3]/2, "G", cex = 2)
    text(0.7, sum(BF.cytb[1, 1:3]) + BF.cytb[1, 4]/2, "T", cex = 2)

Fig. 3.6. Plot of the base frequencies at the three codon positions of the gene of the cytochrome b for 109 species of mammals

3.6.5 Butterfly DNA Barcodes

Hebert et al. [67] analyzed the molecular variation in the neotropical skipper butterfly Astraptes fulgerator in order to assess the species limits among different forms known to have larval stages feeding on distinct host plants. They sequenced a portion of the mitochondrial gene cytochrome oxidase I (COI) of 466 individuals belonging to 12 larval forms. The goal of this application is to prepare a large data set of DNA sequences, and align them for further analyses (Chapter 5).

The GenBank accession numbers are AY666597–AY667060, AY724411, and AY724412 (there is a printing error in [67] for these last two numbers). We read the sequences with read.GenBank in the same way as seen for the Sylvia or Felidae data.

    x <- paste("AY66", 6597:7060, sep = "")
    x <- c(x, "AY724411", "AY724412")
    astraptes.seq <- read.GenBank(x)

We then look at how the sequence lengths are distributed:

    > table(unlist(lapply(astraptes.seq, length)))

    208 219 227 244 297 370 373 413 440 548 555 573 582 599 600
      1   1   1   1   1   1   1   1   1   1   3   1   1   1   1
    601 603 608 609 616 619 620 623 626 627 628 629 630 631 632
      2   2   1   1   2   1   1   1   4   1   5   3   4   3   7
    633 634 635 636 638 639
      1   1  12   6   2 389

The sequences clearly need to be aligned. We resort to ClustalX once more by first saving the sequences in FASTA format:

    write.dna(astraptes.seq, "astraptesseq.fas", format = "fasta")

As before, the alignment made by ClustalX is saved in “Phylip” (interleaved) format, and is read back into R:

    astraptes.seq.ali <- read.dna("astraptesseq.phy")

We check the species names of the sequences downloaded from GenBank:

    > table(attr(astraptes.seq, "species"))

    Astraptes_sp._BYTTNER    Astraptes_sp._CELT
                        4                    23
      Astraptes_sp._FABOV  Astraptes_sp._HIHAMP
                       31                    16
     Astraptes_sp._INGCUP  Astraptes_sp._LOHAMP
                       65                    47
     Astraptes_sp._LONCHO    Astraptes_sp._MYST
                       41                     3
       Astraptes_sp._NUMT  Astraptes_sp._SENNOV
                        4                   102
      Astraptes_sp._TRIGO  Astraptes_sp._YESENN
                       51                    79

All specimens were thus attributed to Astraptes sp. with further information given as a code (explained in [67]). We do the same operation as above to store the taxon names with the accession numbers:

    taxa.astraptes <- attr(astraptes.seq, "species")
    names(taxa.astraptes) <- names(astraptes.seq)

We finally save the data for further analyses:

    save(astraptes.seq.ali, taxa.astraptes,
         file = "astraptes.RData")
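Before sending several hundred accession numbers to GenBank, it may be worth checking the vector built with paste offline. The following sketch only uses base R; the checks themselves (count, duplicates, string width) are suggestions and not part of the original analysis.

```r
# Build the accession numbers as in the text
x <- paste("AY66", 6597:7060, sep = "")
x <- c(x, "AY724411", "AY724412")

n.acc <- length(x)          # should match the 466 individuals
n.dup <- anyDuplicated(x)   # 0 if all accession numbers are distinct
widths <- unique(nchar(x))  # all numbers should have the same width

head(x, 2)
tail(x, 3)
```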
3.7 Exercises

Exercises 1–3 aim at familiarizing the reader with tree data structures in R; Exercises 4–6 give more concrete applications of the concepts from this chapter.

1. Create a random tree with 10 tips.
   (a) Extract the branch lengths, and store them in a vector.
   (b) Delete the branch lengths, and plot the tree.
   (c) Give new, random branch lengths from a uniform distribution U[0, 10]. Do this in a way that works for any number of tips.
   (d) Restore the original branch lengths of the tree.
2. Create a random tree with 5 tips, print it, and plot it. Find the way to delete the class of this object, and print it again. Try to plot it again: comment on what happens. Find a way to force the plot of the tree as before.
3. Generate three random trees with 10 tips. Write them in a file. Read this file in R. Print a summary of each tree. Write a small program that will do these operations for any number of trees (say N) and any number of tips (n).
4. Extract the tree #1000 in TreeBASE. Make three copies of this tree, and give them branch lengths (i) all equal to one, (ii) so that the node heights are proportional to the number of species, and (iii) randomly extracted from a uniform distribution U[0, 0.1].
5. Extract the sequences of the cytochrome b gene with the accession numbers U15717–U15724 (source: [59]).
   (a) Print the species names of each sequence.
   (b) Print, with a single command, the length of each sequence.
   (c) Arrange the data in a matrix.
   (d) Extract and store in three matrices the first, the second, and the third codon positions of all sequences. Compute their base frequencies. What do you conclude?
   (e) Save the three matrices in three different files. Read these files, and concatenate the three sets of sequences.
6. Get the following sequences from GenBank:
   • AF518328–AF51837 (source: [89]),
   • AF141220, AF004572, AF141219, AF004586, AF141217, AF004587, AB033713, AB033699, AB032853, AB033695.
   Prepare them along the same lines as in Section 3.6.
4 Plotting Phylogenies

Drawing phylogenetic trees has been important for a long time in the study of biological evolution, as illustrated by Darwin’s only figure in his Origin of Species [22]. A plotted phylogeny is the usual way to summarize the results of a phylogenetic analysis. This also gives the essence of the evolutionary processes and patterns.

Quite surprisingly, graphical tools have been somewhat neglected in the analysis of phylogenetic data. There is a very limited treatment of graphics in recent phylogenetics textbooks [39, 60, 106]. On the other hand, an important area of statistical research has been developed on the graphical analysis and exploration of data. Some of these developments have been implemented in R (e.g., see the lattice package). R also has a flexible and programmable graphical environment.

There is undoubtedly value in the graphical exploration of phylogenetic data. Character mapping has been done for some time in some issues, and it will be valuable to have a more general approach for graphical analysis and exploration of phylogenetic data. In this chapter, I explore some of these ideas, as well as explaining how to plot phylogenetic trees in simple ways. Inasmuch as there are many illustrations throughout the chapter, there are no case studies.

4.1 Simple Tree Drawing

plot.phylo in ape can draw four kinds of trees: phylograms (also called rectangular cladograms), cladograms (triangular cladograms), unrooted trees (dendrograms), and radial (circular) trees. This function is a method: it uses R’s syntax of the generic function plot, and acts specifically on "phylo" objects. It has several options; all of them are defined with default values. In its most direct use (i.e., plot(tr)) a phylogram is plotted on the current graphical device from left (root) to right (tips) with some space around (as defined by the current margins). The branch lengths, if available, are used. The tip
labels are printed in italics, left-justified from the tips of their respective terminal branches. The node labels and the root edge, if available, are ignored. As an example, Fig. 4.1 shows a tree named tr showing the relationships among some species of woodmice (Apodemus) and a few closely related species of rodents published by Michaux et al. [100]. The tree was plotted, after being read with tr <- read.tree("rodent.tre") (Section 3.2), by simply typing plot(tr).¹

Fig. 4.1. A simple use of plot(tr)

¹ In this chapter, the box delimiting the figures indicates the presence of margins around the tree.

The options alter these settings. They are described in Table 4.1. Most of these options have intuitive effects (e.g., type, font, etc.), whereas some have a NULL value by default. This means that, unless the user gives a specific value, it is determined with respect to other arguments. We have seen an illustration of this mechanism above with the simple command plot(tr). An obvious case where one option alters the default value of another is when the tree is plotted leftwards using direction = "l": the labels are now right-justified, which seems an obvious consequence of the change in direction. For instance, a leftwards cladogram of the same tree may be obtained with (the resulting plot is in Fig. 4.2):

    plot(tr, type = "c", use.edge.length = FALSE, direction = "l")

If the user wants to keep the labels left-justified, then the option adj must be used (Fig. 4.3):

    plot(tr, type = "c", use.edge.length = FALSE,
         direction = "l", adj = 0)

Table 4.1. The options of plot.phylo. The values marked with (d) are the default ones

    Option           Effect                            Possible values
    type             Type of tree                      "p" (d), "c", "u", "r"
    use.edge.length  Whether to use branch lengths     TRUE (d), FALSE
    node.pos         Vertical position of the nodes    NULL (d), 1, 2
                     with respect to the positions
                     of the tips
    show.tip.label   Whether to show tip labels        TRUE (d), FALSE
    show.node.label  Whether to show node labels       FALSE (d), TRUE
    edge.color       The line colors of the edges      NULL (d), a vector of strings
                                                       giving the colors
    edge.width       The line thickness of the edges   NULL (d), a vector of numeric
                                                       values
    font             The font of the labels            1 (normal), 2 (bold),
                                                       3 (italics) (d),
                                                       4 (bold italics)
    cex              Relative character size           A numeric value (default: 1)
    adj              Horizontal and vertical           NULL (d), one or two numeric
                     adjustment of the labels          values
    srt              Rotation of the labels            A numeric value (default: 0)
    no.margin        Leave some space around the tree  FALSE (d), TRUE
    root.edge        Draw the root edge                FALSE (d), TRUE
    label.offset     Space between the tips            0 (d), a numeric value
                     and the labels
    underscore       Display the underscores           FALSE (d), TRUE
                     in tip labels
    x.lim            Limits on the horizontal axis     NULL (d), two numeric values
    y.lim            Limits on the vertical axis       NULL (d), two numeric values
    direction        Direction of the tree             "r" (d), "l", "u", "d"
    lab4ut           Style of labels for unrooted      "horizontal" (d), "radial"
                     trees

Many publishers of journals or books prefer to receive figures in Encapsulated PostScript (EPS) format. The function postscript in R may be used to produce such files. Note that when the tree is plotted in a PostScript file, the default is to print in landscape format so that the tree will be vertical if the page is viewed in portrait format. To set the page in portrait format, you must set horizontal = FALSE in the function postscript.

In R, it is possible to add further graphical elements to an existing plot using the low-level plotting commands (see, e.g., [154, Chap. 4], for further details on how R graphics work). plot.phylo exploits this by letting the user manage the space around the tree. This can be accomplished in two non-exclusive ways: either through setting the margins, or by changing the scales of the axes.
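A possible way to write a figure to a portrait EPS file is sketched below. The file name, the figure size, and the use of tempdir() are assumptions made for the example, and a plain empty plot stands in for plot(tr) so that the code runs without ape. The settings onefile = FALSE and paper = "special" are the ones recommended for EPS output in the help page of postscript; they are not mentioned in the text above.

```r
# Hypothetical output file in the session's temporary directory
f <- file.path(tempdir(), "tree.eps")

# Portrait EPS: horizontal = FALSE as explained above; onefile and
# paper follow the advice of ?postscript for EPS files
postscript(f, horizontal = FALSE, onefile = FALSE,
           paper = "special", width = 5, height = 6)
par(mar = rep(1, 4))      # narrow margins, set after opening the device
plot(1:10, type = "n", axes = FALSE,
     xlab = "", ylab = "") # stand-in for plot(tr)
box(lty = 2)               # dashed box around the plot region
dev.off()                  # close the device to finish writing the file

file.exists(f)
```

Note that par is called after postscript: graphical parameters belong to the device that is current when they are set.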
Fig. 4.2. A leftwards cladogram with default label justification

Fig. 4.3. A leftwards cladogram with left-justified labels

When plotting a tree, the current margins are used. The size of the latter, in number of lines, can be found by querying the graphical parameters with the command par("mar"). By default, this gives:

    > par("mar")
    [1] 5.1 4.1 4.1 2.1

These can be changed with, for instance:

    par(mar = rep(1, 4))

The option no.margin = TRUE in plot.phylo has the same effect as doing:

    par(mar = rep(0, 4))

The margins of a graphic are usually used to add text around a plot: this is done with the function mtext (marginal text). The axes can also be drawn with the function axis, but this is likely to be informative only for the axis parallel to the branches. Also the default display of the tick marks may not be appropriate for the tree (see the functions axisPhylo and add.scale.bar, Section 4.1.1). Finally, the function box adds a box delimiting the margins from the plot region where the tree is drawn.

The other way to manage space around the tree is to alter the scales of the plotting region itself. plot.phylo draws the edges using the lengths of the "phylo" object directly, then computes how much space is needed for the labels, and sets the axes so that the plotting region is optimally used. Unless the axes are displayed explicitly with the axis function, the user does not know the size of the plotting region. However, plot.phylo invisibly returns (meaning that it is not normally displayed) a list with the option values when it was called. This list can be accessed by assigning the call; its elements are then extracted in the usual way:

    > tr.sett <- plot(tr)
    > names(tr.sett)
     [1] "type"            "use.edge.length" "node.pos"
     [4] "show.tip.label"  "show.node.label" "edge.color"
     [7] "edge.width"      "font"            "cex"
    [10] "adj"             "srt"             "no.margin"
    [13] "label.offset"    "x.lim"           "y.lim"
    [16] "direction"
    > tr.sett$x.lim
    [1] 0.0000000 0.1229417
    > tr.sett$y.lim
    [1]  1 14

This shows that the horizontal axis of the plot in Fig. 4.1 ranges from 0 to 0.123. To draw the same tree but leaving about half the space of the plot region free either on the right-hand side, or on the left-hand side, one can do:

    > plot(tr, x.lim = c(0, 0.246))
    > plot(tr, x.lim = c(-0.123, 0.123))

Drawing unrooted trees is a difficult task because the optimal positions of the tips and nodes cannot be found in a straightforward way. plot.phylo uses a simple algorithm, inspired by the program drawtree in Phylip, where clades are allocated angles with respect to their number of species [39]. With this scheme, edges should never cross. The option lab4ut (labels for unrooted trees) allows two positions for the tip labels: "horizontal" (the default) or "radial". Using the latter and adjusting the font size with cex is likely to give readable trees in most situations, even if they are quite large. Figure 4.4 shows an unrooted tree of the recent bird orders [140]. The command used is:

    plot(bird.orders, type = "u", font = 1, no.margin = TRUE)

Fig. 4.4. An unrooted tree of the bird orders

Circular trees are drawn with ade4 using the function radial.phylog. After converting our original tree as explained in Section 3.4.5, a simple call to this function results in Fig. 4.5. In ade4, the tip labels are drawn on the same level; the tips themselves are marked (by default) with black circles.

Fig. 4.5. A circular tree with radial.phylog

ape can also plot circular trees by using the option type = "radial" in plot.phylo but this does not take branch lengths into account. All tips are placed equispaced on a circle, the root being at the center of this circle. The nodes are then placed on concentric circles with distances from the outer circle depending on the number of descendant tips. Figure 4.6 shows an example with the families of birds [140]. This representation can be used for rooted and unrooted trees. It has the advantage of being easily computed; the lines have no chance to cross and the tips are equally spaced.

apTreeshape has its own plot method for trees of class "treeshape" (plot.treeshape): it results in a simple plot of a tree similar to the default behavior of plot.phylo (Fig. 3.1). An original feature of this method is the possibility of directly plotting two trees on the same graphical device with plot(t1, t2). Section 4.2 explains how to do similar plots with objects of class "phylo".

4.1.1 Annotating Trees

plot.phylo allows us to display node labels with the option show.node.label: this simply prints the labels using the same font and justification as for the tips. This option is very limited, and a more flexible mechanism is often needed to display clade names, bootstrap values, estimated divergence dates, and so on. Furthermore, the character strings displayed with show.node.label = TRUE are from the node.label element of the "phylo" object, whereas it may be necessary to display values coming from some other data.

Node Annotation

The function nodelabels offers a flexible way to add labels on a tree. It is a low-level plotting function: the labels are added on a previously plotted tree. It can print text (like the function text), plotting symbols (like points), or “thermometers” (like symbols) on all or some selected nodes. The formatting allows us to place the labels exactly on the node, or at a point around it, thus giving the possibility of adding information. The text can be framed with rectangles or circles, and colors can be used.

The number of options of nodelabels is quite small (Table 4.2), but it takes advantage of the ... (pronounced “dot-dot-dot”) argument of R’s methods. This “mysterious” argument means that all arguments that are not predefined (i.e., those not in Table 4.2 in the present case) are passed internally
to another function, in the present case either text or points (see below). Particularly, text has a few options to define font, character expansion, and position of the text (some examples are given in Table 4.2) which thus may be used in nodelabels.

Fig. 4.6. A circular tree using type = "radial" in plot.phylo
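The way the ... argument forwards options can be illustrated outside of any plotting function. In the sketch below, node.mean is a hypothetical function with one predefined argument (scale); any other argument given by the user, such as na.rm, is passed untouched to mean, in the same spirit as nodelabels passing cex or font to text.

```r
# Hypothetical function: 'scale' is predefined, the rest goes to mean()
node.mean <- function(x, scale = 100, ...) {
  mean(x, ...) / scale
}

bs <- c(76, NA, 88)

r1 <- node.mean(bs)                # NA: mean() keeps its default na.rm = FALSE
r2 <- node.mean(bs, na.rm = TRUE)  # na.rm is not an argument of node.mean;
                                   # it reaches mean() through ...
r1
r2
```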
Table 4.2. The options of nodelabels. The values marked with (d) are the default ones

    Option  Effect                         Possible values
    text    Text to be printed             A vector of strings; can be left
                                           missing (d)
    node    Nodes where to print           A vector of numerics or strings;
                                           can be left missing (d)
    adj     Position with respect          One or two numeric values
            to the node
    frame   Type of frame around text      "r" (d), "c", "n"
    pch     The type of plotting symbol    An integer between 1 and 25, or a
                                           character string
    thermo  Draw filled thermometers       A numeric vector or matrix with one
                                           or two levels
    col     Color for text or symbol       A character string or a color code
    bg      Color for the background       id. (default: "lightblue")
            of the frame or the symbol
    ...     Further arguments              cex=, font=, vfont=, offset=, pos=

The option pch is defined as NULL by default, meaning that some text will be printed by default; if pch is given a value, then text is ignored. The nodes where the labels are printed are specified with node: this is done using the numbers of the edge element of the "phylo" object. The numbers specified can be either positive (1, 2, ...) or negative (−1, −2, ...), and can also be given as character strings ("1", "2", ..., or "-1", "-2", ...). Obviously, it seems necessary to know these node numbers to use nodelabels, but this is not a difficulty: they can be displayed on the screen using this function with no argument (i.e., nodelabels(); Fig. 4.7). Another way to proceed is to assume that the vector of labels (or symbols to plot) is already ordered along the nodes: they will be displayed on the nodes in the correct order.

For a very simple operational example, consider plotting a tree showing the estimated divergence dates among gorillas, chimpanzees, and humans. We take the dates estimated by Stauffer et al. [145]:

    trape <- read.tree(text = "((Homo,Pan),Gorilla);")
    plot(trape, x.lim = c(-0.1, 2.2))
    nodelabels("6.4 Ma", 1, frame = "c", bg = "white")
    nodelabels("5.4 Ma", 2, frame = "c", bg = "white")

Because the labels need some space, we have to leave a little extra space between the root and the left-hand side margin, hence the use of the x.lim option (Fig. 4.8). We know that the root is numbered −1, so the first date is printed by simply giving 1 as second argument. Similarly, the second node
is obviously numbered −2. If the node numbers are omitted, the labels are printed successively on all nodes. Thus, the same figure could have been obtained with:

    plot(trape, x.lim = c(-0.1, 2.2))
    nodelabels(c("6.4 Ma", "5.4 Ma"), frame = "c", bg = "white")

Fig. 4.7. Display of node numbers with nodelabels()

Fig. 4.8. Adding dates with nodelabels

This is clearly useful if one has a large number of values to add on the tree.

It is also often necessary to print numeric values close to, but not exactly on, the nodes, for instance, bootstrap values. Usually, such values are arranged in a vector (say bs) and ordered along the node numbers, because this is the interface of the "phylo" objects. It is common to print the bootstrap values to the right of the nodes and without frames, which can be done simply with:

    plot(tr)
    nodelabels(bs, adj = 0, frame = "n")

In some cases, this may need to be tuned slightly because the labels will be stuck to the nodes and the font size may be too large (or too small): the former can be moved slightly rightwards by giving a small negative value to adj (e.g., adj = -0.2), and the font size can be set by using the option cex. Note that here a single value has been given to adj: this sets the horizontal justification only, and this conforms to R’s standard graphical functions (see ?par in R for details).

If a program outputs bootstrap values as node labels in a Newick tree, then this can be handled easily because once the tree has been read with read.tree these values are stored in the node.label element of the "phylo" object (see Section 3.1.1). They can be plotted with something like:

    plot(tr)
    nodelabels(tr$node.label, adj = 0, frame = "n")

It is also usual to plot several values around a node. Michaux et al. [100] showed on their tree bootstrap values from the different phylogeny reconstruction methods they used: parsimony, neighbor-joining, and maximum likelihood. This can be done by successive calls to nodelabels with different values for adj. The option font can be used to distinguish the different values. We first input the bootstrap values on the keyboard simply using scan:

    > bs.pars <- scan()
    1: NA 76 34 54 74 100 56 91 74 60 63 100 100
    14:
    Read 13 items
    > bs.nj <- scan()
    1: NA 74 48 68 75 100 NA 91 67 82 52 100 100
    14:
    Read 13 items
    > bs.ml <- scan()
    1: NA 88 76 73 71 100 45 81 72 67 63 100 100
    14:
    Read 13 items

There are of course many other ways to input these values. Note that we have given a missing value to the first node, because this is the root and the tree was rooted with an outgroup. We then plot the tree without the margins to leave more space for the bootstrap values, and add successively the latter with three calls to nodelabels (Fig. 4.9):

    plot(tr, no.margin = TRUE)
    nodelabels(bs.pars, adj = c(-0.2, -0.1), frame = "n",
               cex = 0.8, font = 2)
    nodelabels(bs.nj, adj = c(1.2, -0.5), frame = "n",
               cex = 0.8, font = 3)
    nodelabels(bs.ml, adj = c(1.2, 1.5), frame = "n", cex = 0.8)
    add.scale.bar(length = 0.01)

Fig. 4.9. Adding bootstrap values

The last command adds a scale bar (see below for explanation of this function).

To graphically display the different levels of a single proportion, say bs.ml, we can use the option thermo. It represents the proportions of two or more categories as a filled thermometer. This representation is less usual than circular symbols such as pie charts, but the latter are less intelligible, particularly with more than three proportions. The commands are (Fig. 4.10):

    plot(tr, no.margin = TRUE)
    nodelabels(thermo = bs.ml/100, col = "grey", bg = "white")

Fig. 4.10. Plotting proportions on nodes with thermometers

We now illustrate the use of the pch option by plotting symbols instead of the raw numeric values. For this, we consider again the bootstrap values of the maximum likelihood method (bs.ml). Suppose we want to plot a filled circle for a bootstrap value greater than or equal to 90, a grey circle for a value between 70 and 90, and an open circle for a value less than 70. We first create a vector of mode character and assign strings with respect to the original bootstrap values according to the rules defined above.

    p <- character(length(bs.ml))
    p[bs.ml >= 90] <- "black"
    p[bs.ml < 90 & bs.ml >= 70] <- "grey"
    p[bs.ml < 70] <- "white"

We can now plot the tree, then call nodelabels giving p as value for the option bg. We also specify pch = 21 which uses a color-filled circle.

    plot(tr, no.margin = TRUE)
    nodelabels(node = 2:13, pch = 21, bg = p[-1], cex = 2)

Here we must use node to avoid a symbol being plotted at the root. Also we have to tell the option bg to ignore the first value of p (which is actually an empty string). To finish the figure, we further add a legend by two calls to points and text (Fig. 4.11):

    points(rep(0.005, 3), 1:3, pch = 21, cex = 2,
           bg = c("black", "grey", "white"))
    text(rep(0.01, 3), 1:3, adj = 0,
         c("90 <= BP", "70 <= BP < 90", "BP < 70"))

The function tiplabels plots labels at the tips of the tree, and has exactly the same syntax as nodelabels except that the argument node is replaced by tip.
Fig. 4.11. Plotting symbols on nodes

Axes and Scales

ape has two low-level plotting functions that add an indication of the scale of the branches on a phylogeny plot. add.scale.bar() adds a short bar at the bottom left corner of the plotting region. If this default location is not suitable, it can be modified with the arguments x and y. The length of the bar is calculated from the lengths of the plotted tree (so this works even if the tree has no branch lengths); this can be modified too with the length option (see Fig. 4.9).

axisPhylo() adds a scale on the bottom side of the plot which scales from zero on the rightmost tip to increasing values leftwards (see Figs. 4.17 and 4.18). If the tree is ultrametric, this may represent a time scale. The option side allows us to draw the scale on different sides of the plot: side = 1 (the default) draws it below, 2 on the left, 3 above, and 4 on the right. Note that either 2 or 4 should be used if the tree is vertical.

Manual Annotation

R’s low-level plotting commands can be used to annotate a tree manually once it has been plotted. The useful functions in this context are text, segments, arrows (all have explicit names), and mtext (marginal text). Except for the last one, the coordinates must be given by the user. A simple, but hopefully didactic, example plots a four-taxon tree, and adds various annotations (Fig. 4.12):
    tree.owls <- read.tree(text = "(((Strix_aluco:4.2,Asio_otus:4.2):3.1,Athene_noctua:7.3):6.3,Tyto_alba:13.5);")
    plot(tree.owls, x.lim = 19)
    box(lty = 2)
    text(2, 1.5, "This is a node", font = 2)
    arrows(3.5, 1.55, 6.1, 2.2, length = 0.1, lwd = 2)
    text(0.5, 3.125, "Root", srt = 270)
    points(rep(18.5, 4), 1:4, pch = 15:18, cex = 1.5)
    mtext("Simple text above")
    mtext("Text above with \"line=2\"", at = 0, line = 2)
    mtext("Text below (\"side=1\")", side = 1)
    mtext("Text in the left-hand margin (\"side=2\")",
          side = 2, line = 1)

Fig. 4.12. Manual annotation of a tree

The call to box helps to visualize the limit between the plotting region and the margins. Note the use of the option x.lim to leave a little extra space for the symbols plotted by points. By default, mtext prints the text at the center of the closest line to the plotting region: this is altered by the options at and line, respectively, as illustrated above. Note how double quotes are specified inside a character string: a backslash is needed to escape them. Colors (which are not used here) can be specified in all of these functions with the col option.
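The escaping of double quotes can be checked without drawing anything. The short sketch below also shows an alternative that is not used in the text: delimiting the string with single quotes, in which case the double quotes inside need no backslash. Both forms give the same string, and the backslash itself is not part of it.

```r
s1 <- "Text above with \"line=2\""  # double quotes escaped with a backslash
s2 <- 'Text above with "line=2"'    # same string, delimited by single quotes

same.string <- identical(s1, s2)
has.backslash <- grepl("\\", s1, fixed = TRUE)  # is a backslash stored?

cat(s1, "\n")  # cat() shows the string as it will be drawn by mtext()
```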
Fig. 4.13. Simple bars

4.1.2 Showing Clades

Trees are statistical tools for classification of observations, and it is obvious that in some situations clades (monophyletic groups) need to be identified in a plotted phylogeny. This may be for simple illustrative purposes, for instance, to show how different groups segregate on a phylogeny, or for exploratory reasons. In the latter case, an automated approach is clearly required.

I have found four ways commonly used in the literature to show clades on a phylogeny:

• Drawing bars facing the tips of the clade;
• Labeling the node corresponding to the most recent common ancestor of the clade;
• Coloring the branches of the clade;
• Drawing an ellipse or a rectangle over the branches and tips belonging to the clade.

The second approach is covered in Section 4.1.1. The first and fourth approaches are mostly appropriate for illustrative purposes, whereas the second and third ones are the best suited for exploratory analyses.

Bars can be added easily on the side of a tree with the low-level plotting command segments. The options of this function that are useful in this context are lwd for the line width and col for its color. When drawing such bars, it will be necessary to leave some space on the appropriate side of the plot. It is useful to know that the tips of the tree are drawn in the same order as in the element tip.label in the "phylo" object, and their coordinates on the
y-axis are 1, 2, and so on. This may be helpful in specifying the coordinates of the vertical bars. Figure 4.13 shows a simple example with a phylogeny of bird orders; the commands used were:

    plot(bird.orders, font = 1, x.lim = 40, no.margin = TRUE)
    segments(38, 1, 38, 5, lwd = 2)
    text(39, 3, "Proaves", srt = 270)
    segments(38, 6, 38, 23, lwd = 2)
    text(39, 14.5, "Neoaves", srt = 270)

Some arguments are obviously repeated in the successive calls to segments and text: they are the coordinates of the plotted objects. These calls may be grouped in a single one (e.g., text(rep(39, 2), c(3, 14.5), c("Proaves", "Neoaves"), srt = 270)); they were kept distinct for clarity.

Colors are interesting for showing clades, because this can be somewhat automated in R, and thus used for exploratory graphical analyses. In plot.phylo, the options edge.color and edge.width allow us to specify the color and width of each branch of the tree. For instance, edge.color = "blue" will color all edges in blue. As many colors as the number of branches may be specified, the values being possibly recycled: edge.color = c("blue", "red") will color the first, third, ..., branches in blue, and the second, fourth, ..., in red. The problem is to know the numbers of the branches. This may be easy with a small "phylo" object by printing it and then visually finding the number of each branch. However, this may be more difficult with large trees. The function which.edge may be used here because it returns the indices of the branches that belong to a specified group. The latter may be not monophyletic, in which case the indices will include branches up to the most recent common ancestor of the group. For instance, using the same bird phylogeny:

    > wh <- which.edge(bird.orders, 19:23)
    > wh
     [1] 31 35 37 38 39 40 41 42 43 44

It is now easy to define a vector of colors to be used in plot.phylo. We first repeat a default color (say black) with as many branches as in the tree:

    colo <- rep("black", dim(bird.orders$edge)[1])

The command dim(...)[1] extracts the number of rows in the element edge of the tree:² we now have a vector with 45 repetitions of "black". The colors of the clades defined above (tips 19–23) are simply modified with:

    colo[wh] <- "grey"

² This could be done with length(bird.orders$edge.length), but this will not work if the tree has no branch length.
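The construction of the color vector can be tried on a small edge matrix without loading a tree. In the sketch below, edge is a hand-made stand-in for the edge element of a four-tip "phylo" object, and wh is a set of indices chosen by hand in place of the output of which.edge; both are assumptions made for the example. nrow(edge) is used as a shorter equivalent of dim(edge)[1].

```r
# Hand-made stand-in for tr$edge: one row per branch (ancestor, descendant)
edge <- matrix(c(5, 6,
                 6, 1,
                 6, 2,
                 5, 7,
                 7, 3,
                 7, 4), ncol = 2, byrow = TRUE)
n.edge <- nrow(edge)           # same value as dim(edge)[1]

wh <- c(4, 5, 6)               # assumed indices of the branches of a clade

colo <- rep("black", n.edge)   # default color for all branches
colo[wh] <- "grey"             # branches of the clade in grey
colo

# Recycling as described in the text, written out explicitly
alt <- rep(c("blue", "red"), length.out = n.edge)
alt
```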
The tree can now be drawn. We use wider lines to display the difference in colors better (Fig. 4.14):

    plot(bird.orders, "c", FALSE, font = 1, edge.color = colo, edge.width = 3, no.margin = TRUE)

[Fig. 4.14. Simple edge colors]

Showing a clade with a frame or an ellipse is not so easy because if the contour is added after the tree is plotted, it will overlap the latter and hide a portion of it if a colored background is chosen. An obvious solution is to plot a contour without background (which is the default in most functions in R). For instance, with the bird phylogeny, if we want a rectangle showing the clade of the first five orders, we could do:

    plot(bird.orders, font = 1)
    rect(1.2, 0.5, 36, 5.4, lty = 2)

By default, the lines of the rectangle are the same as those of the tree edges, hence it may be good to distinguish them with the usual options (lty = 2 specifies dashed lines). The numeric arguments to rect give the position of the leftmost, lower, rightmost, and upper sides of the rectangle. Those can be obtained with the locator function, which returns the coordinates on the current R plot of points indicated by the user with a pointer (usually the mouse of the computer).

A less straightforward, but maybe more efficient, solution is to edit the code of plot.phylo, and add the above call to rect just after the call to plot. This will draw the rectangle before the tree. A possible set of commands may be (Fig. 4.15):
    fix(plot.phylo)
    ## add rect(1.2, 0.5, 36, 5.4, col = "lightgrey")
    ## just after plot(0, ....)
    ## then save and close the editor
    plot(bird.orders, font = 1, no.margin = TRUE)

[Fig. 4.15. A framed clade]

Note that the modifications done by fix alter only the functions loaded in memory, not the ones on the disk. Thus the original functions are restored when R is closed.

4.2 Combining Plots

It may be enlightening to combine several plots in a single figure. This may be needed to indicate the distribution of some variables among recent species (represented by the tips of the tree). ape has no special function to combine trees with other plots: this must be done with standard R functions. ade4 has a few special functions to plot variables facing the tips of a tree. Let us first see what can be done with them.

If a variable must be plotted facing the tips of the tree, symbols.phylog or dotchart.phylog can be used. To illustrate them, first convert the class of our owl tree, and create a vector x with the mean body length (in cm) of these four species; the plots are then made (Fig. 4.16):

    tg <- newick2phylog(write.tree(tree.owls))
    x <- c(38, 36, 22, 34)
    symbols.phylog(tg, squares = x)
    dotchart.phylog(tg, x)

[Fig. 4.16. The functions symbols.phylog (left) and dotchart.phylog (right)]

table.phylog is a multivariate version of symbols.phylog: the tree is plotted horizontally facing a matrix with symbols representing the variables arranged in columns. It is preferable that the variables are on the same scale.

To have a more flexible way of plotting variables, one can use plot.phylo and manually add further graphical elements. It is useful to know here that when plotting a phylogram or a cladogram, the tips have the coordinates 1, 2, and so on (whatever the direction). It is thus possible to add, for instance, horizontal bars after leaving extra space with x.lim (or y.lim if the tree is vertical). We could, for instance, plot the species richness of each avian order facing the corresponding phylogeny. We have the vector Orders.dat with names set as the orders:

    > Orders.dat <- scan()
    1: 10 47 69 214 161 17 355 51 56 10 39 152
    13: 6 143 358 103 319 23 291 313 196 1027 5712
    24:
    Read 23 items
    > names(Orders.dat) <- bird.orders$tip.label
    > Orders.dat
    Struthioniformes     Tinamiformes      Craciformes
                  10               47               69
         Galliformes     Anseriformes    Turniciformes
                 214              161               17
          Piciformes    Galbuliformes   Bucerotiformes
                 355               51               56
         Upupiformes    Trogoniformes    Coraciiformes
                  10               39              152
         Coliiformes     Cuculiformes   Psittaciformes
                   6              143              358
         Apodiformes   Trochiliformes  Musophagiformes
                 103              319               23
        Strigiformes    Columbiformes       Gruiformes
                 291              313              196
       Ciconiiformes    Passeriformes
                1027             5712

[Fig. 4.17. Bars in the face of a tree plotted with plot.phylo; the custom horizontal axis is labeled ln(species richness)]

Fortunately, the data are in the same order as in the tree. (If they were not in the correct order, the names would solve this easily with Orders.dat[bird.orders$tip.label].) We can thus proceed in a straightforward manner (Fig. 4.17):

    plot(bird.orders, x.lim = 50, font = 1, cex = 0.8)
    segments(rep(40, 23), 1:23, rep(40, 23) + log(Orders.dat), 1:23, lwd = 3)
    axis(1, at = c(40, 45, 50), labels = c(0, 5, 10))
    mtext("ln(species richness)", at = 45, side = 1, line = 2)
    axisPhylo()

Once we have determined that the bars will span between 40 and 50 on the horizontal scale (which could be done by examining the default x.lim of plot.phylo), it is easy to set the other values in the command. Note how we draw a ‘custom’ scale on the x-axis. We did not use no.margin = TRUE, in order to leave some space for the scales under the plot.

In the examples we have seen above, the different graphics were plotted in the same plotting region. It is also possible to plot different graphs on the same graphical device. This is usually done by splitting the graphical device (i.e., the window or the file) into several regions, then successively calling different high-level plotting functions. The most useful approach is to use the function layout. The main argument of this function is a matrix with integer numbers indicating the numbers of the ‘subwindows’. For instance, to divide the device into four equal parts:

    > layout(matrix(1:4, 2, 2))

Printing the matrix makes clear how the device is divided:

    > matrix(1:4, 2, 2)
         [,1] [,2]
    [1,]    1    3
    [2,]    2    4

The first graph will be plotted in the top-left quarter, the second in the bottom-left quarter, the third in the top-right quarter, and the fourth in the bottom-right quarter. Whereas with:

    > matrix(c(1, 1, 2, 3), 2, 2)
         [,1] [,2]
    [1,]    1    2
    [2,]    1    3

the first graph will span the left half of the device, and the second and third ones will be in the top-right and bottom-right quarters, respectively. Quite a large number of graphs can be plotted on the same device, for instance 16 with (see footnote 4):

    > matrix(1:16, 4, 4)
         [,1] [,2] [,3] [,4]
    [1,]    1    5    9   13
    [2,]    2    6   10   14
    [3,]    3    7   11   15
    [4,]    4    8   12   16

Footnote 4: It may happen that R cannot plot the graphs if there is not enough space in the plotting region.
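To make the mapping between the layout matrix and the plotting regions concrete, here is a small sketch using only base R graphics. It assumes nothing about ape; the placeholder plots and the null PDF device (used so that the code also runs without a display) are choices made for this illustration, not part of the original example.

```r
# Open a null PDF device so the sketch also runs without a display
# (file = NULL means no file is written).
pdf(file = NULL)

# Region 1 fills the left column; regions 2 and 3 share the right column
m <- matrix(c(1, 1, 2, 3), 2, 2)
print(m)

# layout() invisibly returns the number of figure regions it has defined
n_regions <- layout(m)

# Each high-level plotting call fills the next region in numeric order
for (i in seq_len(n_regions)) {
  plot(1:10, main = paste("region", i))
}

invisible(dev.off())
```

On an interactive device, calling layout.show(n_regions) right after layout(m) outlines the regions and prints their numbers, which is a quick way to check a matrix before plotting.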
The layout function gives a lot of possibilities. To illustrate this, we consider plotting two trees of the same species but showing different information. Let us come back to the Apodemus data (Fig. 4.1). Michaux et al. [100] estimated divergence dates on their tree using a molecular clock. The tree on Fig. 4.1 could also be analyzed with the nonparametric rate smoothing method of Sanderson [135] using the calibration point of 12 Ma (million years ago) for the divergence Mus/Rattus. This is done with the function chronogram (Section 5.4). We can proceed very easily by reading the clock tree of Michaux et al., computing the chronogram, splitting the graphical device in two, and finally plotting both trees successively. The set of needed commands is straightforward:

    trk <- read.tree("Apodemus_molclock.tre")
    trc <- chronogram(tr, scale = 12)
    layout(matrix(1:2, 1, 2))
    plot(trk)
    plot(trc, show.tip.label = FALSE, direction = "l")

The figure obtained this way will not display the information nicely because of the default margins, which are too wide here. We need a little extra work to make the figure informative. We first change the tip labels of the first tree to replace the genus names with their initials. This could be done manually by editing trk$tip.label and replacing "Apodemus agrarius" with "A. agrarius", and so on. Fortunately, R has functions that manipulate regular expressions, which considerably facilitates this kind of task. Here we use the function gsub (global substitution), for instance:

    trk$tip.label <- gsub("Apodemus", "A.", trk$tip.label)

will replace every occurrence of "Apodemus" by "A.". We could do this for the five genera in the tree but this is still tedious, and there is a more general solution:

    trk$tip.label <- gsub("[[:lower:]]{1,}_", "._", trk$tip.label)

The regular expression "[[:lower:]]{1,}_" means “one or more lowercase letter(s) followed by an underscore”. We clearly take advantage of the fact that the genus and species names are separated by this last character. We can now plot the trees but we need to take care of the space around both. Let us first see the whole set of commands, then explain what has been done. The resulting plot is in Fig. 4.18.

    layout(matrix(1:2, 1, 2), widths = c(1.4, 1))
    par(mar = c(4, 0, 0, 0))
    plot(trk, adj = 0.5, cex = 0.8, x.lim = 16)
    nodelabels(node = 12, "?", adj = 2, bg = "white")
    axisPhylo()
    plot(trc, show.tip.label = FALSE, direction = "l")
    axisPhylo()

[Fig. 4.18. Facing trees]

The critical options are widths for layout and x.lim for plot: they allow us to have both trees of the same size on the figure. These commands will work for any other data provided these two options are set correctly. Note that we remove the space around the trees except that below, so we cannot use the option no.margin of plot.phylo: instead we use the par function. The call to nodelabels is to indicate that one node (the divergence between the two species of Mus) was not dated by Michaux et al. [100]. Finally, we draw the axis below each tree using axisPhylo.

Note the possibility with layout of inserting a graph within a larger one. In principle the different subwindows are completely independent, but if one of them is surrounded by another, then the graph in the first will overlap with the second. For instance, with the following matrix given as argument to layout:

    matrix(c(2, 1, 1, 1), 2, 2)
         [,1] [,2]
    [1,]    2    1
    [2,]    1    1
the first graph will be plotted on the whole graphical device, and the second one will be on the top-left quarter, thus potentially partially overlapping the first one. To further reduce the size of the insert, one could do (see footnote 5):

    layout(matrix(c(2, rep(1, 8)), 3, 3))

Footnote 5: layout has options widths and heights to modulate the sizes of the subwindows in a more flexible way than done here.

Here is an example of how this could be used (Fig. 4.19):

    plot(bird.orders, "p", FALSE, font = 1, no.margin = TRUE)
    arrows(4.3, 15.5, 6.9, 12, length = 0.1)
    par(mar = c(2, 2, 0, 0))
    hist(rnorm(1000), main = "")

[Fig. 4.19. Insert a histogram]

4.3 Large Phylogenies

Large trees are a puzzle for phylogeneticists because trees are themselves ways to summarize the relationships among species and other taxonomic units, but when they reach a certain size, the information that was supposed to be summarized is likely to be no longer visible. The recent literature has seen the definition of a terminology about “large trees”, “very large trees”, and even “huge trees” reaching tens of thousands of tips, but it is clear that even a
tree with a few hundred tips may hide the phylogenetic information that was originally sought. Large trees have become an issue with the availability of larger and larger molecular databases such as GenBank, and the development of ambitious projects to assemble the tree of life. Large trees are also becoming present in fields such as genomics where a single experiment can result in thousands of observations.

The general strategy to visualize a large tree is to plot only a portion of the full phylogeny, while indicating its context, that is, how it relates to the rest of the tree. We show that most of the necessary ingredients to visualize and explore large trees are present in various functions in ape. plot.phylo and drop.tip may be used in conjunction with R's functions layout and X11 to give a powerful and flexible environment for the graphical exploration of phylogenies. One function in ape, zoom, integrates these ideas to give an automated way to explore large trees.

We have seen that drop.tip removes some terminal branches from a "phylo" object, and possibly trims the corresponding internal branches. It is thus possible to use this function to extract a subtree by passing all but the wanted tips as argument. If one has the numbers of the wanted tips, say in a vector x, this can be done with:

    drop.tip(tr, tr$tip.label[-x])

Alternatively, if x is a vector with the labels of the tips to be kept, one could do:

    drop.tip(tr, which(!tr$tip.label %in% x))

The expression tr$tip.label %in% x returns a logical value for each tip label: it is TRUE if the label is in x, FALSE otherwise. The operator ! inverts these logical values, and the function which returns the indices of those that are TRUE.

Thus the action of drop.tip is quite straightforward, but it may be useful to show in some way the relationship of the returned subtree with the original tree. This can be done with the option subtree which takes a logical value. If it is TRUE (the default is FALSE), a branch is included in the returned tree that shows how many tips have been deleted in the operation; this is done for as many monophyletic groups as have been removed.

Let us see how this works with a supertree of the mammal order Chiroptera [76]. Our goal is to extract a subtree with the first 15 tips. The tree has 921 tips, thus the second argument to drop.tip could either be 16:921 or chiroptera$tip.label[-(1:15)] with exactly the same result. We then plot the extracted tree (Fig. 4.20). The three commands are:

    data(chiroptera)
    tr <- drop.tip(chiroptera, 16:921, subtree = TRUE)
    plot(tr, font = c(rep(3, 15), rep(2, 3)), cex = 0.8, no.margin = TRUE)

[Fig. 4.20. Extracting a subtree]

Note how we specified the font argument to have only the species names in italics.

drop.tip can thus be used to explore large trees. One can use layout, as we have seen above, to plot the whole tree and a subtree on the same device. Another possibility is to open another device and plot the whole tree and the subtrees on the different devices. For instance, to explore the bat supertree, the following commands can be used.

    plot(chiroptera)
    X11()
    plot(tr)

This will open a second graphical window, and plot the extracted subtree. Because this second window is the active device, all subsequent graphics will be plotted in it (see footnote 6).

Footnote 6: See ?dev.list on how to set the priority of graphical devices.

zoom is a function that allows exploration of large trees in a more user-friendly way. Its principle is to plot the whole tree in the left third of the device, and one or several subtrees in the remaining portion of the device. The locations of the subtrees are indicated with colors on the whole tree. The subtree(s) is (are) specified in the same way as in drop.tip. There are two options: subtree which has the same effect as in drop.tip, and col
which indicates the colors to be used. By default, a preset rainbow palette is used. Any further argument recognized by plot.phylo (see Table 4.1) may be passed thanks to the “dot-dot-dot” argument (see p. 71). A simple example of the use of zoom could be (Fig. 4.21):

    data(bird.families)
    zoom(bird.families, 1:15, col = "grey", no.margin = TRUE, subtree = TRUE)

[Fig. 4.21. Using zoom]

We have set subtree = TRUE (the default is FALSE) to show the context of the specified subtree, and no.margin = TRUE (which is passed to plot.phylo as part of the “dot-dot-dot” argument) to use as much space as available on the device. If several subtrees need to be visualized on the same plot, they have to be specified as a list (because they could differ in size). For instance (Fig. 4.22),

    zoom(bird.families, list(1:15, 38:48), col = rep("grey", 2), no.margin = TRUE, font = 1, subtree = TRUE)

Here we have used the same grey color for both subtrees, but by default red and cyan (green-blue) are used.

4.4 Perspectives

The graphical analysis and exploration of phylogenies are in their early days. There is undoubtedly much to expect from research in this area. With the now
widespread availability of powerful computers, it will be possible to explore and analyze large phylogenies in a flexible way. Future developments will need to take care of integration with other tools, and operability for the interchange of information among different systems.

[Fig. 4.22. Using zoom to show two groups]

The examples presented in this chapter all use the graphics package of R, which is the default graphical environment of R. Future developments may consider instead the grid package developed by Paul Murrell. This is a reimplementation of R's graphical environment with greater performance and flexibility. Among the improvements are:

• Graphical objects are editable and can be modified without redrawing the whole plot;
• Plots may be arranged in many ways (rotated, scaled, overlapping, etc.);
• The user can “navigate” among plots;
• Graphical objects may be shared among plots.

Using grid clearly needs further development but some of the existing code in ape can be reused directly (such as the functions that compute the coordinates of the edges of the tree). Another example of a potentially useful development is the use of the 3-D graphical library OpenGL, which is already interfaced with R via the package rgl. I have already conducted some experiments with both grid and rgl demonstrating the ease of such adaptations. The issue now is to determine which tools need to be developed on these environments.

4.5 Exercises

1. Draw Fig. 4.11 using a color scale in place of the grey one. The figure should include a legend.
2. Plot the phylogeny of avian orders, and color the Proaves in blue. Repeat this but only for the terminal branches of this clade.
3. Suppose you have a factor, say representing a character state, for each node and each tip of a tree. Find a way to associate a color with each branch depending on the state at both ends of the branch.
5 Phylogeny Estimation

Reconstructing the evolutionary relationships among living species is one of the oldest problems in biology. It has clearly enjoyed an increasing interest as witnessed by the reviews published in the last few years [4, 12, 68, 69, 155]. There have been some real advances during the past two decades, but several difficulties remain.

• The estimation of phylogenies is a computationally hard problem which is analytically intractable in the general case [19].
• Realistic models of character evolution involve many parameters, and it is likely that real processes are much more complex than the most complex models available in the literature.
• A common biological complication is that the species and the characters under study do not have the same history; this is particularly the case for genetic data [4].
• It is often necessary to estimate many parameters simultaneously but only some of them are of interest [68].
• There is some confusion in the use of some terminology related to estimation and statistics that is likely to reveal difficulties in communicating across different scientific fields [69].
• Some confusion arises because phylogeny estimation methods are also used for systematics (i.e., classification of species) rather than estimating evolutionary parameters.
• Many studies assessed the “performance” of phylogenetic methods using simulations but these considered only special cases, and the conclusions drawn from these simulations are of very limited value [69].
• The different methods, models, and algorithms for phylogeny estimation are available in distinct programs resulting in several practical difficulties.

The last point is of particular interest here. All these programs have their own features and requirements in terms of operating systems, user interfaces, data formats, or licenses. Many of them are not free. Comparing different
methods is difficult because it is often hard to decide whether the observed differences in the results are due to different assumptions, algorithms, run-time environments, computer architectures, or other features that vary among programs. Even the analysis of a single data set is made difficult by the need to switch between different software and/or operating systems.

The development of phylogeny estimation in R is very new, and some progress has been made in distance-based and maximum likelihood methods. This is limited compared to the methods available in the literature (particularly with respect to the old, well-established parsimony methods, and the current success of Bayesian methods). There are good reasons to focus on distance and likelihood methods, because these methods have been shown to perform well in a number of situations (although we have to be cautious in generalizing these conclusions as mentioned above). There has been a long-lasting debate on the merits of parsimony, and although this method has been severely criticized [37], it can be viewed as a valid nonparametric method [69]. Bayesian methods enjoy a current success, but some critics pointed out the limitations of this approach [39, 148]. However, Bayesian phylogeny estimation may be implemented in a straightforward way because all the necessary ingredients exist in R or have been developed in various packages.

5.1 Distance Methods

Distance methods have a long history because in their simplest formulation they are generally tractable even with a large amount of data [152]. I concentrate on only two methods: UPGMA and neighbor-joining. The first section deals with how to compute distances in R.

5.1.1 Calculating Distances

There is a difference between the concepts of statistical and evolutionary distances. In statistics, a distance can be viewed as a “physical” or geometric distance between two observations, each variable being a dimension in a hyperspace. In evolutionary biology, a distance is an estimate of the divergence between two units (individuals, populations, or species). This is usually measured in quantity of evolutionary change (e.g., numbers of mutations).

R has various functions to compute distances available in different packages. Table 5.1 lists these functions, which are detailed in the following sections.

Table 5.1. Functions for computing distances in R

    Package   Function      Data Types
    stats     dist          Continuous or binary
              cophenetic    Objects of class "hclust" or "dendrogram"
    cluster   daisy         Continuous and/or discrete
    ade4      dist.binary   Binary
              dist.prop     Relative frequencies
              dist.genet    An object of class "genet"
    ape       dist.gene     Discrete
              dist.dna      Aligned DNA sequences
              weight.taxo   ‘Taxonomic’ levels
              cophenetic    An object of class "phylo"

Classical Distances

R has a rich set of methods to compute classical distances. dist in package stats performs distance calculations taking a matrix as its main argument. Its main option is method which can take one of the six following strings: "euclidean" (the default), "maximum", "manhattan", "canberra", "binary", or "minkowski". As a simple example:

    > X <- matrix(rep(c(0, 1, 5), 3), 3)
    > rownames(X) <- LETTERS[1:3]
    > X
      [,1] [,2] [,3]
    A    0    0    0
    B    1    1    1
    C    5    5    5
    > dist(X)
             A        B
    B 1.732051
    C 8.660254 6.928203
    > dist(X, method = "maximum")
      A B
    B 1
    C 5 4
    > dist(X, method = "manhattan")
       A  B
    B  3
    C 15 12

dist returns an object of class "dist" which is a vector storing only the lower triangle of the distance matrix (because it is symmetric and all its diagonal elements are equal to zero). These objects can be converted to matrices using the generic function as.matrix, and matrices can be converted with as.dist:

    > d <- dist(X)
    > class(d)
    [1] "dist"
    > as.matrix(d)
             A        B        C
    A 0.000000 1.732051 8.660254
    B 1.732051 0.000000 6.928203
    C 8.660254 6.928203 0.000000

The function daisy in the package cluster also performs distance calculations but it implements some methods that can deal with mixed data types. Two metrics are available via the option metric: "euclidean" or "manhattan". The data types are specified with the option type.

Evolutionary Distances

ape has two functions to calculate evolutionary distances: dist.gene and dist.dna. They handle allelic data and DNA sequences, respectively. Additionally ade4 has the function dist.genet that computes distances between populations using allele frequency data.

dist.gene provides a simple interface to compute the distance between two haplotypes using a simple binomial distribution of the pairwise differences. This allows us to compute easily the variance of the estimated distances with the expected variance of the binomial distribution. The input data are a matrix or a data frame where each row represents a haplotype, and each column a locus.

dist.dna provides a comprehensive function for the estimation of distances from aligned DNA sequences using substitution models (Table 5.2). If a correction for among-sites heterogeneity (usually based on a Γ distribution) is available, this may be taken into account. The variances of the distances can be computed as well.

dist.genet takes as input the allele frequencies from one or several loci, and computes the distances between populations. The data must be a list of class "genet". Such a list may be obtained from a matrix with the function char2genet (see the help of this function for details). Five methods are available to compute these distances: standard (or Nei's), angular (or Edwards's), Reynolds's, Rogers's, and Prevosti's. This is specified with the option method which takes an integer value between 1 and 5.

By contrast to dist.gene, dist.dna and dist.genet return an object of class "dist".

Special Distances

The package ade4 has two functions that compute distances with some special types of data: dist.binary and dist.prop, for binary data and proportions, respectively. The first one has the option method which takes an integer between 1 and 10; this includes the well-known Jaccard, and the Sokal and Sneath methods. The second function has a similar option taking an integer between 1 and 5; this includes Rogers's, Nei's, and Edwards's methods.

Table 5.2. Options of the function dist.dna

    Option             Effect                                  Possible Values
    model              Specifies the substitution model        "raw", "JC69", "K80" (d), "K81", "F81",
                                                               "F84", "T92", "TN93", "GG95"
    variance           Whether to compute the variances        FALSE (d), TRUE
    gamma              The value of α for the Γ correction     NULL (no correction) (d), a numeric
                                                               giving the value of α
    pairwise.deletion  Whether to delete the sites with        FALSE (d), TRUE
                       missing data in a pairwise way
    base.freq          The frequencies of the four bases       NULL (calculated from the data) (d),
                                                               four numeric values
    as.matrix          Whether to return the results as a      TRUE (d), FALSE
                       matrix or as an object of class "dist"

ape has the function weight.taxo that computes a similarity matrix between observations characterized by categories that can be interpreted as a taxonomic level (i.e., a numeric code, a character string, or a factor). The value is 1 if both observations are identical, 0 otherwise.

Finally, stats has a generic function cophenetic that computes the distances among the tips of a hierarchical data structure: there are methods for objects of class "hclust", "dendrogram", and "phylo".

5.1.2 Simple Clustering and UPGMA

There is a corpus of phylogeny estimation methods that are based on statistical clustering methods. They were popular in the past, but their use has declined since the rise of likelihood and Bayesian methods. These methods are limited, mostly because of their assumption of constant rates of evolution [106]. We do not consider them in detail, but using these methods is a nice illustration of how different functions from different packages in R can interact simply.

R has a reasonably large number of functions that perform clustering [154]. They mostly work on a distance (also called dissimilarity) matrix, but some of them work directly on the original data matrix (observations and variables). Remarkably, a tree estimated with the unweighted pair-group method using arithmetic average (UPGMA) is built in exactly the same way as a hierarchical clustering with the average method. Thus such a tree can be estimated in a straightforward way, for instance, from a set of DNA sequences named X with:

    M <- dist.dna(X)
    hc <- hclust(M, "average")
    tr <- as.phylo(hc)

The substitution model can be changed with the appropriate option in dist.dna. Given the graphical functions detailed in the previous chapter, it is easy to compare the trees estimated with different substitution models; for instance:

    M1 <- dist.dna(X)
    tr1 <- as.phylo(hclust(as.dist(M1), "average"))
    M2 <- dist.dna(X, model = "F84")
    tr2 <- as.phylo(hclust(as.dist(M2), "average"))
    layout(matrix(1:2, 2, 1))
    plot(tr1, main = "Kimura (80) distances")
    plot(tr2, main = "Felsenstein (84) distances")

We show some practical examples in Section 5.5.

5.1.3 Neighbor-Joining

The neighbor-joining (NJ) method is a fast and straightforward method for estimating a phylogenetic tree from a distance matrix [134]. Its principle is to construct a tree by successive pairing of taxa (the neighbors): the pair that leads to the tree with the smallest total branch length is selected. The procedure is iterated until the tree is dichotomous.

ape has the function nj that performs the NJ algorithm. Its use is extremely simple: it takes a distance matrix as its only argument, and returns the estimated tree as an object of class "phylo". As for the UPGMA, it is easy to obtain NJ trees with different substitution models. It is also possible to call nj repeatedly for a series of models:

    mod <- list("JC69", "K80", "F81", "F84")
    lapply(mod, function(m) nj(dist.dna(X, model = m)))

In the above command, we wrap the call to dist.dna and the call to nj in a function where the model is treated as a variable. lapply then dispatches the different models to this function, and returns the results as a list.

A strength of the NJ method is that it is fast [152], even with large sample sizes, both in terms of number of tips (which is dealt with by the NJ method) and in terms of number of sites (which is dealt with by the distance computation methods).

5.2 Maximum Likelihood Methods

Maximum likelihood is the cornerstone of modern statistics [27, 30]. The two critical ingredients in estimating a phylogeny by maximum likelihood are:
• A parametric model of evolution appropriate for the characters;
• An algorithm that will search through the trees in order to find the maximum likelihood one.

All the other ingredients (deriving the probability distribution of the data and the likelihood function, etc.) are somewhat straightforward. The model chosen depends essentially on the nature of the characters under study. Among the many possible models of character evolution, those commonly used fall into two categories: Markovian and Brownian. Markovian models are appropriate for modeling the evolution of discrete characters, whereas Brownian ones are more appropriate for continuous characters.

5.2.1 Substitution Models: A Primer

The vast majority of models of evolution for discrete characters are Markovian, implying that:

• The number of character states is finite;
• The probabilities of transitions among these states are controlled by some parameters;
• The process is at equilibrium.

This can be applied to many kinds of data [110], but the recent rise of large-scale molecular databases has led to this approach being applied essentially to nucleotide (DNA) and protein sequences. An intermediate kind of data often considered for coding nucleotide sequences is based on codons.

A substitution model is a formulation of the instantaneous rates of change among the different states of the character. For instance, for a character with two states, A and B, where the rate of change (i.e., the probability of change from one state to another for a very short time) is symmetric and equal to 0.1, the rate matrix, usually denoted Q, is:

$$Q = \begin{bmatrix} -0.1 & 0.1 \\ 0.1 & -0.1 \end{bmatrix}. \tag{5.1}$$

The rows of Q correspond to the initial state, and its columns to the final one. The elements on the diagonal are set so that the sum of each row is zero. For an arbitrary time interval t, the probability matrix P is obtained by the matrix exponentiation of Q:

$$P = e^{tQ}. \tag{5.2}$$

The element $p_{ij}$ from the ith row and jth column of P is the probability of being in state j after time t given that the initial state was i. The probabilities in P take into account possible multiple changes (e.g., a change from A to B may be the result of A → B, or A → B → A → B, ...). The matrix exponentiation is usually calculated with an infinite sum:
$$e^{tQ} = I + tQ + \frac{(tQ)^2}{2!} + \frac{(tQ)^3}{3!} + \cdots \tag{5.3}$$

$$e^{tQ} = I + \sum_{i=1}^{\infty} \frac{(tQ)^i}{i!}. \tag{5.4}$$

In practice, an approximation is done. Several functions in R perform matrix exponentiation. We use mexp in the package rmutil:

    > library(rmutil)
    > Q <- matrix(c(-0.1, 0.1, 0.1, -0.1), 2)
    > Q
         [,1] [,2]
    [1,] -0.1  0.1
    [2,]  0.1 -0.1
    > mexp(Q)  # t = 1
               [,1]       [,2]
    [1,] 0.90936538 0.09063462
    [2,] 0.09063462 0.90936538
    > mexp(10*Q)  # t = 10
              [,1]      [,2]
    [1,] 0.5676676 0.4323324
    [2,] 0.4323324 0.5676676

We do indeed have probabilities because the rows sum to one. Note that Q is independent of time whereas P is not. Both calculated matrices are symmetric; they would be asymmetric if Q were.

When fitting a substitution model to some data, its parameter(s) will usually be unknown. For the hypothetical two-state character we write:

$$Q = \begin{bmatrix} \cdot & \alpha \\ \alpha & \cdot \end{bmatrix}, \tag{5.5}$$

where α is the parameter and the dots on the diagonal indicate that these values are set so that the rows sum to zero.

This methodology is generalized to DNA sequences (by assuming that Q is 4 × 4), to protein sequences (20 × 20), and codons (64 × 64). The substitution models differ in the way the rate matrix Q is modeled. We consider here in detail the case of DNA sequences because substitution models for this kind of data are implemented in several functions in ape.

For the simplest models of DNA substitution, it is possible to derive the transition probabilities (i.e., the elements of P) without matrix exponentiation: this is nicely explained by Felsenstein [39, p. 156]. In the following, each model is cited, the character code used in ape is given, and the model is briefly described.
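The relation between the infinite sum (5.3)–(5.4) and the probabilities printed above can be checked with base R alone. The sketch below is only an illustration: the function names expm_series and expm_eigen are hypothetical helpers written for this example (they are not part of rmutil or ape), the series is truncated at an arbitrary number of terms, and the eigendecomposition route assumes that Q is diagonalizable, which holds for the symmetric matrix used here.

```r
# Truncated power series: I + tQ + (tQ)^2/2! + ... + (tQ)^n/n!
expm_series <- function(Q, t = 1, n_terms = 30) {
  A <- t * Q
  term <- diag(nrow(Q))        # current term, starting with (tQ)^0 / 0! = I
  P <- term
  for (i in seq_len(n_terms)) {
    term <- term %*% A / i     # turns (tQ)^(i-1)/(i-1)! into (tQ)^i/i!
    P <- P + term
  }
  P
}

# Alternative: Q = V D V^{-1} implies exp(tQ) = V exp(tD) V^{-1}
expm_eigen <- function(Q, t = 1) {
  e <- eigen(Q)
  V <- e$vectors
  V %*% diag(exp(t * e$values), nrow = length(e$values)) %*% solve(V)
}

Q <- matrix(c(-0.1, 0.1, 0.1, -0.1), 2)

P1  <- expm_series(Q, t = 1)
P10 <- expm_series(Q, t = 10)
print(P1)
print(P10)

# Both routes agree, and each row of P sums to one
print(all.equal(P1, expm_eigen(Q, t = 1)))
print(rowSums(P10))
```

For this two-state model the diagonal element has the closed form (1 + e^{-0.2t})/2, which gives a quick check against the values printed by mexp in the text.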
Jukes and Cantor 1969 ("JC69")

This is the simplest model of DNA substitution [77]. The probability of change from one nucleotide to any other is the same. It is assumed that all four bases have the same frequencies (0.25). The rate matrix Q is:

$$\begin{array}{c|cccc}
  & A & G & C & T \\ \hline
A & \cdot & \alpha & \alpha & \alpha \\
G & \alpha & \cdot & \alpha & \alpha \\
C & \alpha & \alpha & \cdot & \alpha \\
T & \alpha & \alpha & \alpha & \cdot
\end{array}$$

As with the general case above, the rows correspond to the original state of the nucleotide, and the columns to the final state (the row and column labels are omitted in the following models). The overall rate of change in this model is thus 3α. The probability of change from one base to another during time t can easily be derived (see [39]):

$$p_{ab}(t) = \frac{1 - e^{-4\alpha t}}{4}, \qquad a \neq b, \tag{5.6}$$

where a and b are among A, G, C, and T. The expected proportion of sites that differ between two sequences is thus $3(1 - e^{-4\alpha t})/4$, because there are three different types of change. From this, it is straightforward to derive an estimate of the distance.

This model is available in dist.dna, mlphylo, and phymltest.

Kimura 1980 ("K80")

Because there are two kinds of bases with different chemical structures, purines (A and G) and pyrimidines (C and T), it is likely that the changes within and between these kinds are different. Kimura [81] developed a model whose rate matrix is:

$$\begin{bmatrix}
\cdot & \alpha & \beta & \beta \\
\alpha & \cdot & \beta & \beta \\
\beta & \beta & \cdot & \alpha \\
\beta & \beta & \alpha & \cdot
\end{bmatrix}.$$

A change within a type of base is called a transition and occurs at rate α; a change between types is called a transversion and occurs at rate β. The base frequencies are assumed to be equal.

This model is available in dist.dna, mlphylo, and phymltest.
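The JC69 and K80 rate matrices above, and the closed form (5.6), can be checked numerically. The following base R sketch rests on a few assumptions made for illustration: the helper names are hypothetical, the parameter values are arbitrary, the matrix exponential is computed by eigendecomposition rather than with a package, and the distance is taken to be the expected number of substitutions per site, 3αt, obtained by inverting the expected proportion of differing sites.

```r
bases <- c("A", "G", "C", "T")

# K80 rate matrix, rows and columns ordered A, G, C, T as in the text
k80_rate_matrix <- function(alpha, beta) {
  Q <- matrix(beta, 4, 4, dimnames = list(bases, bases))
  Q["A", "G"] <- Q["G", "A"] <- alpha   # transitions between purines
  Q["C", "T"] <- Q["T", "C"] <- alpha   # transitions between pyrimidines
  diag(Q) <- 0
  diag(Q) <- -rowSums(Q)                # each row must sum to zero
  Q
}

# JC69 is the special case of K80 where both rates are equal
jc69_rate_matrix <- function(alpha) k80_rate_matrix(alpha, alpha)

# exp(tQ) by eigendecomposition (adequate for these small symmetric matrices)
expm_eigen <- function(Q, t = 1) {
  e <- eigen(Q)
  e$vectors %*% diag(exp(t * e$values)) %*% solve(e$vectors)
}

alpha <- 0.1
t <- 2
P <- expm_eigen(jc69_rate_matrix(alpha), t)

# Off-diagonal elements should match equation (5.6)
p_formula <- (1 - exp(-4 * alpha * t)) / 4
print(c(numeric = P[1, 2], formula = p_formula))

# Inverting the expected proportion of differing sites p = 3(1 - e^{-4 alpha t})/4
# gives 3 * alpha * t = -(3/4) * log(1 - 4p/3)
jc69_distance <- function(p) -0.75 * log(1 - 4 * p / 3)
p_diff <- 3 * p_formula
print(c(distance = jc69_distance(p_diff), expected = 3 * alpha * t))
```

In practice p would be the observed proportion of differing sites between two aligned sequences; the round trip above only confirms that the inversion is consistent with (5.6).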
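Felsenstein 1981 ("F81")

Felsenstein [34] extended the JC69 model by relaxing the assumption of equal frequencies. Thus the rate parameters are proportional to the latter:

\[
\begin{bmatrix}
\cdot & \alpha\pi_G & \alpha\pi_C & \alpha\pi_T \\
\alpha\pi_A & \cdot & \alpha\pi_C & \alpha\pi_T \\
\alpha\pi_A & \alpha\pi_G & \cdot & \alpha\pi_T \\
\alpha\pi_A & \alpha\pi_G & \alpha\pi_C & \cdot
\end{bmatrix}.
\]

There are three additional parameters (the base frequencies, $\pi_A$, $\pi_G$, $\pi_C$, and $\pi_T$, sum to one, thus only three of them must be estimated) but they are usually estimated from the pooled sample of sequences. This model is available in dist.dna, mlphylo, and phymltest.

Kimura 1981 ("K81")

Kimura [82] generalized his model K80 by assuming that two kinds of transversions have different rates: A ↔ C and G ↔ T on one side, and A ↔ T and C ↔ G on the other.

\[
\begin{bmatrix}
\cdot & \alpha & \beta & \gamma \\
\alpha & \cdot & \gamma & \beta \\
\beta & \gamma & \cdot & \alpha \\
\gamma & \beta & \alpha & \cdot
\end{bmatrix}.
\]

This model is available in dist.dna.

Felsenstein 1984 ("F84")

This model can be viewed as a synthesis of K80 and F81: there are different rates for base transitions and transversions, and the base frequencies are not assumed to be equal. The rate matrix is:

\[
\begin{bmatrix}
\cdot & \pi_G(\alpha/\pi_R + \beta) & \beta\pi_C & \beta\pi_T \\
\pi_A(\alpha/\pi_R + \beta) & \cdot & \beta\pi_C & \beta\pi_T \\
\beta\pi_A & \beta\pi_G & \cdot & \pi_T(\alpha/\pi_Y + \beta) \\
\beta\pi_A & \beta\pi_G & \pi_C(\alpha/\pi_Y + \beta) & \cdot
\end{bmatrix},
\]

where $\pi_R = \pi_A + \pi_G$, and $\pi_Y = \pi_C + \pi_T$ (the proportions of purines and pyrimidines, respectively). Felsenstein and Churchill [40] gave formulae for the probability matrix and the distance. This model is available in dist.dna, mlphylo, and phymltest.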
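Hasegawa, Kishino, and Yano 1985 ("HKY85")

This model is very close in essence to the previous one but its parameterization is different [66]:

\[
\begin{bmatrix}
\cdot & \alpha\pi_G & \beta\pi_C & \beta\pi_T \\
\alpha\pi_A & \cdot & \beta\pi_C & \beta\pi_T \\
\beta\pi_A & \beta\pi_G & \cdot & \alpha\pi_T \\
\beta\pi_A & \beta\pi_G & \alpha\pi_C & \cdot
\end{bmatrix}.
\]

Due to some mathematical properties of this rate matrix, it does not seem possible to derive analytical formulae of the transition probabilities, and so for the distance as well [156]. This model is available in mlphylo and phymltest.

Tamura 1992 ("T92")

The model developed by Tamura [150] is a generalization of K80 that takes into account the content of G + C. The rate matrix is:

\[
\begin{bmatrix}
\cdot & \alpha\theta & \beta\theta & \beta(1-\theta) \\
\alpha(1-\theta) & \cdot & \beta\theta & \beta(1-\theta) \\
\beta(1-\theta) & \beta\theta & \cdot & \alpha(1-\theta) \\
\beta(1-\theta) & \beta\theta & \alpha\theta & \cdot
\end{bmatrix},
\]

where $\theta = \pi_G + \pi_C$. Tamura [150] gave formulae for the distance, and Galtier and Gouy [44] gave formulae for the transition probabilities. This model is available in dist.dna and mlphylo.

Tamura and Nei 1993 ("TN93")

Tamura and Nei [151] developed a model where both kinds of base transitions, A ↔ G and C ↔ T, have different rates $\alpha_R$ and $\alpha_Y$, respectively. The base frequencies may be unequal. All the above models can be seen as particular cases of the TN93 model. The rate matrix is:

\[
\begin{bmatrix}
\cdot & \pi_G(\alpha_R/\pi_R + \beta) & \beta\pi_C & \beta\pi_T \\
\pi_A(\alpha_R/\pi_R + \beta) & \cdot & \beta\pi_C & \beta\pi_T \\
\beta\pi_A & \beta\pi_G & \cdot & \pi_T(\alpha_Y/\pi_Y + \beta) \\
\beta\pi_A & \beta\pi_G & \pi_C(\alpha_Y/\pi_Y + \beta) & \cdot
\end{bmatrix}.
\]

Fixing $\alpha_R = \alpha_Y$ results in the F84 model, whereas fixing $\alpha_R/\alpha_Y = \pi_R/\pi_Y$ results in the HKY85 model [39]. This model is available in dist.dna, mlphylo, and phymltest.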
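The relation between TN93 and HKY85 stated above can be checked numerically. The sketch below builds the TN93 rate matrix in base R with the base order A, G, C, T; tn93_Q and its argument names are hypothetical helpers written for this illustration, not functions of ape, and the parameter values are arbitrary.

```r
# Build the TN93 rate matrix for base frequencies `freq` (order A, G, C, T).
# `tn93_Q` is an illustrative helper, not an ape function.
tn93_Q <- function(alphaR, alphaY, beta, freq) {
  stopifnot(length(freq) == 4, isTRUE(all.equal(sum(freq), 1)))
  names(freq) <- c("A", "G", "C", "T")
  piR <- freq[["A"]] + freq[["G"]]  # proportion of purines
  piY <- freq[["C"]] + freq[["T"]]  # proportion of pyrimidines
  # Start with the transversion rate everywhere: beta * frequency of the
  # target base (each row repeats the vector beta * freq).
  Q <- matrix(beta * freq, 4, 4, byrow = TRUE,
              dimnames = list(names(freq), names(freq)))
  # Overwrite the four transition entries.
  Q["A", "G"] <- freq[["G"]] * (alphaR / piR + beta)
  Q["G", "A"] <- freq[["A"]] * (alphaR / piR + beta)
  Q["C", "T"] <- freq[["T"]] * (alphaY / piY + beta)
  Q["T", "C"] <- freq[["C"]] * (alphaY / piY + beta)
  # Diagonal set so that each row sums to zero.
  diag(Q) <- 0
  diag(Q) <- -rowSums(Q)
  Q
}

pi_hat <- c(0.1, 0.2, 0.3, 0.4)   # arbitrary frequencies: piR = 0.3, piY = 0.7
# alphaR / alphaY = piR / piY, the HKY85 constraint
Q_tn93 <- tn93_Q(alphaR = 0.3, alphaY = 0.7, beta = 0.5, freq = pi_hat)

# Under HKY85 both kinds of transitions share one coefficient alpha:
kappa_R <- Q_tn93["A", "G"] / pi_hat[2]
kappa_Y <- Q_tn93["C", "T"] / pi_hat[4]
```

With these values kappa_R and kappa_Y coincide, so the matrix has the HKY85 form; choosing alphaR and alphaY freely breaks the equality.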
The “General Time-Reversible” Model ("GTR")

This is the most general time-reversible model. All substitution rates are different, and the base frequencies may be unequal [87]. The rate matrix is:

\[
\begin{bmatrix}
\cdot & \alpha\pi_G & \beta\pi_C & \gamma\pi_T \\
\alpha\pi_A & \cdot & \delta\pi_C & \epsilon\pi_T \\
\beta\pi_A & \delta\pi_G & \cdot & \zeta\pi_T \\
\gamma\pi_A & \epsilon\pi_G & \zeta\pi_C & \cdot
\end{bmatrix}.
\]

There are no analytical formulae for the transition probabilities, nor for the distance [39]. This model is available in mlphylo and phymltest.

Galtier and Gouy 1995 ("GG95")

Galtier and Gouy [43] developed a nonequilibrium model where the G + C content is allowed to change through time. Sequences are assumed to evolve on each lineage depending on its G + C content. This is estimated from the G + C content of the recent species or populations. It is thus necessary to estimate ancestral G + C contents. The rate matrices for each lineage are similar to the one for the T92 model except that $\theta$ may vary. This model is available in dist.dna.

5.2.2 Estimation with Molecular Sequences

If the probabilities of change along a tree are known (using one of the models described in the previous section), the likelihood of the tree can be computed. However, the states of the data on the nodes of the tree are unknown, and it is necessary to sum the probabilities for all possible states on the nodes which may involve a very large number of terms even for a moderate data set. Felsenstein [34] presented an algorithm that allows considerable time saving in this computation. The idea is to compute successively the likelihoods of each character state at each node by summing the probabilities giving the likelihoods of the descendants (hence the name “pruning algorithm”). Denote as $M$ the number of states (e.g., $M = 4$ for DNA data), $p_{ab}(t)$ the probability of change from state $a$ to state $b$ during time $t$. Then the likelihood of state $a$ at node $z$, given the likelihood of its descendants $x$ and $y$ (assuming a binary tree) and the branch lengths $t_{xz}$ and $t_{yz}$ is:

\[
L_{az} = \left[\sum_{b=1}^{M} p_{ab}(t_{xz}) L_{bx}\right] \left[\sum_{b=1}^{M} p_{ab}(t_{yz}) L_{by}\right]. \tag{5.7}
\]

If $x$ is a tip, then $L_{bx} = 1$ if state $b$ is observed, 0 otherwise.

Once this computation has been applied to all nodes of the tree, the likelihood of the character for the tree is obtained by:
\[
L = \sum_{a=1}^{M} \pi_a L_{ar}, \tag{5.8}
\]

where $\pi_a$ is the frequency of the $a$th state, and $r$ is the root of the tree. The root can actually be placed on any internal node of the tree because the latter is unrooted [34]. The likelihood of the full data set is:

\[
L = \prod_{i=1}^{N} \sum_{a=1}^{M} \pi_a L_{air}, \tag{5.9}
\]

where $N$ is the number of characters. Taking the logarithm of this expression leads to:

\[
\ln L = \sum_{i=1}^{N} \ln \sum_{a=1}^{M} \pi_a L_{air}. \tag{5.10}
\]

With molecular sequences, a further layer of complexity is added by considering heterogeneity among characters (sites). Two types of heterogeneity are often considered: partitions and mixtures [123, 158]. With partitions, the different characters are assigned in different categories, whereas with mixtures we assume that there are different categories, but we do not know which sites belong to which categories. Denote as $f_k$ the frequency of the $k$th category in the mixture (with $\sum_k f_k = 1$), then (5.7) would become:

\[
L_{aiz} = \sum_{k} f_k \left[\sum_{b=1}^{M} p^{k}_{ab}(t_{xz}) L_{bix}\right] \left[\sum_{b=1}^{M} p^{k}_{ab}(t_{yz}) L_{biy}\right]. \tag{5.11}
\]

The exponent $k$ of $p$ indicates that these probabilities depend on the categories of the mixture. The presence of partitions is ignored in this formulation, but they can be taken into account easily because the log-likelihood is summed over all sites: the full log-likelihood would become a sum of individual log-likelihoods similar to (5.10) for each partition.

The partitions can have different models of evolution and different mixtures as well. On the other hand, the models of evolution and/or the mixtures can be constrained to be the same across partitions (possibly with different parameter values). One can also imagine nested partitions with different shared model components, for instance, four partitions each with different mixtures, and a model of substitutions common to two partitions.

This general framework is implemented in ape. This covers many models of molecular evolution currently used in phylogenetics. Among those not included in this framework are the nonequilibrium models where some parameters are assumed to change over time (typically the nucleotide frequencies). Not included as well are the models with a nonfinite number of states, such as the number of repeats in microsatellites, and the models of insertions–deletions (indels).
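The pruning recursion (5.7) and the root summation (5.8) can be sketched for a single site in a few lines of base R. This is only an illustration under stated assumptions: the tree, its branch lengths, and the observed states are made up, the JC69 model with equal base frequencies is used, and the function names are hypothetical.

```r
# JC69 transition probability matrix for a branch of length t (rate alpha).
P_jc69 <- function(t, alpha = 1) {
  p <- (1 - exp(-4 * alpha * t)) / 4
  P <- matrix(p, 4, 4)
  diag(P) <- 1 - 3 * p
  P
}

# Conditional likelihoods at a tip: 1 for the observed state, 0 otherwise.
tip_lik <- function(state, states = c("A", "G", "C", "T")) {
  as.numeric(states == state)
}

# Equation (5.7): combine the two descendants x and y of node z.
# (P %*% L)[a] is the sum over b of p_ab(t) * L_b.
node_lik <- function(Lx, tx, Ly, ty, alpha = 1) {
  as.vector(P_jc69(tx, alpha) %*% Lx) * as.vector(P_jc69(ty, alpha) %*% Ly)
}

# Hypothetical tree ((t1:0.1, t2:0.2):0.05, t3:0.3); one site observed as
# A in t1, A in t2, and G in t3.
L_n1 <- node_lik(tip_lik("A"), 0.1, tip_lik("A"), 0.2)
L_root <- node_lik(L_n1, 0.05, tip_lik("G"), 0.3)

# Equation (5.8) with equal base frequencies.
lik <- sum(0.25 * L_root)
```

For a full alignment, the same computation would be repeated for each site and the logarithms summed as in (5.10).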
The user interface for defining a model of evolution is one of the functions DNAmodel, AAmodel, or CODONmodel, depending on the kind of data analyzed. These functions create an object whose class has the same name. Let us focus on DNA sequence data: the two other functions work in essentially the same way.

DNAmodel has six arguments that define three aspects of a model of DNA evolution: the substitution model, the Γ-variation among sites, and the proportion of invariant sites. A partition can be defined for each of these aspects.

The option part.model defines the partitions used for the substitution models: it needs a single vector of integers that specifies the partition each site belongs to; this vector is recycled if necessary. For instance, part.model = c(1, 1, 2) is used for a coding sequence in which the third codon position will be in a different partition. Another choice, for two concatenated sequences of, say, 800 and 900 nucleotides, part.model = c(rep(1, 800), rep(2, 900)), will specify a different partition for each sequence. If more than one partition is specified, it is possible to use different substitution models by giving a vector of models to model; for instance, model = c("K80", "JC69") means using Kimura’s 1980 model for the first partition and Jukes–Cantor’s one for the second (in other words, the transition/transversion ratio will be allowed to vary only in the first partition).

The intersite variation is specified in a way similar to the substitution models. Two arguments can be used: part.gamma which is used in the same way as part.model, and ncat which specifies the number of categories of the discretized Γ distribution [157] (1 by default meaning that there is no intersite variation).

The specification of invariant sites follows the same logic with two arguments: part.invar and invar. The latter is a logical vector giving whether there are invariant sites for each partition.

We have just seen that partitions are specified separately for the three components of the model. The partitions that are actually used (i.e., the sets of nucleotides with the same parameters of evolution) when fitting the model specified by DNAmodel result from crossing over all three components. This allows us to formulate a large number of models. To see how this works we consider a simple example with two partitions for the substitution model and two partitions for the intersite variation. If both partitions coincide the resulting model obviously has two partitions:

    Sequence             ACCT...
    Sites                1-400        401-800
    Substitution model   K80          K80
    Γ-variation          Γ            Γ
    Parameters           κ1, α1       κ2, α2

where κ1 and κ2 are the transition/transversion ratios, and α1 and α2 are the shape parameters for the Γ distribution of intersite variation. The code to specify this model is:

    DNAmodel(part.model = c(rep(1, 400), rep(2, 400)),
             model = "K80",
             part.gamma = c(rep(1, 400), rep(2, 400)))

On the other hand, if they do not coincide the model has three partitions resulting from crossing over the two specified partitions. This allows us to specify parameters that are shared across several partitions.

    Sequence             ACCT...
    Sites                1-200        201-600      601-800
    Substitution model   K80          K80 (sites 201-800)
    Γ-variation          Γ (sites 1-600)           Γ
    Parameters           κ1, α1       κ2, α1       κ2, α2

The code is now:

    DNAmodel(part.model = c(rep(1, 200), rep(2, 600)),
             model = "K80",
             part.gamma = c(rep(1, 600), rep(2, 200)))

Of course, the interest of DNAmodel is to let the user formulate some models that make sense biologically for the particular data at hand. A model of interest for a sequence could be:

    DNAmodel(part.model = c(1, 1, 2), model = c("K80", "JC69"),
             part.gamma = c(1, 1, 2), ncat = c(4, 1))

This defines two partitions with respect to the codon positions: in the first one, Kimura’s two-parameter model is assumed with an intersite variation following a Γ-distribution with four categories, and in the second one Jukes–Cantor’s model is assumed with no intersite variation (because one category has been assumed for the second partition of part.gamma). This model seems biologically reasonable because mutations on the third codon position are likely to be less constrained than on the first and second ones, and thus transitions and transversions may occur at equal rates. Because mutations on the first and second codon positions have greater structural impact on the protein, it is likely that they vary along the sequence. The above model assumes that the base frequencies are balanced: to relax this assumption, the model can be modified with:

    DNAmodel(part.model = c(1, 1, 2), model = c("F84", "F81"),
             part.gamma = c(1, 1, 2), ncat = c(4, 1))
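The crossing of separately specified partitions described above can be illustrated without DNAmodel itself. The following base R sketch labels each site by its combination of substitution-model and Γ partitions; it is only meant to show which effective partitions result, and is not a description of how ape performs this step internally.

```r
# The two partition vectors of the three-partition example above.
part.model <- c(rep(1, 200), rep(2, 600))
part.gamma <- c(rep(1, 600), rep(2, 200))

# interaction() labels each site by its pair (model partition, Γ partition);
# drop = TRUE discards combinations that never occur.
crossed <- interaction(part.model, part.gamma, drop = TRUE)

# Number of sites in each effective partition.
sizes <- table(crossed)

# First and last site of each effective partition.
ranges <- tapply(seq_along(crossed), crossed, range)
```

Here sizes has three entries, labelled 1.1, 2.1, and 2.2, which correspond to the parameter sets (κ1, α1), (κ2, α1), and (κ2, α2) of the example.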
All the options of DNAmodel have default values which are:

    DNAmodel(part.model = 1, model = "K80", part.gamma = 1,
             ncat = 1, part.invar = 1, invar = FALSE)

This implies that calling DNAmodel() generates a model with Kimura’s two-parameter model for all sites, with no intersite variation, and no invariants.

When flexibility in model-building is possible, it is critical to assess the relevance of the models with empirical data [13, 14]. This is possible in the maximum likelihood framework, and this has been discussed repeatedly in the phylogenetic literature [71, 121]. This is dealt with in the next two sections.

5.2.3 Finding the Maximum Likelihood Tree

Once a model of sequence evolution has been chosen, its parameters must be estimated. In the maximum likelihood framework, this involves finding the values of the parameters that maximize (5.10) for a given data set. A difficulty comes from the fact that there are two kinds of parameters that need to be estimated: purely numeric parameters (branch lengths, substitution parameters, shape parameter of the Γ-distribution of intersite variation, etc.) and the topology of the tree. Maximum likelihood methods for tree estimation use numerical methods to estimate the first kind of parameter [40, 58, 159]. This is relatively straightforward because computer scientists have devoted a lot of effort to creating numerical methods that maximize complex functions with possibly many variables [e.g., 6, 139]. On the other hand, finding the topology that maximizes the likelihood is a much more difficult task. Several algorithms (sometimes called heuristics¹) have been proposed for exploring the tree space.

¹ This redefinition is unfortunate because “heuristics” has a more useful meaning in epistemology.

ape has the function mlphylo that performs maximum likelihood estimation of phylogeny using molecular sequences. Its interface is:

    mlphylo(model = DNAmodel(), x, phy, search.tree = FALSE)

where x is a DNA sequence data set, phy is a phylogenetic tree (as an object of class "phylo"), and search.tree specifies whether to search the tree space for the best topology (the default is only to estimate the branch lengths and other parameters). If the option model is omitted, Kimura’s [81] model is used.

This function can be used to estimate the parameters of a relatively complex model of DNA evolution for a given phylogeny (leaving the default for search.tree). If the tree space is searched (i.e., search.tree = TRUE), a method close to that of Guindon and Gascuel [58] is used. This involves starting from an initial tree (e.g., using nj), and then rearranging its topology with nearest-neighbor interchanges (NNI). In Guindon and Gascuel’s algorithm, NNIs are selectively done under some optimization criteria, leading to a very fast method of tree space search.

mlphylo returns an object of class "phylo" which is the estimated tree, with additional attributes. There are several method functions to extract this information: logLik returns the log-likelihood, AIC the Akaike information criterion, and summary prints details on the estimated tree and parameters (they are all generic).

5.2.4 DNA Mining with PHYML

The previous section explains how to define and fit a variety of molecular evolution models. How to select the appropriate model(s) for parameter estimation is an issue that has attracted a lot of attention and debate among statisticians [13, 15, 21, 99]. The importance of model selection in a likelihood framework has been emphasized repeatedly in the phylogenetic literature [101, 122, 120]. Posada and Crandall [121] developed a computer program, to be used with the program PAUP*, that fits a series of DNA evolution models to a given data set. This program is supposed to help in selecting a substitution model for further analyses.²

² MODELTEST has had remarkable success: the paper published in Bioinformatics was cited 3068 times (source: Web of Science, January 23, 2006).

In order to provide a similar functionality, but with a free phylogeny estimation program, ape has the function phymltest which, instead of PAUP*, uses PHYML developed by Guindon and Gascuel [58]. Another difference is that phymltest lets PHYML search for the best tree for all fitted models. All substitution models available in PHYML are used; these are: JC69, K80, F81, F84, HKY85, TN93, and GTR. Additionally, models with(out) invariant sites and/or intersite variation (with the usual Γ distribution) are used. This results in 28 fitted models. The interface is:

    phymltest(seqfile, format = "interleaved", itree = NULL,
              exclude = NULL, execname, path2exec = NULL)

where seqfile is the name of the file with the sequences (given as a character). The other arguments have default values, except execname, the name of the PHYML executable, which must be specified as a character string. Under Windows, execname may be left missing if the PHYML executable file is named ‘phymlwin32.exe’ (its original name in PHYML’s distribution).

Some care must be taken to set correctly the three different paths involved here: the path to PHYML’s executable, the path to the sequence file, and the path to R’s working directory. Here are two possible uses under Linux and Windows, respectively:

    phymltest("/home/paradis/data/seq.txt",
              execname = "phyml_linux",
              path2exec = "/usr/local/bin")
    phymltest("D:/data/seq.txt", path2exec = "D:/phyml")

If R returns an error message because of a problem in finding one of the files, it might be better to move all files in the same directory, say ‘/home/paradis/phyml’ or ‘D:/phyml’, and set the latter as R’s working directory:

    # Linux:
    setwd("/home/paradis/phyml")
    phymltest("seq.txt", execname = "phyml_linux")
    # Windows:
    setwd("D:/phyml")
    phymltest("seq.txt")

phymltest returns an object of class "phymltest" that has three methods: the print method prints a table of all fitted models with the number of free parameters, the values of the log-likelihood, and the Akaike Information Criterion (AIC); the summary method computes and prints all possible likelihood ratio tests (LRTs) between pairs of nested models; and the plot method plots, on a vertical axis, all AIC values with an indication of the corresponding model (see Section 5.5 for an example).

5.3 Bootstrap Methods and Distances Between Trees

The use of the bootstrap has enjoyed great success in phylogenetic analyses [35]. The idea of the bootstrap can be sketched as follows: suppose we are interested in quantifying the confidence level in a parameter estimate given some data, but we cannot apply the methods based on distributional theory of this parameter. Then we could resample the sample at hand many times, mimicking the process of sampling the real population several times. The variation in the estimated parameter from the “bootstrap” samples is a measure of the confidence level in this estimate [29]. The idea is simple, intuitive, and elegant, but, in some situations, requires intensive computations [32].

The application of the bootstrap in phylogeny estimation is almost as simple: estimate a tree with a given method, resample the original data (the matrix taxa × characters) a large number of times, and analyze these “bootstrap” samples with the same method, and calculate the number of times the clades observed in the estimated tree appear in the “bootstrap” ones.

The application of the bootstrap to assess confidence levels in phylogenetic estimation has been criticized, but Efron, Halloran, and Holmes [31] showed that this was due to confusion in the interpretation of the original bootstrap method by Felsenstein [35]. Efron et al. also proposed another way to compute the bootstrap values for hypothesis testing rather than assessing confidence levels [31].

In this section, we examine the different ways of resampling phylogenetic data, comparing (possibly a large number of) phylogenetic trees, and computing bootstrap values.

5.3.1 Resampling Phylogenetic Data

R has a powerful function, sample, that can be used to create a bootstrap sample from a data set: this function returns a sample, by default without replacement, of the vector given as argument. If the option replace = TRUE is used, then sampling is done with replacement which is clearly what is needed for a bootstrap sample. Below is a simple example with a vector x containing 10 values 1, 2, ..., 10:

    > x <- 1:10
    > sample(x)
     [1]  9  8  6  1 10  7  5  4  3  2
    > sample(x, replace = TRUE)
     [1]  7  5  2  4 10  6  2  1  2  2

Note that sample(x) returns a (random) permutation of the data. We can also give a single integer value to sample, say 10, which will then return a sample of integers from 1 to 10.

With phylogenetic data we are mostly interested in resampling the columns of the matrix taxa × characters (where taxa are the rows, and characters the columns). If this matrix is called X, then one can simply do:

    X[, sample(ncol(X), replace = TRUE)]

Note the presence of the comma just after the left bracket which means that all rows of X will be selected (see p. 15). Here is an example of how this could be used:

    > x <- scan(what = "")
    1: a a c t t a a c t t c a c c t
    16:
    Read 15 items
    > X <- matrix(x, 3, 5, byrow = TRUE)
    > X
         [,1] [,2] [,3] [,4] [,5]
    [1,] "a"  "a"  "c"  "t"  "t"
    [2,] "a"  "a"  "c"  "t"  "t"
    [3,] "c"  "a"  "c"  "c"  "t"
    > X[, sample(ncol(X), replace = TRUE)]
         [,1] [,2] [,3] [,4] [,5]
    [1,] "a"  "c"  "c"  "a"  "a"
    [2,] "a"  "c"  "c"  "a"  "a"
    [3,] "a"  "c"  "c"  "a"  "c"

It happens sometimes that the columns of a matrix are affected with weights, for instance, because the same values have been observed several times for all taxa [31, 84]. This may be a useful way to reduce the size of the data matrix, particularly if few sites are polymorphic. In these cases, resampling must take these weights into account. Suppose each column of X is associated with a weight stored in a vector w (length(w) is equal to ncol(X)), then a bootstrap sample is obtained using the option prob of sample:

    X[, sample(ncol(X), replace = TRUE, prob = w)]

The values passed to prob need not sum to 1 because they are used as relative probability weights. If the values in w are integer weights, one may need to use the option size to produce a sample of the appropriate size:

    X[, sample(ncol(X), replace = TRUE, prob = w, size = sum(w))]

An issue in resampling phylogenetic data is that the columns may not be independent, particularly in the case of molecular sequences. A solution is to sample the sites by groups (or blocks) rather than individually. There are several ways to do this in R. One is to build blocks of sites using the function splitseq in seqinr, sample among these blocks, and reconstitute the sequence:

    > library(seqinr)
    > x <- scan(what = "")
    1: a a a c c c g g g t t t
    13:
    Read 12 items
    > x
     [1] "a" "a" "a" "c" "c" "c" "g" "g" "g" "t" "t" "t"
    > x.codon <- splitseq(x)
    > x.codon
    [1] "aaa" "ccc" "ggg" "ttt"
    > x.boot <- sample(x.codon, replace = TRUE)
    > x.boot
    [1] "ccc" "aaa" "ttt" "ggg"
    > s2c(c2s(x.boot))
     [1] "c" "c" "c" "a" "a" "a" "t" "t" "t" "g" "g" "g"

The length of the blocks sampled may be altered with the option word of splitseq (which is 3 by default):

    > s2c(c2s(sample(splitseq(x, word = 2), replace = TRUE)))
     [1] "t" "t" "a" "a" "g" "t" "g" "g" "g" "g" "a" "c"
A more general solution to this problem is to sample the indices of the vector instead of the vector itself. Let us consider the same case of sampling blocks of three nucleotides in the vector x. First, build a vector with the indices 3, 6, ...:

    > block <- 3
    > i <- seq(block, length(x), block)
    > i
    [1]  3  6  9 12

Then, sample this vector i as before:

    > i.boot <- sample(i, replace = TRUE)
    > i.boot
    [1] 12  6 12  9

What we want in fact is a vector with the values 10, 11, 12, 4, 5, 6, 10, 11, 12, 7, 8, and 9. The pattern is clear: the 3rd, 6th, 9th, and 12th values are those in i.boot, the 2nd, 5th, 8th, and 11th ones can be obtained with i.boot - 1, and the 1st, 4th, 7th, and 10th ones can be obtained with i.boot - 2. We first create a vector of the appropriate length, and then feed in the values with a loop:

    > boot.ind <- numeric(length(x))
    > boot.ind[i] <- i.boot
    > for (j in 1:(block - 1))
          boot.ind[i - j] <- i.boot - j
    > boot.ind
     [1] 10 11 12  4  5  6 10 11 12  7  8  9

The bootstrap sample is finally obtained with:

    > x[boot.ind]
     [1] "t" "t" "t" "c" "c" "c" "t" "t" "t" "g" "g" "g"

Note that we did not use the value of block (3) or length(x) (12) in the above commands, so they can be used in different situations. They also can be used to resample blocks of columns of a data matrix: in this case it is necessary to replace length(x) by ncol(x), and the final command by x[, boot.ind].

Because in most cases, a large number of bootstrap samples will be needed, it is useful to include the appropriate sampling commands in a loop and/or a function. This is what is done by the function boot.phylo described below.

5.3.2 Bipartitions and Computing Bootstrap Values

Once bootstrap samples and trees have been obtained, it is necessary to summarize the information from them. ape provides several functions for this task depending on the approach taken.
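The index-based block resampling of Section 5.3.1 can be wrapped in a small function so that it is easy to call repeatedly. The version below is a sketch in base R; boot_block_index is a hypothetical name, blocks are identified by their first index rather than their last, and trailing elements that do not fill a complete block are ignored.

```r
# Return bootstrap indices for n sites resampled in contiguous blocks.
# `boot_block_index` is an illustrative helper, not a function of ape.
boot_block_index <- function(n, block = 1) {
  stopifnot(block >= 1, block <= n)
  starts <- seq(1, n - block + 1, by = block)  # first index of each block
  # Sample positions in `starts` (sample.int avoids the special behaviour
  # of sample() when given a single number).
  picked <- starts[sample.int(length(starts), replace = TRUE)]
  # Expand each sampled start into the indices of its whole block.
  as.vector(outer(0:(block - 1), picked, "+"))
}

set.seed(1)
x <- c("a", "a", "a", "c", "c", "c", "g", "g", "g", "t", "t", "t")
ind <- boot_block_index(length(x), block = 3)
x.boot <- x[ind]

# The same indices can resample blocks of columns of a data matrix.
X <- matrix(x, 2, 12, byrow = TRUE)
X.boot <- X[, boot_block_index(ncol(X), block = 3)]
```

With block = 1 the function reduces to ordinary site-level resampling, equivalent to sample(n, replace = TRUE).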
A bipartition is made with two subsets of the tips of a tree as defined by an internal branch. prop.part takes as its argument a list of trees and returns an object of class "prop.part" which is a list of all observed bipartitions together with their frequencies. There are print and summary methods for this class; the latter prints only the frequencies. Here is the result with a four-taxa tree:

    > tr <- read.tree(text = "((a,(b,c)),d);")
    > prop.part(tr)
    ==> 1 time(s):[1] a b c d
    ==> 1 time(s):[1] a b c
    ==> 1 time(s):[1] b c

Instead of a list of bipartitions indexed to the internal branches, prop.part returns a list indexed to the numbers of the nodes, and gives the tips that are descendants of the corresponding node: thus the first vector in the list includes all tips because the first node is the root. It is then straightforward to get the bipartitions. The following code prints them for an object named Y:

    for (i in 2:length(Y)) {
        cat("Internal branch", i - 1, "\n")
        print(Y[[i]], quote = FALSE)
        cat("vs.\n")
        print(Y[[1]][!(Y[[1]] %in% Y[[i]])], quote = FALSE)
        cat("\n")
    }

prop.clades takes two arguments: a tree (as a "phylo" object), and either a list of trees, or a list of bipartitions as returned by prop.part. In the latter case, the list of bipartitions must be named explicitly (e.g., prop.clades(tr, part = list.part)). This function returns a numeric vector with, for each clade in the tree given as first argument, the number of times it was observed in the other trees or bipartitions. For instance, we have the obvious following result:

    > prop.clades(tr, tr)
    [1] 1 1 1

Like the previous functions, the results are indexed according to the node numbers.

Note that both prop.part and prop.clades do not require that all trees analyzed have the same tips (as identified by the labels). This may give undesirable results with prop.part, but this may be useful in some situations, particularly with prop.clades, because the “support” values may come from a sample of trees with a larger number of tips.

Using the two functions just described, bootstrap samples obtained as described in the previous section, and the appropriate function(s) for phylogeny
estimation, one can perform the bootstrap for the estimated phylogeny in a straightforward way using basic programming techniques. However, to do such an analysis directly, the function boot.phylo can be used instead. Its interface is:

    boot.phylo(phy, x, FUN, B = 100, block = 1)

with the following arguments:

phy     an object of class "phylo" which is the estimated tree;
x       the original data matrix (taxa as rows and characters as columns);
FUN     the function used to estimate phy from x. Note that if the tree was estimated with a distance method, this must be specified as something such as:
            FUN = function(xx) nj(dist.dna(xx))
        or:
            FUN = function(xx) nj(dist.dna(xx, "TN93"))
B       the number of bootstrap replicates;
block   the size of the “block” of columns, that is, the number of columns that are sampled together during the bootstrap sampling process (e.g., if block = 2, columns 1 and 2 are sampled together, the same for columns 3 and 4, 5 and 6, and so on; see above).

boot.phylo returns exactly the same vector as prop.clades. The bootstrap trees generated by this function are not saved, and so cannot be examined or further analyzed, for instance, to perform the two-level bootstrap procedure developed by Efron et al. [31]. This can be circumvented by doing the bootstrap samples beforehand. A typical program to get a list of bootstrap trees could be:

    B <- 100
    btr <- list()
    length(btr) <- B
    for (i in 1:B)
        btr[[i]] <- nj(dist.dna(x[, sample(ncol(x), replace = TRUE)]))

Then, if tr.est is the estimated tree, doing:

    prop.clades(tr.est, btr)

or:

    pp <- prop.part(btr)
    prop.clades(tr.est, part = pp)

will give the same results as using boot.phylo (except for the differences due to random sampling!)
5.3.3 Distances Between Trees

The idea of distances between trees is somehow related to the bootstrap because this requires summarizing and quantifying the variation in topology from different trees. Several ways to compute these distances have been proposed in the literature [39, Chap. 30]. Two of them are available in the function dist.topo.

Penny and Hendy [118] proposed measuring the distance between two trees as twice the number of internal branches that differ in their bipartitions. Rzhetsky and Nei [133] proposed a modification of this distance to take multichotomies into account: this is the default method in dist.topo. For a trivial example:

    > tr <- read.tree(text = "((a,b),(c,d));")
    > tb <- read.tree(text = "((a,d),(c,b));")
    > dist.topo(tr, tb)
    [1] 2
    > dist.topo(tr, tr)
    [1] 0

Billera, Holmes, and Vogtmann [9] developed a more elaborate distance based on the concept of tree space. This space is actually a cube complex because it is made up of cubes that share certain faces. Two trees with the same topology lie in the same cube of dimension n − 2 (n being the number of tips). If they do not have the same topology they will be in two distinct cubes. However, these cubes meet at the origin where the internal branches that are different between the trees are equal to zero. Thus it is possible to define a geometric distance for different topologies. This is computed with the option method of dist.topo:

    > tr <- rtree(10)
    > trb <- rtree(10)
    > dist.topo(tr, trb)
    [1] 12
    > dist.topo(tr, trb, method = "BHV01")
    [1] 3.455182
    > dist.topo(tr, tr, method = "BHV01")
    [1] 0

5.3.4 Consensus Trees

Consensus trees are an interesting way to summarize a set of trees: if they are dichotomous, the clades not observed in all (strict consensus) or in the majority (majority-rule consensus) of the trees will be collapsed as multichotomies.

The function consensus returns the consensus from a list of trees given in the same way as for prop.part or prop.clades. There is one option, p,
which specifies the threshold, as a real between 0.5 and 1, of the proportion of the bipartitions for their inclusion in the consensus tree. If p = 1 (the default), then the strict consensus tree is returned, whereas p = 0.5 returns the majority-rule consensus tree. This corresponds to the parameter $l$ of the $M_l$ consensus methods in [39].

5.4 Molecular Dating

Some parameters are confounded in phylogenetic models (branch lengths and substitution rates), therefore it is not possible to estimate branch lengths in units that are proportional to time. This must be done using additional assumptions on rate variations. Sanderson [135, 136] proposed two approaches for the estimation of dates using molecular phylogenies. These are implemented in ape with a set of functions that are explained below.

Nonparametric rate smoothing (NPRS) assumes that each branch of the tree has its own rate, but these rates change smoothly between connected branches [135]. Given a tree with estimated branch lengths in terms of number of substitutions, it is possible to estimate the dates of the nodes by minimizing the changes in rates from one branch to another. Practically this is done by minimizing the function:

\[
\sum |\hat{r}_k - \hat{r}_j|^p, \tag{5.12}
\]

where $\hat{r}$ is the estimated absolute rate, $k$ and $j$ are two nodes of the same branch, and $p$ is an exponent (usually 2).

The function ratogram computes the absolute rates for a tree with branch lengths using the NPRS method. By default the age of the root is one, but this can be changed with the option scale. The function chronogram computes the ages of the nodes with the same method. Its options are the same as for ratogram.

In order to make a trade-off between nonparametric and parametric methods, Sanderson [136] proposed to modify his method by using a semiparametric approach based on a penalized likelihood. The latter (denoted Ψ) is made of the likelihood of the “saturated” model (the one that assumes one rate for each branch of the tree) minus a roughness penalty (denoted Φ) which is similar to (5.12) multiplied by a smoothing parameter λ:

\[
\Psi = \ln L - \lambda\Phi, \tag{5.13}
\]
\[
L = \prod \frac{r_k^{x} \exp(-r_k)}{x!}, \tag{5.14}
\]

with $x$ being the number of substitutions observed on a branch. The product of the likelihood function $L$ is made over all branches of the tree. If λ = 0 then the above model is the (saturated) model with one distinct rate for each branch. If λ = +∞, then the model converges to a clocklike model with the same rate for all branches. In order to choose an optimal value for λ, Sanderson [136] suggested a cross-validation technique where each terminal branch is removed from the data and then its length is predicted from the remaining data. A different criterion is used here:

\[
D_i^2 = \sum_{j=1}^{n-2} \frac{(\hat{t}_j - \hat{t}_j^{-i})^2}{\hat{t}_j}, \tag{5.15}
\]

where $\hat{t}_j$ is the estimated date for node $j$ with the full data, and $\hat{t}_j^{-i}$ is the one estimated after removing tip $i$. This criterion is easier to calculate than Sanderson’s [136].

The penalized likelihood method is implemented in the function chronopl; its interface is:

    chronopl(phy, lambda, node.age = NULL, nodes = NULL,
             CV = FALSE)

where phy is an object of class "phylo" with branch lengths giving the number of substitutions (or its expectation), lambda is the smoothing parameter λ, node.age is a numeric vector giving the dates that are known, nodes gives the numbers of the nodes whose dates are known, and CV is a logical specifying whether to do the cross-validation. This function returns a tree with branch lengths proportional to time (i.e., a chronogram) with attributes rates (the estimated absolute rates, $\hat{r}$), and ploglik (the penalized likelihood). If CV = TRUE, an additional attribute D2 is returned with the value calculated with (5.15) for each tip.

The cross-validation may be done for different values of λ in a straightforward way, for instance, for λ = 0.1, 1, 10, ..., 10^6:

    l <- 10^(-1:6)
    cv <- numeric(length(l))
    for (i in 1:length(l))
        cv[i] <- sum(attr(chronopl(phy, lambda = l[i], CV = TRUE), "D2"))
    plot(l, cv)

Sanderson suggested selecting the value of λ that minimizes the cross-validation criterion. If CV = TRUE, chronopl returns a value $D_i^2$ for each tip, so it is possible to examine which observations are particularly influential, for instance with:

    chr <- chronopl(phy = phy.est, lambda = 1, CV = TRUE)
    plot(attr(chr, "D2"), type = "l")
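The cross-validation criterion (5.15) is simple enough to be written directly. The function below is a sketch in base R of the formula only: cv_D2 is a hypothetical name, the two vectors of dates are assumed to refer to the same nodes in the same order, and the numbers used are made up for illustration.

```r
# Cross-validation criterion (5.15) for one removed tip i.
# t_full:  estimated dates of the nodes with the full data
# t_minus: estimated dates of the same nodes after removing tip i
cv_D2 <- function(t_full, t_minus) {
  stopifnot(length(t_full) == length(t_minus), all(t_full > 0))
  sum((t_full - t_minus)^2 / t_full)
}

# Made-up dates for three nodes, for illustration only.
t_full <- c(10, 6, 3)
t_minus <- c(11, 6, 2.5)
D2 <- cv_D2(t_full, t_minus)
```

Dividing each squared difference by the date estimated with the full data makes the contributions of young and old nodes comparable; a tip whose removal changes no date gives a criterion of zero.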
5.5 Case Studies

In this section, we come back to some of the data prepared in Chapter 3. We see how we can estimate phylogenies, eventually repeat some analyses done in the original publications, and possibly see how we could go further with R.

5.5.1 Sylvia Warblers

To continue with the Sylvia data, it may be necessary to reload the data prepared and saved previously:

    load("sylvia.RData")

A distance matrix can be estimated from these aligned sequences using dist.dna; because 2 of the 25 sequences are substantially incomplete, we use the option pairwise.deletion = TRUE:

    syl.K80 <- dist.dna(sylvia.seq.ali, pairwise.deletion = TRUE)

We recall that the default model for this function is Kimura’s two-parameter one. We use the option model to try different models:

    syl.F84 <- dist.dna(sylvia.seq.ali, model = "F84",
                        pairwise.deletion = TRUE)
    syl.TN93 <- dist.dna(sylvia.seq.ali, model = "TN93",
                         pairwise.deletion = TRUE)
    syl.GG95 <- dist.dna(sylvia.seq.ali, model = "GG95",
                         pairwise.deletion = TRUE)

A way to compare these distance matrices is simply to look at their correlations. We do this by binding all distances in a single matrix, and compute the correlations among its columns (the results are rounded to three digits):

    > round(cor(cbind(syl.K80, syl.F84, syl.TN93, syl.GG95)), 3)
             syl.K80 syl.F84 syl.TN93 syl.GG95
    syl.K80    1.000   0.908    1.000    0.927
    syl.F84    0.908   1.000    0.911    0.686
    syl.TN93   1.000   0.911    1.000    0.925
    syl.GG95   0.927   0.686    0.925    1.000

This shows some substantial differences in the estimated distances. Note that a perfect correlation does not guarantee that the distances are the same: some graphical analyses are needed to check this. We do this to examine the saturation of substitutions in the sequences. We first compute the distances using the Jukes–Cantor model and the raw distance (i.e., proportion of different sites):
    syl.JC69 <- dist.dna(sylvia.seq.ali, model = "JC69",
                         pairwise.deletion = TRUE)
    syl.raw <- dist.dna(sylvia.seq.ali, model = "raw",
                        pairwise.deletion = TRUE)

We then plot these two distances in a simple plot expecting the raw distances to be smaller because they do not consider multiple substitutions on a single site; we also plot the Jukes–Cantor distance versus the Kimura one to show the potential influence of the transition/transversion ratio (Fig. 5.1):

    layout(matrix(1:2, 1))
    plot(syl.JC69, syl.raw)
    abline(b = 1, a = 0) # draw x = y line
    plot(syl.K80, syl.JC69)
    abline(b = 1, a = 0)

These plots show, as expected, that the most divergent sequences are slightly saturated, whereas the transition/transversion ratio does not seem to affect the estimated distances greatly.

[Figure 5.1: two scatterplots; left panel, syl.raw against syl.JC69; right panel, syl.JC69 against syl.K80.]

Fig. 5.1. Saturation plots for the cytochrome b sequences of 25 species of Sylvia showing the effects of multiple substitutions (left) and of the transition/transversion ratio (right)

A point we explore briefly is the impact of the choice of the substitution model on the phylogeny estimation with the NJ method. We estimate a tree with the function nj for each distance matrix:
    nj.sylvia.K80 <- nj(syl.K80)
    nj.sylvia.F84 <- nj(syl.F84)
    nj.sylvia.TN93 <- nj(syl.TN93)
    nj.sylvia.GG95 <- nj(syl.GG95)

To see whether the estimated topologies are the same, we compute the topological distances among them:

    > dist.topo(nj.sylvia.K80, nj.sylvia.F84)
    [1] 20
    > dist.topo(nj.sylvia.K80, nj.sylvia.TN93)
    [1] 0
    > dist.topo(nj.sylvia.K80, nj.sylvia.GG95)
    [1] 16
    > dist.topo(nj.sylvia.F84, nj.sylvia.TN93)
    [1] 20
    > dist.topo(nj.sylvia.F84, nj.sylvia.GG95)
    [1] 26
    > dist.topo(nj.sylvia.TN93, nj.sylvia.GG95)
    [1] 16

The same topology was obtained with Kimura's and with Tamura and Nei's models. We visualize the clades that are consistently observed with the different substitution models by computing the consensus tree with the function consensus, and plot it after replacing its tip labels (the GenBank numbers) with the species names (Fig. 5.2):

    sylvia.cons <- consensus(nj.sylvia.K80, nj.sylvia.F84,
                             nj.sylvia.GG95, nj.sylvia.TN93)
    sylvia.cons$tip.label <- taxa.sylvia[sylvia.cons$tip.label]
    plot(sylvia.cons, no.margin = TRUE)

We now do a bootstrap analysis like the one reported by Böhning-Gaese et al. [10] using boot.phylo directly:

    > nj.boot.sylvia <- boot.phylo(phy = nj.sylvia.K80,
    +     x = sylvia.seq.ali,
    +     FUN = function(xx) nj(dist.dna(xx, pairwise.deletion = TRUE)),
    +     B = 200)
    > nj.boot.sylvia
     [1] 200   9 181 192  96  40  74  78  75 194 196  91 193 185
    [15]  99 147  74 169 194 136  85 199  76

Note how the FUN argument is used here: because we resample the original aligned sequences, the tree is estimated by first computing the distances, then performing the neighbor-joining. We use 200 bootstrap replicates as in [10].
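The role of the FUN argument can be pictured with a small base-R sketch. It is only an illustration under stated assumptions, not the code of boot.phylo: the helper names boot_columns and raw_dist are hypothetical, the alignment is random toy data, and UPGMA clustering with hclust stands in for nj so that the example runs without ape.

```r
# Hypothetical helper: raw distance = proportion of differing sites
# between each pair of rows of a character matrix.
raw_dist <- function(x) {
  n <- nrow(x)
  d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n))
    for (j in seq_len(n))
      d[i, j] <- mean(x[i, ] != x[j, ])
  as.dist(d)
}

# Hypothetical helper: resample the columns (sites) with replacement
# and apply FUN to each resampled alignment, B times.
boot_columns <- function(x, FUN, B = 200) {
  lapply(seq_len(B), function(b) {
    cols <- sample(ncol(x), replace = TRUE)
    FUN(x[, cols, drop = FALSE])
  })
}

# Toy alignment: 4 sequences, 30 sites (random, for illustration only).
set.seed(1)
aln <- matrix(sample(c("a", "c", "g", "t"), 4 * 30, replace = TRUE),
              nrow = 4, dimnames = list(paste0("seq", 1:4), NULL))

# FUN goes from sequences to a tree: distances first, then clustering
# (hclust is a stand-in for nj here).
reps <- boot_columns(aln,
                     function(xx) hclust(raw_dist(xx), method = "average"),
                     B = 20)
length(reps)
```

The point of the design is that FUN must start from the sequences, not from a precomputed distance matrix, because each replicate needs its distances recomputed on the resampled sites.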
Fig. 5.2. Consensus tree for Sylvia based on four neighbor-joining trees estimated with different substitution models

How could these bootstrap values have been influenced by the fact that we deal with coding sequences? We can assess this by using the option block of boot.phylo; this will result in resampling at the codon level instead of at the site level:

    > nj.boot.sylvia.codon <- boot.phylo(nj.sylvia.K80, sylvia.seq.ali,
    +     function(xx) nj(dist.dna(xx, pairwise.deletion = TRUE)),
    +     200, 3)
    > nj.boot.sylvia.codon
     [1] 200  13 179 199  83  37  74  92  84 192 196  92 187 179
    [15]  99 135  71 167 197 134  86 199  91

The results are very close to the site-level resampling analysis; we thus consider the latter in the following.

We now plot the tree estimated by NJ with the bootstrap values on the nodes. We first copy the estimated tree, substitute the accession numbers (which were used as tip labels) with the species names, add to this tree the bootstrap values (as percentages), and finally root this unrooted tree using Chamaea fasciata as outgroup:

    nj.est <- nj.sylvia.K80
    nj.est$tip.label <- taxa.sylvia[nj.est$tip.label]
    nj.est$node.label <- nj.boot.sylvia / 2
    nj.est <- root(nj.est, "Chamaea_fasciata")
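The difference between site-level and codon-level resampling discussed above can be sketched in base R as follows. The function name resample_sites is hypothetical and the code is a plausible reading of what resampling by blocks means, not the actual implementation in boot.phylo.

```r
# Hypothetical helper: return column indices for one bootstrap replicate.
# With block = 1 single sites are drawn; with block = 3 whole codons
# (three consecutive sites) are drawn together.
resample_sites <- function(n_sites, block = 1) {
  stopifnot(n_sites %% block == 0)      # assume complete blocks only
  n_blocks <- n_sites / block
  # first column of each sampled block
  starts <- (sample(n_blocks, replace = TRUE) - 1) * block + 1
  # expand each start into its 'block' consecutive columns
  as.vector(sapply(starts, function(s) s:(s + block - 1)))
}

set.seed(42)
idx_site  <- resample_sites(12)             # site-level resampling
idx_codon <- resample_sites(12, block = 3)  # codon-level resampling
idx_codon
```

With block = 3 the three positions of a codon always travel together, so the replicate alignments keep the correlation among positions within codons.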
The tree is then plotted with plot, the bootstrap values are added with nodelabels, and we draw a scale bar (Fig. 5.3):

    plot(nj.est, no.margin = TRUE)
    nodelabels(nj.est$node.label, bg = "white")
    add.scale.bar(y = 0.5, length = 0.01)

Fig. 5.3. Phylogenetic relationships among 25 species of the genus Sylvia based on cytochrome b sequences analyzed with neighbor-joining and Kimura's two-parameter distance

The bootstrap values shown in Fig. 5.3 are very close to those obtained by Böhning-Gaese et al. [10]. It is interesting to note that the clades well supported by the bootstrap analysis were also those that were consistently found in the four trees estimated by NJ with the different substitution models.

We finish by saving the final tree in a file using the Newick format:

    write.tree(nj.est, "sylvia_nj_k80.tre")

5.5.2 Phylogeny of the Felidae

To continue the analyses with the Felidae data [75], we first load the data previously prepared and saved:

    load("felid.RData")

We focus here on an analysis with phymltest. PHYML has been installed (this is a single executable file) in the same directory where the sequence file has been saved (which is also set as R's working directory). The command for the present analysis is thus simply:
    phymltest.felid <- phymltest("felidseq16S.phy",
                                 execname = "phyml_linux")

This takes a few minutes to run on a PC with a processor at 3 GHz and 521 Mb of cache memory, and 2 Gb of RAM memory. Displaying the results shows the log-likelihood and AIC values for each model:

    > phymltest.felid
              nb.free.para    loglik      AIC
    JC69                 1 -2301.182 4604.364
    JC69+I               2 -2162.121 4328.243
    JC69+G               2 -2151.150 4306.300
    JC69+I+G             3 -2144.011 4294.023
    K80                  2 -2174.653 4353.306
    K80+I                3 -2031.520 4069.040
    K80+G                3 -2012.344 4030.688
    K80+I+G              4 -2001.572 4011.144
    F81                  4 -2301.058 4610.116
    F81+I                5 -2161.592 4333.185
    F81+G                5 -2143.511 4297.022
    F81+I+G              6 -2139.011 4290.023
    F84                  5 -2163.884 4337.768
    F84+I                6 -2013.974 4039.949
    F84+G                6 -1993.178 3998.356
    F84+I+G              7 -1985.373 3984.745
    HKY85                5 -2170.704 4351.408
    KHY85+I              6 -2018.443 4048.886
    HKY85+G              6 -1997.119 4006.238
    HKY85+I+G            7 -1988.521 3991.041
    TN93                 6 -2138.149 4288.298
    TN93+I               7 -2002.168 4018.336
    TN93+G               7 -1977.770 3969.539
    TN93+I+G             8 -1972.596 3961.192
    GTR                  9 -2132.276 4282.553
    GTR+I               10 -1998.004 4016.009
    GTR+G               10 -1973.184 3966.367
    GTR+I+G             11 -1967.919 3957.838

The summary function computes all possible paired likelihood ratio tests (211 tests):

    > summary(phymltest.felid)
      model1   model2       chi2 df  P.val
    1   JC69   JC69+I 278.121860  1 0.0000
    2   JC69   JC69+G 300.064594  1 0.0000
    3   JC69 JC69+I+G 314.341858  2 0.0000
    4   JC69      K80 253.058660  1 0.0000
    5   JC69    K80+I 539.323882  2 0.0000
    ....

We can plot these results to have a more synthetic view (Fig. 5.4):

    plot(phymltest.felid)

Fig. 5.4. Results of the analysis of 16S mitochondrial sequences from 35 species of Felidae and two other carnivores with phymltest (plot title: Akaike information criterion for phymltest.felid; the models are ordered along an AIC axis running from about 4600 down to 4000)

The most complex model, GTR + I + Γ, is the one that best explains the data in terms of AIC. An interesting pattern from Fig. 5.4 is that, for a given substitution model X, adding invariants (I) considerably improves the fit, adding Γ improves it even more, and adding both is better again; thus, in terms of AIC, there is a hierarchy X >>> X + I >> X + Γ > X + I + Γ. When comparing the substitution models, the key element seems to be taking the transition/transversion ratio into account. Once this has been included in the model (K80 being the simplest one), taking unequal base frequencies into account is also important, although less so than the previous parameter.

Once the analysis with phymltest has been done, it is possible to read the trees estimated by PHYML:

    tr <- read.tree("felidseq16S.phy_phyml_tree.txt")

This file contains the 28 trees estimated by PHYML, the last one being the one estimated with the most complex model. We extract this tree, substitute its tip labels to get the species names in place of the accession numbers, root the tree with Galidia elegans as outgroup, remove the two non-felid species, and plot the final tree (Fig. 5.5):
    mltree.felid <- tr[[28]]
    mltree.felid$tip.label <- taxa.felid[mltree.felid$tip.label]
    mltree.felid <- root(mltree.felid, "Galidia_elegans")
    mltree.felid <- drop.tip(mltree.felid,
                             c("Crocuta_crocuta", "Galidia_elegans"))
    plot(mltree.felid)
    add.scale.bar(length = 0.01)

Fig. 5.5. Maximum likelihood estimate of the extant Felidae using 16S mitochondrial sequences with GTR + I + Γ

From this ML estimate of the phylogeny of the Felidae, we can now estimate a chronogram with the NPRS method [137]. Johnson and O'Brien [75] wrote that extant felids last shared a common ancestor 10–15 million years ago. We use the midpoint of this range as the age of the root. We plot the estimated chronogram and draw the time axis with axisPhylo (Fig. 5.6):

    felid.chrono <- chronogram(mltree.felid, scale = 12.5)
    par(mar = c(2, 0, 0, 0))
    plot(felid.chrono, cex = 0.8)
    axisPhylo()

We save this chronogram for further analysis:

    write.tree(felid.chrono, "felid.chrono.tre")
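Returning briefly to the model comparison made with phymltest, the AIC values and likelihood ratio tests reported in its output can be checked by hand from the log-likelihoods. The sketch below uses only base R; the helper names aic and lrt are hypothetical, and the numbers are those printed above for JC69 and JC69+I.

```r
# AIC from a log-likelihood and a number of free parameters k.
aic <- function(loglik, k) 2 * k - 2 * loglik

# Likelihood ratio test between two nested models
# (loglik0: simpler model, loglik1: more complex model,
#  df: difference in the number of free parameters).
lrt <- function(loglik0, loglik1, df) {
  chi2 <- 2 * (loglik1 - loglik0)
  c(chi2 = chi2, df = df,
    P.val = pchisq(chi2, df, lower.tail = FALSE))
}

aic(-2301.182, 1)                      # JC69
res <- lrt(-2301.182, -2162.121, 1)    # JC69 against JC69+I
res
```

Because the log-likelihoods are printed with three decimals, the recomputed statistics agree with the phymltest output only up to rounding.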
Fig. 5.6. Chronogram of the extant Felids estimated with the NPRS method (time axis from 12 to 0)

5.5.3 Butterfly DNA Barcodes

We have 466 aligned sequences of COI: we limit ourselves here to simple analyses. Hebert et al. [67] showed that there seem to be several species (ten, actually) instead of the one originally recognized. We compute the pairwise distances between all specimens with dist.dna. We take care to use the option pairwise.deletion = TRUE because many sequences do not have the same length:

    M.astraptes.K80 <- dist.dna(astraptes.seq.ali,
                                pairwise.deletion = TRUE)

We look at the distribution of the distances using summary:

    > summary(M.astraptes.K80)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.00000 0.01590 0.02107 0.02749 0.03887 0.08326

As a comparison, we can look at the summary of the distances without the option pairwise.deletion = TRUE:

    > summary(dist.dna(astraptes.seq.ali))
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.00000 0.00000 0.00000 0.00122 0.00000 0.07155

This shows that most distances would be equal to zero because only a few sites remain after removing all those with at least one missing value (which is the default of dist.dna). We may plot a histogram of the 108,345 distances (Fig. 5.7):
    hist(M.astraptes.K80)

Fig. 5.7. Distribution of pairwise distances among 466 specimens of Astraptes fulgerator based on cytochrome oxidase I sequences analyzed with Kimura's two-parameter distance (x-axis: M.astraptes.K80; y-axis: Frequency)

This clearly shows three peaks in the distribution: at 0, around 0.02, and around 0.07. This is in complete agreement with Hebert et al.'s results, which showed that these peaks correspond to differentiation within populations, within species, and between species, respectively.

It is possible to estimate an NJ tree with the distance matrix to assess how the different taxa are differentiated:

    tr <- nj(M.astraptes.K80)
    tr$tip.label <- taxa.astraptes[tr$tip.label]

The resulting tree is a bit too large to be displayed with plot.phylo, so we may use zoom instead. For this we have to find the indices of each taxon in the vector of tip labels. Here is a possible solution:

    taxon <- unique(taxa.astraptes)
    L <- list()
    length(L) <- 10
    for (i in 1:10)
        L[[i]] <- grep(taxon[i], tr$tip.label)

We can now use L as an argument to zoom. We may plot all the subtrees at once in a large PDF file with:

    pdf("astraptes.pdf", width = 30, height = 30)
    zoom(tr, L)
    dev.off()
and then open it with an appropriate viewer. Each taxon can be visualized separately with, for instance, zoom(tr, L[1]).

5.6 Perspectives

The capabilities of R to estimate phylogenies are still limited compared to programs such as Phylip or PAUP*; however, there are good reasons to continue the current development of these methods.

• Some methods are easily implemented because the needed functions already exist in R. For instance:
  – The implementation of Bayesian methods should be eased by the functionalities already present in ape (computation of tree likelihood, generation of random trees) and other packages (random numbers, probability density functions);
  – The flexibility of R for reading and manipulating various kinds of data will ease the implementation of new methods of phylogeny estimation, such as those based on genomic rearrangements [88].
• R has many functionalities for efficient computation, particularly for large data sets, which are useful in the estimation of large phylogenies [58, 144, 152, 159].
• The integration of phylogeny estimation with other facets of phylogenetics, such as tree drawing (Chapter 4) or analysis of macroevolution (Chapter 6), is a very useful feature for users.
• The implementation of different methods in different programs makes their comparison difficult, because even the implementation of the same method in different programs could result in substantial differences among the results.

The last point has rarely been considered in the phylogenetic literature, although it has been demonstrated that even simple computational tasks (such as computing a sample variance) may give very different results depending on the statistical package [98, 97]. Kosiol and Goldman [85] showed that analyzing the same protein sequences with the same method but using different packages resulted in differences that would be considered statistically significant.

5.7 Exercises

1. Consider a DNA sequence that evolves according to the Jukes–Cantor (JC69) model.
   (a) Build the corresponding rate matrix using for the overall rate of change the value $3 \times 10^{-4}$.
   (b) Compute, using two different approaches, the probability matrix for $t = 1$, $t = 1000$, and $t = 1 \times 10^{6}$. What do you observe? Was that expected?
   (c) What could you conclude about phylogeny estimation from this exercise?
2. Consider a GTR model with the following parameters: $\alpha = 0.001$, $\beta = 5 \times 10^{-4}$, $\gamma = 2 \times 10^{-4}$, $\delta = 3 \times 10^{-4}$, $\epsilon = 1 \times 10^{-4}$, $\zeta = 5 \times 10^{-5}$, $\pi_A = 0.35$, $\pi_G = 0.17$, $\pi_C = 0.25$, and $\pi_T = 0.23$.
   (a) Build the corresponding rate matrix.
   (b) Compute the probability matrix for $t = 1$.
   (c) Find a method to simulate the evolution of a DNA sequence under this GTR model for an arbitrary $t$.
   (d) What are the expected base frequencies when $t$ is very large?
3. Sketch a function doing Bayesian estimation of phylogeny. The code should include comments explaining the rationale of the choices.
4. Take the data prepared in Exercise 5 of Chapter 3.
   (a) Build saturation diagrams for the whole sequence, and for each codon position.
   (b) Examine graphically the effects of unequal transition and transversion rates and/or unequal base frequencies on the distance estimates for each data set (whole sequences and each codon position).
5. Analyze the data prepared in Exercise 6 of Chapter 3. Use the function phymltest, and compare the results with those from the Felidae analysis above.
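As background for the saturation diagrams of Section 5.5.1 and Exercise 4, the following base-R sketch shows why raw distances fall below the x = y line. It relies on the standard Jukes–Cantor correction, d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites; the function name jc69_dist is hypothetical and the values of p are arbitrary illustrations, not data from the book.

```r
# Jukes-Cantor (JC69) distance from the raw proportion p of differing sites.
# Valid for 0 <= p < 0.75.
jc69_dist <- function(p) -0.75 * log(1 - 4 * p / 3)

p <- c(0.01, 0.05, 0.10, 0.15)   # arbitrary raw distances
d <- jc69_dist(p)

# The gap between corrected and raw distances widens as p increases,
# which is the signature of saturation.
cbind(raw = p, JC69 = d, gap = d - p)
```

Plotting p against d for a real alignment gives the left-hand panel of a saturation plot such as Fig. 5.1.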
6 Analysis of Macroevolution with Phylogenies

Reconstructing the history of species is a necessary step in understanding the mechanisms of biological evolution. Once a phylogeny has been estimated, many questions on how species have evolved can be addressed. Why are some taxonomic groups more diverse than others? How have species traits evolved? Have some traits favored diversification? Are some traits linked through evolution? In contrast to the field of molecular evolution, which is recent in the history of science, these questions are old issues that were already the subject of lively debate in the nineteenth century.

The remarkable development of phylogenetics during the past decades has renewed interest in these long-standing issues, and led to the development of new analytical methods to address them. This chapter presents these methods. Their common feature is that they take an estimated tree as raw data. The first section presents methods to analyze species data in a phylogenetic framework, the second one, methods that estimate ancestral characters, and the third one, methods to analyze diversification above the species level. All sections consider, in most cases, ultrametric trees with dated nodes as a key element of raw data.

6.1 Phylogenetic Comparative Methods

Comparing observations made on different species is an intuitive and appealing approach that certainly dates back to antiquity [64]. For instance, if some combinations of traits are consistently associated across several species, this could suggest that evolutionary forces, such as selection, shaped these associations. However, nonrandom associations of some traits among some species may be due to common heritage from their ancestor, and thus concomitant change through time cannot be inferred [131]. Conversely, if characters have evolved randomly without association, more closely related species are more likely to be similar than others, thus creating apparent relationships among characters [36].
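The last point can be made concrete with a deliberately artificial example in base R. All numbers are invented for illustration: six species fall into two clades, A and B; each clade has inherited a different baseline for both traits (the hypothetical vector shift), and within each clade the two traits are constructed to be exactly uncorrelated.

```r
# Within-clade deviations, chosen so that their covariance is exactly zero.
dev_x <- c(-1, 0, 1)
dev_y <- c(1, -2, 1)

clade <- rep(c("A", "B"), each = 3)
shift <- c(A = 0, B = 10)          # baseline inherited from each clade's ancestor

x <- shift[clade] + rep(dev_x, 2)
y <- shift[clade] + rep(dev_y, 2)

cor_all    <- cor(x, y)            # correlation across all six species
cor_within <- tapply(seq_along(x), clade,
                     function(i) cor(x[i], y[i]))   # correlation inside each clade
cor_all
cor_within
```

The strong overall correlation comes entirely from the single difference between the two clades, which is why species cannot be treated as independent observations.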
It is consequently necessary to consider the phylogenetic relationships among species when analyzing their characters. Several attempts in this direction were made early on by considering partial phylogenetic information such as taxonomic information (see [64] for a review). With the growing availability of complete phylogenies with estimated branch lengths, it is now possible to go further [36].

From an analytical perspective, two issues may be addressed when incorporating phylogeny into comparative data:

• Taking interspecies nonindependence into account when studying traits and their relationships, and
• Estimating the parameters of character evolution.

Both issues are tightly connected. It is indeed important to realize that the impact of phylogeny on trait distributions depends not only on phylogeny but also on the way these traits evolve. Particular emphasis has been given to the first issue because traditional comparative methods (i.e., without phylogeny) have been widely used for decades [65]. The methods devised to "correct for phylogenetic dependence" usually assume a simple model of character evolution: Brownian motion for continuous characters, or parsimonious change for discrete ones. However, even if these models do not apply to a particular situation, phylogeny is still important in the distribution of species traits [62].

When estimating parameters of character evolution, a model must be formulated explicitly and fitted to the data (the characters and the tree), usually by maximum likelihood. Several models can be fitted to the same data set and compared with the usual statistical techniques (e.g., likelihood ratio tests, or information criteria).

Table 6.1 lists the methods currently available in ape and ade4 together with their main features. Most of these methods do not specifically require an ultrametric tree: different sets of branch lengths may be used, implying different assumptions on rates of evolution [48, 54]. The branch lengths may be modified, or even created if the tree has none, with compute.brlen.

Table 6.1. Comparative methods implemented in R and their main features
Columns (methods): PIC; Autoregression; Autocorrelation; Multivariate decomposition; GLS; GEE; Mixed; OU
Rows (features): Correct for phylogenetic dependence; Estimate evolutionary parameters; Univariate; Relationships among variables; Continuous variables; Categorical variables; Allow multichotomies

6.1.1 Phylogenetically Independent Contrasts

Felsenstein [36] was probably the first to propose a method that fully takes phylogeny into account in the analysis of comparative data. The idea behind the "contrasts"¹ method is that, if we assume that a continuous trait evolves randomly in any direction (i.e., the Brownian motion model), then the "contrast" between two species is expected to have a distribution centered on zero, and a variance proportional to the time since divergence. If the contrasts are scaled with the latter, then they have a variance equal to one. A contrast is computed with [36]:

$$C_{ij} = \frac{x_i - x_j}{d_{ij}}, \tag{6.1}$$

where $x_i$ and $x_j$ are the values of the trait observed on species $i$ and $j$, and the distance between both species, $d_{ij}$, is measured on the tree. This is straightforward if $x_i$ and $x_j$ are observed on recent species, but it can also be done for internal nodes because, under the assumptions of the Brownian model, the ancestral state of the variable can be calculated; a rescaling of the internal branches is then needed [36].

In this formulation, the tree needs to be binary (fully dichotomous), and a contrast is computed for each node. Thus for $n$ species, $n - 1$ contrasts will be computed. The contrasts are independent with respect to the phylogeny (unlike the original values of $x$), and standard statistical methods for continuous variables can be used.

The method of phylogenetically independent contrasts (PICs) is implemented in the function pic. This function computes the PICs given a tree and a vector of values. The result is a vector of numeric values with the computed PICs. As a simple example we take a data set analyzed by Lynch [94] consisting of the log-transformed body mass and longevity of five species of primates.

    > tree.primates <- read.tree("primfive.tre")
    > body <- c(4.09434, 3.61092, 2.37024, 2.02815, -1.46968)
    > longevity <- c(4.74493, 3.3322, 3.3673, 2.89037, 2.30259)
    > names(body) <- names(longevity) <- c("Homo",
    +     "Pongo", "Macaca", "Ateles", "Galago")
    > pic.body <- pic(body, tree.primates)
    > pic.longevity <- pic(longevity, tree.primates)
    > pic.body
           -1        -2        -3        -4
    3.3583189 1.1929263 1.5847416 0.7459333

¹ Phylogenetically independent contrasts, often called "contrasts" in the phylogenetic
literature, are related to the statistical contrasts used in analysis of variance and other methods (see ?contrasts in R) in the sense that they both consider contrasts in expected means.

    > pic.longevity
           -1        -2        -3        -4
    0.8970604 0.8678969 0.7176125 2.1798897

We plot the tree and show the values of the PICs with nodelabels (Fig. 6.1):

    plot(tree.primates)
    nodelabels(round(pic.body, 3), adj = c(0, -0.5), frame = "n")
    nodelabels(round(pic.longevity, 3), adj = c(0, 1), frame = "n")

Fig. 6.1. A tree of five primate genera showing phylogenetically independent contrasts of ln(body mass) and ln(longevity), above and below, respectively

A plot of the two sets of PICs shows no clear relationship between them (Fig. 6.2):

    plot(pic.body, pic.longevity)
    abline(a = 0, b = 1, lty = 2) # x = y line

This is confirmed by a correlation and a simple regression:

    > cor(pic.body, pic.longevity)
    [1] -0.5179156
    > lm(pic.longevity ~ pic.body)

    Call:
    lm(formula = pic.longevity ~ pic.body)

    Coefficients:
    (Intercept)     pic.body
         1.6957      -0.3081

Fig. 6.2. Plot of the four pairs of contrasts from Fig. 6.1; the dashed line is x = y

Garland et al. [48] recommended that linear regressions with PICs should be done through the origin (i.e., the intercept is set to zero). It is clear from Fig. 6.2 that the result will be different if their suggestion is followed:

    > lm(pic.longevity ~ pic.body - 1)

    Call:
    lm(formula = pic.longevity ~ pic.body - 1)

    Coefficients:
    pic.body
      0.4319

None of the above coefficients is significantly different from zero, which is hardly surprising considering the small sample size. Doing the regression among PICs through the origin is justified if the characters evolve under a Brownian motion model and there is a linear relation between them [48]. However, this is likely to ignore a possible nonlinear relationship [127]. In all cases, it seems wise to plot the PICs as done here.

Purvis and Garland [124] introduced a modification of Felsenstein's [36] method in order to take multichotomies into account. This is not implemented in the function pic, but this may be done by combining this function with others such as multi2di (Section 3.4.3). There are alternative approaches, such as generalized least squares, to cope with multichotomies with continuous traits (Section 6.1.5).

6.1.2 Phylogenetic Autoregression

If it is postulated that species are not independent because of their phylogenetic relationships, then the latter may be used to quantify the association between the variables observed on the species. This approach was used by Cheverud, Dow, and Leutenegger [18] and later refined by Rohlf [132]. It is based on the following model:

$$x = \rho W x + \epsilon, \tag{6.2}$$

where $x$ is the studied variable, $W$ is a connectivity matrix based on the phylogeny, $\rho$ is a parameter, and $\epsilon$ is the variation not explained by the phylogeny. The rows of $W$ sum to one and the values indicate the "distance" between the different species (the diagonal elements are thus equal to zero). The parameter $\rho$ is estimated from the data: positive values indicate an influence of the phylogeny on $x$, whereas negative values indicate the opposite (distantly related species are more similar). If $\rho = 0$, then the phylogeny has no influence on $x$. The variation in $x$ explained by the phylogeny can be calculated as [18]:

$$R^2 = 1 - \frac{\mathrm{Var}(\epsilon)}{\mathrm{Var}(x)}. \tag{6.3}$$

Cheverud et al.'s [18] method, including Rohlf's [132] correction, is implemented in the function compar.cheverud. This function takes as arguments a numeric vector and a matrix that is transformed to give the connectivity matrix. The matrix given to the function could be a correlation matrix (obtained with vcv.phylo) or a distance matrix (obtained with cophenetic): the results will be the same.

Let us consider again the small primate data set. The correlation matrix is obtained with the function vcv.phylo:

    > W <- vcv.phylo(tree.primates, cor = TRUE)
    > CM.prim <- compar.cheverud(body, W)
    > CM.prim
    $rhohat
    [1] -2.623383

    $Wnorm
               Homo      Pongo    Macaca    Ateles    Galago
    [1,] 0.00000000 0.09051724 0.2112069 0.2672414 0.4310345
    [2,] 0.09051724 0.00000000 0.2112069 0.2672414 0.4310345
    [3,] 0.18846154 0.18846154 0.0000000 0.2384615 0.3846154
    [4,] 0.21678322 0.21678322 0.2167832 0.0000000 0.3496503
    [5,] 0.25000000 0.25000000 0.2500000 0.2500000 0.0000000

    $residuals
                [,1]
    Homo   -1.681081
    Pongo  -2.049707
    Macaca -1.740552
    Ateles -1.296137
    Galago -1.237742

The result is a list with three elements: the estimated value of $\rho$ (rhohat), the normalized matrix $W$ (Wnorm), and the estimated residuals $\epsilon_i$ (residuals). The proportion of variation explained by the phylogeny is thus:

    > 1 - var(CM.prim$residuals) / var(body)
              [,1]
    [1,] 0.9763006

This analysis suggests a negative influence of phylogeny on the distribution of body mass in these primates. This is quite nonintuitive, but looking at the contrasts calculated in the previous section, we can see they are all positive. This suggests that there is a trend in the evolution of body mass, and thus the Brownian motion model does not apply.

6.1.3 Autocorrelative Models

Gittleman and Kot [53] introduced a method close to Cheverud et al.'s [18] but based on an autocorrelation approach. This uses Moran's autocorrelation index $I$ [102]:

$$I = \frac{n}{S_0}\,\frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \tag{6.4}$$

$$S_0 = \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}, \tag{6.5}$$

where $w_{ij}$ is the distance between species $i$ and $j$, and $\bar{x}$ is the observed mean of $x$. This is somewhat similar to the correlation between two variables, but instead looks at different values of the same variable (in the present context, measured on different species), and where each pair is weighted with $w$. Because it is expected that more closely related species are more similar, the latter
can be derived from the phylogeny. Gittleman and Kot [53] proposed that in the absence of an accurate phylogeny, the weights can be derived from the taxonomy.

In the absence of phylogenetic autocorrelation, the expected value of $I$ and its variance are known [53]. It is thus possible to test the null hypothesis of the absence of dependence among observations.

Gittleman and Kot's [53] method is implemented in the function Moran.I. Considering the small primate data set, the distances between species can be computed with the function cophenetic:

    > Moran.I(body, cophenetic(tree.primates))
    $observed
    [1] -0.4250254

    $expected
    [1] -0.25

    $sd
    [1] 0.0743147

    $p.value
    [1] 0.01851316

The result is a list with four elements: the observed value of $I$ (observed), its expected value under the null hypothesis of no correlation (expected), the standard deviation of the observed $I$ (sd), and the P-value of the test of the null hypothesis (p.value). In agreement with the autoregression analysis, a negative autocorrelation was found. Note that the expected value is negative (−0.25): this is not really intuitive, but in the absence of correlation among observations, the expected value of Moran's autocorrelation coefficient is negative (see [102]).

ade4 has the function gearymoran that computes Moran's coefficient and tests its significance with a randomization procedure. The two main arguments of this function are a distance matrix and a data frame with one or several vectors. The option nrepet specifies the number of replications of the randomization test (999 by default). We leave this option at its default for the present analysis:

    > gearymoran(cophenetic(tree.primates),
    +            data.frame(body, longevity))
    class: krandtest
    test number:  2
    permutation number:  999
           test    obs P(X<=obs) P(X>=obs)
    1      body -0.423     0.014     1
    2 longevity -0.339     0.166     0.849
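Equation 6.4 can also be evaluated by hand, which helps to see what the index measures. The base-R sketch below uses a hypothetical function name, moran_i, and a toy weight matrix rather than the primate tree; it follows Eqs. 6.4 and 6.5 literally, so its value need not coincide with that of Moran.I or gearymoran, which may transform or normalize the weights before computing the index.

```r
# Moran's I computed literally from Eqs. 6.4 and 6.5.
# x: numeric vector of trait values; w: n x n matrix of weights
# with a zero diagonal.
moran_i <- function(x, w) {
  n  <- length(x)
  y  <- x - mean(x)            # deviations from the observed mean
  s0 <- sum(w)                 # Eq. 6.5
  (n / s0) * sum(w * outer(y, y)) / sum(y^2)
}

# Toy example: four species on a "chain", neighbours get weight 1.
x <- c(1, 2, 3, 4)
w <- matrix(0, 4, 4)
w[abs(row(w) - col(w)) == 1] <- 1

moran_i(x, w)

# Expected value of I under the null hypothesis of no autocorrelation.
expected_i <- function(n) -1 / (n - 1)
expected_i(5)                  # five species, as in the primate example
```

The second function recovers the expected value of −0.25 reported above for five species, which makes the negative expectation less mysterious: it depends only on the number of observations.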
The result for body mass is very close to the one obtained with Moran.I. For longevity, the latter function gives:

    > Moran.I(longevity, cophenetic(tree.primates))
    $observed
    [1] -0.3182082

    $expected
    [1] -0.25

    $sd
    [1] 0.0734518

    $p.value
    [1] 0.3530901

For this variable, the coefficients computed by both functions are very close, but the P-values are somewhat different, although neither is significant.

Gittleman and Kot [53] suggested the use of correlograms to visualize the results of phylogenetic autocorrelative analyses. The idea is to look at the correlation at different distance categories. This can be done even in the absence of a complete phylogeny using taxonomic levels. If a phylogeny is available, then at least two distance categories must be defined. Both methods (with taxonomic levels or with a phylogeny) are implemented in two functions: correlogram.formula and correlogram.phylo, respectively. The options in these two functions are slightly different.

As an example, we take the data compiled by Gittleman [52] on 112 species of carnivores. This includes various life-history variables as well as taxonomic levels (species, genus, family, superfamily, and order). We consider (as in [53]) the correlation levels in mean body mass at the various taxonomic levels. The function correlogram.formula requires a formula where the levels are separated with slashes:²

    > data(carnivora)
    > correl.carn <- correlogram.formula(
    +     log10(SW) ~ Order/SuperFamily/Family/Genus,
    +     data = carnivora)
    > correl.carn
    $obs
    [1]  0.614371364  0.404715752 -0.266621894 -0.001377008

    $p.values
    [1] 9.529087e-07 0.000000e+00 0.000000e+00 5.432094e-01

² This is the usual notation to specify nested effects in R's formulae.
    $labels
    [1] "Genus"       "Family"      "SuperFamily" "Order"

    attr(,"class")
    [1] "correlogram"

The returned object is of class "correlogram"; there is a plot method for this class (Fig. 6.3):

    plot(correl.carn)

Fig. 6.3. Phylogenetic correlogram of ln(body mass) among 112 species of carnivores; the filled circles indicate the significant coefficients (P < 0.05) (x-axis: Rank, from Genus to Order; y-axis: I/Imax)

The correlation coefficient at the "Genus" level is computed among pairs of species belonging to the same genus, and the same for those at the "Family", "SuperFamily", and "Order" levels.

6.1.4 Multivariate Decomposition

Multivariate methods can be used to summarize the structure of phylogenetic trees, leading to possible measures of phylogenetic dependence. Diniz-Filho, de Sant'Ana, and Bini [26] developed a method they called phylogenetic eigenvector regression (PVR). Its principle is to do an eigendecomposition of the doubly centered matrix of among-species distances. A regression of the studied variable is then made on the matrix of eigenvectors. Diniz-Filho et al. [26] recommended first running a phylogenetic autocorrelation analysis (Section 6.1.3) to test for the presence of significant phylogenetic dependence. If the test is significant, this dependence may be quantified with PVR: the
number of eigenvectors used in the regression is selected according to the expectation under a broken-stick model.

Ollier, Couteron, and Chessel [108] proposed a related approach that differs substantially in the details. Instead of using a distance matrix, they use a matrix built from the topology of the tree. They then perform an orthonormal transform on this matrix, leading to a matrix that is a linear combination of their original matrix. They finally perform an eigendecomposition of the last matrix, keeping only the eigenvectors with positive eigenvalues, on which the studied variable is regressed.

The function variance.phylog in package ade4 implements Ollier et al.'s [108] method: it takes as main arguments an object of class "phylog" and a numeric vector. To perform the analysis with the primates data we first need to transform the tree of class "phylo" into one of class "phylog" (Section 3.4.5):

    > tpg <- newick2phylog(write.tree(tree.primates))
    > variance.phylog(tpg, body)
    $lm

    Call:
    lm(formula = fmla, data = df)

    Coefficients:
    (Intercept)           A1           A2
      2.139e-16   -8.685e-01   -1.371e-01

    $anova
    Analysis of Variance Table

    Response: z
              Df Sum Sq Mean Sq F value  Pr(>F)
    A1         1 3.7719  3.7719 56.2198 0.01733
    A2         1 0.0940  0.0940  1.4005 0.35825
    Residuals  2 0.1342  0.0671

    $sumry
                 Df  Sum Sq Mean Sq  F value  Pr(>F)
    Phylogenetic  2 3.86582 1.93291 28.81013 0.03355
    Residuals     2 0.13418 0.06709

The test of the phylogenetic dependence (or inertia) corresponds to the test of the linear model with the selected eigenvectors as predictors. We thus conclude that there is a significant phylogenetic inertia for body mass. The same analysis with longevity gives:

    > variance.phylog(tpg, longevity)
    $lm

    Call:
    lm(formula = fmla, data = df)

    Coefficients:
    (Intercept)           A1           A2
     -2.958e-16   -7.305e-01    1.226e-01

    $anova
    Analysis of Variance Table

    Response: z
              Df  Sum Sq Mean Sq F value Pr(>F)
    A1         1 2.66807 2.66807  4.2460 0.1755
    A2         1 0.07520 0.07520  0.1197 0.7624
    Residuals  2 1.25673 0.62837

    $sumry
                 Df  Sum Sq Mean Sq F value  Pr(>F)
    Phylogenetic  2 2.74327 1.37163 2.18285 0.31418
    Residuals     2 1.25673 0.62837

The test is in agreement with the results from the autocorrelation analysis.

Desdevises et al. [24] proposed a method close to Diniz-Filho et al.'s [26]: instead of selecting the eigenvectors according to a broken-stick model, they suggested selecting all statistically significant eigenvectors in the regression.

Giannini [50] proposed a method with a matrix coding the tree structure similar to the one used by Ollier et al. [108]: he then performed a linear regression of the studied variable on this matrix. The best subset of the "tree" matrix was selected using Monte Carlo permutations.

6.1.5 Generalized Least Squares

The method of generalized least squares (GLS) can be seen as an extension of the method of ordinary least squares. With the latter, observations are assumed to have the same variance, and covariances equal to zero. These assumptions are relaxed with GLS.

The use of GLS in comparative methods came as a way to generalize the contrasts approach. Grafen [54] first proposed this approach as a way to deal with multichotomies in trees and also as a way to integrate more complex models of multi-character evolution. He suggested a model where each node is given a height equal to the number of its descendant tips minus one; these heights are then scaled so that the root has height one, and the other heights are raised to power
$\rho$ (with $\rho > 0$). Grafen's model is actually similar to a Brownian motion model with modified branch lengths.

Under a Brownian motion model of character evolution, the covariance between species $i$ and $j$, denoted $v_{ij}$, is given by:

$$v_{ij} = \sigma^2 T_a, \tag{6.6}$$

where $T_a$ is the distance between the root and the most recent common ancestor of species $i$ and $j$, and $\sigma^2$ is the variance of the Brownian process.

Martins and Hansen [96] suggested the Ornstein–Uhlenbeck model where the covariance between two species is given by:

$$v_{ij} = \sigma^2 \exp(-\alpha d_{ij}), \tag{6.7}$$

where $\sigma^2$ is similar to the variance of the Brownian process, $\alpha$ specifies how "fast" the species characters diverge after speciation, and $d_{ij}$ is the distance between both species. We show the Ornstein–Uhlenbeck model again in Section 6.1.8.

The function gls in package nlme is used to fit models with GLS. This is a very general function that can accommodate correlation among observations and heterogeneous variance functions. The former is specified with an object of class "corStruct" (correlation structure): the variance–covariance matrix is then generated during the analysis through several functions called internally by gls.

Julien Dutheil introduced the idea of using the correlation structures of the package nlme to code phylogenetic correlation structures. The three models sketched above are specified with the functions corGrafen, corBrownian, and corMartins, respectively:

    corGrafen(value, phy, fixed = FALSE)
    corBrownian(value = 1, phy)
    corMartins(value, phy, fixed = FALSE)

where value is the parameter of the model, phy is an object of class "phylo", and fixed is a logical indicating whether the parameter is kept fixed at value or estimated from the data (the default). These functions return an object whose class attribute has three elements:

1. The name of the called function (i.e., "corGrafen", "corMartins", or "corBrownian");
2. "corPhyl";
3. "corStruct".

The last one is important because it allows us to fit these models with gls.³ An evolutionary model is then fitted like any linear model with GLS. For instance, coming back to the primate data, we first create a correlation structure that follows a Brownian motion model:

³ The package nlme is loaded when ape is started.
bm.prim <- corBrownian(phy = tree.primates)

We then fit the linear model where longevity is a function of body mass. A small data manipulation is required first: we create a data frame that includes the studied variables, to ease the way they are passed to gls:⁴

DF.prim <- data.frame(body, longevity)

We can now fit the model:

m1 <- gls(longevity ~ body, correlation = bm.prim, data = DF.prim)

We extract the details of the model fit with summary:

> summary(m1)
Generalized least squares fit by REML
  Model: longevity ~ body
  Data: DF.prim
       AIC      BIC   logLik
  17.48072 14.77656 -5.74036

Correlation Structure: corBrownian
 Formula: ~1
 Parameter estimate(s):
numeric(0)

Coefficients:
                Value Std.Error  t-value p-value
(Intercept) 2.5000672 0.7754516 3.224014  0.0484
body        0.4319328 0.2864904 1.507669  0.2288
....

Note that no parameter is estimated in the present correlation structure, hence the output numeric(0). In contrast to what was suggested by the plot of PIC values (Section 6.1.1), the relationship between variables now appears positive although not statistically significant. This underlines the difference between both methods: GLS focuses on the relationship between variables, whereas the PIC method focuses on the relationship between contrasts (i.e., between changes in the variables through the phylogeny). The two present variables are indeed strongly positively correlated:

> cor(body, longevity)
[1] 0.8296107

Footnote 4: When this data frame is created, the names of the vectors are used as row names (see p. 16); the latter are then matched with the tip labels of the tree, even if they are not in the same order.
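To make explicit what a GLS fit with a Brownian correlation structure amounts to, the following sketch computes the GLS estimator $\hat\beta = (X^T V^{-1} X)^{-1} X^T V^{-1} y$ directly in base R. The covariance matrix V, the vectors x_demo and y_demo, and all object names are hypothetical values chosen for illustration; they are not the primate data, and this is not how gls is implemented internally.

```r
# Hypothetical 4-species ultrametric tree ((A,B),(C,D)) of total depth 1.
# Under Brownian motion (6.6), v_ij is proportional to the distance between
# the root and the most recent common ancestor of i and j.
V <- matrix(c(1.0, 0.6, 0.0, 0.0,
              0.6, 1.0, 0.0, 0.0,
              0.0, 0.0, 1.0, 0.3,
              0.0, 0.0, 0.3, 1.0), nrow = 4, byrow = TRUE)
x_demo <- c(1.2, 1.5, 0.3, 0.9)   # illustrative predictor
y_demo <- c(2.9, 3.1, 2.0, 2.6)   # illustrative response

X <- cbind(1, x_demo)             # design matrix with an intercept column
Vinv <- solve(V)

# GLS estimator computed from its matrix formula
beta_gls <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y_demo)

# Equivalent route: "whiten" the data with the Cholesky factor of V,
# then apply ordinary least squares to the transformed variables
U <- chol(V)                                   # V = t(U) %*% U
Xw <- backsolve(U, X, transpose = TRUE)        # solves t(U) %*% Xw = X
yw <- backsolve(U, y_demo, transpose = TRUE)   # solves t(U) %*% yw = y
beta_ols_whitened <- lm.fit(Xw, yw)$coefficients
```

Both routes give the same coefficients, which shows why GLS can be read as ordinary least squares applied after removing the phylogenetic covariance.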
We now fit the Ornstein–Uhlenbeck model based on Martins and Hansen’s correlation structure to the same data:

> ou.prim <- corMartins(1, tree.primates)
> m2 <- gls(longevity ~ body, correlation = ou.prim,
+           data = DF.prim)
> summary(m2)
Generalized least squares fit by REML
  Model: longevity ~ body
  Data: DF.prim
       AIC      BIC    logLik
  17.81707 14.21152 -4.908536

Correlation Structure: corMartins
 Formula: ~1
 Parameter estimate(s):
   alpha
51.55332

Coefficients:
                Value Std.Error  t-value p-value
(Intercept) 2.5989768 0.3843447 6.762099  0.0066
body        0.3425349 0.1330977 2.573561  0.0822
....

The AIC value does not indicate an improvement compared to the Brownian model. Not surprisingly, the parameter estimates are very close in these two models.

6.1.6 Generalized Estimating Equations

The use of generalized estimating equations (GEEs) for the analysis of comparative data had two motivations: to deal easily with multichotomies, and to analyze categorical variables in a natural way [116].

GEEs were introduced by Liang and Zeger [92] as an extension of generalized linear models (GLMs) for correlated data. The correlation structure is specified through a correlation matrix. Similarly to GLMs, the model is specified with a link function g:

$$g(E[y_i]) = x_i^T \beta, \qquad (6.8)$$

However, the distinction comes from the way the variance–covariance matrix is given:

$$V = \phi A^{1/2} R A^{1/2}, \qquad (6.9)$$
where A is an n × n diagonal matrix defined by $\mathrm{diag}\{V(E[y_i])\}$: that is, a matrix with all its elements zero except the diagonal which contains the variances of the n observations expected under the (marginal) GLM, R is the correlation matrix of the elements of y, φ is the scale (or dispersion) parameter, and $V(E[y_i])$ is the variance function. These two components, φ and $V(E[y_i])$, are defined with respect to the distribution assumed for y in the same way as in a standard GLM. If the observations are independent, then R is an n × n identity matrix.

Beyond the technicalities of the GEE approach lies the possibility of analyzing different kinds of variables thanks to the GLM framework. The analysis is done with the function compar.gee. This uses the same interface as glm: the model is given as a formula, and the distribution of the response is specified with the option family. By default this option is "normal", thus we do not need to use it for the small primate data:

> compar.gee(longevity ~ body, phy = tree.primates)
[1] "Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27"
[1] "running glm to get initial regression estimate"
[1] 2.5989768 0.3425349

Call:
  formula: longevity ~ body

Number of observations:  5

Model:
 Link:                      identity
 Variance to Mean Relation: gaussian

Summary of Residuals:
       Min         1Q     Median         3Q        Max
-0.7275418 -0.4857216 -0.1565515  0.4373258  0.4763833

Coefficients:
             Estimate      S.E.        t Pr(T > |t|)
(Intercept) 2.5000672 0.4325167 5.780279  0.06773259
body        0.4319328 0.1597932 2.703074  0.17406821

Estimated Scale Parameter:  0.4026486
"Phylogenetic" df (dfP):  3.32

The output from compar.gee is very close to the one from gee; the former additionally prints the phylogenetic number of degrees of freedom (dfP). Some simulations showed that if the statistical tests on the regression parameters are done with a t-test with the usual residual number of degrees of freedom,
then type I error rates are inflated [116]. A solution to this problem is to correct the number of degrees of freedom with:

$$df_P = \frac{\sum_{\text{tree}} \text{branch length}}{\sum_{i=1}^{n} \text{distance from root to tip}_i} \times n, \qquad (6.10)$$

where n is the number of species in the tree. This correction was found empirically, and works in practice, but it still needs to be confirmed theoretically, and possibly refined.

6.1.7 Mixed Models and Variance Partitioning

In the literature on comparative methods, some emphasis is put on relationships among variables: many comparative analyses are motivated by establishing relationships among ecological or physiological variables [45, 46]. Lynch [94] pointed out that these approaches do not consider all the available information on the evolutionary process. He suggested instead shifting attention to the (co)variation of the traits by using an approach close to the one used in quantitative genetics to assess the different components of genetic variation. He proposed the following model:

$$x_i = \mu + a_i + e_i, \qquad (6.11)$$

where µ is the grand mean of the trait, $a_i$ comes from a normal distribution with variance–covariance matrix $\sigma_a^2 G$, where G is a correlation matrix derived from the phylogeny (we can write this as $a \sim N(0, \sigma_a^2 G)$), and the $e_i$ are independent normal variables so that $e \sim N(0, \sigma_e^2)$. This univariate model can be extended to several variables, in which case there are additional parameters, $\mathrm{Cov}_a$ and $\mathrm{Cov}_e$, namely the covariance explained by the phylogeny and the residual covariance, respectively [94].

Lynch [94] proposed an expectation–maximization (EM [23]) algorithm to fit model (6.11) by maximum likelihood, but this is very slow and becomes intractable with large sample sizes. Housworth et al. [70] proposed a reparameterization of (6.11) and a new algorithm to remedy this problem, but this applied only to uni- and bivariate cases.

Fitting model (6.11) is actually a difficult task. A possible explanation is that both components of variance are confounded and cannot be estimated separately. In mixed-effects models, variance components are usually estimated with different groups that are statistically independent, but observations within groups can be correlated [119]. With phylogenetic data, there is only one group, and thus $\sigma_a^2$ and $\sigma_e^2$ are confounded.

The function compar.lynch uses the EM algorithm proposed by Lynch [94] to fit model (6.11). We illustrate its use with the small primate data set. We first build a correlation matrix in the way seen previously:

> G <- vcv.phylo(tree.primates, cor = TRUE)
> compar.lynch(cbind(body, longevity), G = G)
$vare
           [,1]      [,2]
[1,] 0.04908818 0.1053366
[2,] 0.10533661 0.2674316

$vara
               body longevity
body      3.0018670 0.9582542
longevity 0.9582542 0.3068966

$A
           [,1]       [,2]
[1,]  2.5056671  0.8006949
[2,]  2.5705959  0.8201169
[3,]  1.1485439  0.3663313
[4,]  0.9654236  0.3065841
[5,] -2.7534270 -0.8779460

$E
            [,1]        [,2]
[1,]  0.34915743  0.89988706
[2,] -0.19919129 -0.53226494
[3,] -0.01781930 -0.04337929
[4,] -0.17678902 -0.46056213
[5,]  0.04423158  0.13618796

$u
     body longevity
 1.239433  3.044322

$lik
          [,1]
[1,] -12.21719

The results are returned as a list with six elements:

vare: the estimated residual variance–covariance matrix;
vara: the estimated additive effect variance–covariance matrix;
u: the estimates of the phylogeny-wide means;
A: the additive value estimates;
E: the residual value estimates;
lik: the log-likelihood.

6.1.8 The Ornstein–Uhlenbeck Model

The Brownian motion model assumes that continuous characters could diverge indefinitely after divergence from the same values. A more realistic model would be one where characters are constrained to evolve around a given value. A candidate model is the Ornstein–Uhlenbeck (OU) model. The quantity of character change along a short time interval dt according to a general OU model is [7, 86]:

$$dx_t = -\alpha (x_t - \theta)\,dt + d\epsilon_t, \qquad (6.12)$$

where α controls the strength of character evolution towards the “optimum” value θ, and $\epsilon_t \sim N(0, \sigma^2)$. If α = 0, the OU model reduces to a Brownian motion model. A discrete-time version of (6.12), in which the change is added to the current value, is:

$$x_{t+1} = x_t - \alpha (x_t - \theta) + \epsilon_t. \qquad (6.13)$$

It is straightforward to simulate an OU model in R using (6.13). If we set α = 0 and θ = 0, then we simulate a Brownian motion model with zero as initial value and $\sigma^2 = 1$, on 99 time-steps with:

x <- cumsum(c(0, rnorm(99)))

The OU equivalent with α = 0.2 and θ = 0 would be:

x <- numeric(100)
for (i in 2:100)
    x[i] <- x[i - 1] - 0.2 * x[i - 1] + rnorm(1)

To replicate the Brownian motion simulation, say five times, we can use the following code:

X <- replicate(5, cumsum(c(0, rnorm(99))))

For the OU version of this code, we first create a function that includes the commands above:

sim.ou <- function()
{
    x <- numeric(100)
    for (i in 2:100)
        x[i] <- x[i - 1] - 0.2 * x[i - 1] + rnorm(1)
    x # returns the value of x
}

The function can then be used in the same way as above:

X2 <- replicate(5, sim.ou())

It is interesting to look at the variance of the five replicates of each model:

> var(X[100, ])
[1] 75.83865
> var(X2[100, ])
[1] 0.8434638

A plot of the simulated values shows even more clearly the contrast between both models (Fig. 6.4):

layout(matrix(1:2, 1, 2))
yl <- range(X)
matplot(X, ylim = yl, type = "l", col = 1, main = "Brownian")
matplot(X2, ylim = yl, type = "l", col = 1, main = "OU")

Fig. 6.4. Simulations with five replicates of the Brownian motion (left) and Ornstein–Uhlenbeck models (right)

The function compar.ou fits a general OU model where θ may vary through the phylogeny [61]. The interface is:

compar.ou(x, phy, node = NULL, alpha = NULL)

where x is a numeric variable, phy is a tree (as an object of class "phylo"), node specifies the nodes where θ changes, and alpha is the value of α. The latter parameter is assumed to be constant throughout the phylogeny; only the optimum θ can change. When a node number is given in node, then it is assumed that the optimum changes at this point for all branches from this node. By default (i.e., if node = NULL), it is assumed that θ is the same for all branches.
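The simulations of this section can be gathered into a single function with explicit parameters. The sketch below is one possible way to do it, written so that α = 0 recovers the cumulative-sum random walk used for the Brownian case; the function name sim_ou and its arguments (n, alpha, theta, sigma, x0) are illustrative choices, not part of ape.

```r
# Discrete-time OU sketch: x[t+1] = x[t] - alpha * (x[t] - theta) + eps[t],
# with eps[t] ~ N(0, sigma^2). All names here are illustrative assumptions.
sim_ou <- function(n = 100, alpha = 0.2, theta = 0, sigma = 1, x0 = 0) {
    x <- numeric(n)
    x[1] <- x0
    for (i in seq_len(n - 1) + 1)   # i = 2, ..., n (empty loop if n = 1)
        x[i] <- x[i - 1] - alpha * (x[i - 1] - theta) + rnorm(1, sd = sigma)
    x
}

set.seed(1)                                  # reproducibility of the sketch only
bm <- replicate(200, sim_ou(alpha = 0))      # alpha = 0: a random walk
ou <- replicate(200, sim_ou(alpha = 0.2))    # values pulled back towards theta

# Variance among replicates at the last time step
v_bm <- var(bm[100, ])
v_ou <- var(ou[100, ])
```

With many replicates the contrast is easier to see than with five: the variance among random walks keeps growing with the number of steps, whereas under this recursion with 0 < α < 2 it settles around σ² / (1 − (1 − α)²), about 2.8 for α = 0.2 and σ = 1.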
By default, α is estimated from the data but this is not usually a good idea as the estimation is unstable. It is preferable to give a fixed value when fitting the model. Hansen [61] made similar observations on the instability of the estimates of α.

As a simple example with the primate data, we fit an OU model to the longevity data using α = {0.2, 2}:

> compar.ou(longevity, tree.primates, alpha = .2)
$deviance
[1] 17.87657

$para
       estimate    stderr
sigma2 8.218722 3.6762567
theta1 2.448405 0.4280387

$call
compar.ou(x = longevity, phy = tree.primates, alpha = 0.2)

> compar.ou(longevity, tree.primates, alpha = 2)
$deviance
[1] 12.42138

$para
        estimate    stderr
sigma2 0.7484398 0.3348018
theta1 3.0805691 0.3127302

$call
compar.ou(x = longevity, phy = tree.primates, alpha = 2)

The function returns the deviance (−2 × log-likelihood) of the model, the parameter estimates with their standard errors, and the function call recalling the fitted model. This example shows that the model with α = 2 fits better because its deviance is smaller, indicating that there is substantial constraint in the evolution of longevity. The estimated optimum, with its 95% confidence interval, is $\hat\theta = 3.08 \pm 0.62$, and the estimated variance of the OU process is $\hat\sigma^2 = 0.74 \pm 0.67$. The estimates of α and $\sigma^2$ are highly correlated, which could be the result of the small sample size.

6.1.9 Perspectives

Comparative methods have enjoyed great success during the past 20 years, both in terms of methodological and conceptual developments, and in terms of empirical applications. Much emphasis has been put on correcting for phylogenetic dependence in order to use standard statistical methods. There is
surely some gain in shifting attention to estimating evolutionary parameters inasmuch as the essence of comparative data is the evolutionary processes that generated them. In this respect, the Ornstein–Uhlenbeck model is likely to be an interesting alternative to the commonly used Brownian motion model [16].

R offers a wide range of phylogenetic comparative methods. Some methods not discussed here are:

• Garland et al. [47] developed a method based on simulations.
• Read and Nee [130] developed a method for the analysis of binary traits (e.g., presence or absence).
• Grafen and Ridley [55, 56, 57] developed similar methods for discrete characters.
• Huelsenbeck et al. [72] developed a Bayesian method to take phylogeny uncertainty into account.

These methods can be easily programmed in R.

6.2 Estimating Ancestral Characters

For some time, the estimation of ancestral characters was considered as a component of phylogeny estimation with parsimony methods, where deriving ancestral and derived characters is an essential step [39]. With the development of alternative methods where ancestral character values are not necessary (distance methods) or their probabilistic distribution is taken into account (likelihood methods), the estimation of ancestral values has become less critical in phylogeny estimation.

The use of phylogenies to test evolutionary hypotheses has created new interest in estimating ancestral character values. Many issues depend on how characters evolved from an ancestral value [42]. Some researchers have focused their attention on statistical methods of ancestral character estimation where uncertainty in the estimates is taken into account [107]. Ancestral character values are not observed, and thus it is more rational to consider them as parameters in a model where the character values of recent species are the observed variables. Consequently, the word “estimation” is preferable to “reconstruction”. In the same way, it is better to write “character values” rather than “character states” inasmuch as we consider both continuous and discrete characters (“state” implicitly refers to discrete characters).

ape has a single function to perform ancestral character estimation: ace. By default, ace performs estimation for continuous characters assuming a Brownian motion model fit by maximum likelihood. The options of ace have different effects depending on the types of character under study. In all cases a fully resolved phylogeny is required.

6.2.1 Continuous Characters

Two methods can be used for continuous characters: least squares (method = "pic"), and maximum likelihood (method = "ML", the default). The model of evolution is specified with the option model.

The least squares estimator follows from the phylogenetically independent contrasts method [36] (Section 6.1.1). This assumes a Brownian motion model of evolution: this allows us to compute the variance of each ancestral character estimate. A confidence interval can be computed with the usual formula $\hat{x}_a \pm 1.96 \sqrt{V(\hat{x}_a)}$, with $\hat{x}_a$ being the estimated ancestral value.

The maximum likelihood estimator under a Brownian motion model developed by Schluter et al. [138] uses a likelihood function where the ancestral values are parameters:

$$L(\sigma^2, x_a \mid T, x) = \frac{1}{\sigma^{n}} \exp\left[-\frac{1}{2\sigma^2} \sum \frac{(x_i - x_j)^2}{t_{ij}}\right], \qquad (6.14)$$

where $\sigma^2$ is the variance of the Brownian motion process, $x_a$ are the ancestral values, T is the phylogeny, and x are the observed values of the character at the tips of T. Once (6.14) has been maximized, the standard errors of $\hat\sigma^2$ and $\hat{x}_a$ are obtained with the second partial derivatives, and confidence intervals are computed as above. Note that $\sigma^2$ is also estimated.

Let us try these two methods on the body mass of the primate data set. We first fit a Brownian motion model with the default maximum likelihood method:

> ace(body, tree.primates)
$loglik
[1] -6.714469

$ace
[1] 1.183725 2.192018 2.571320 3.503182

$sigma2
[1] 1.9711502 0.6970463

$CI95
           [,1]     [,2]
[1,] -0.5058590 2.873308
[2,]  0.9868737 3.397163
[3,]  1.4844055 3.658235
[4,]  2.6858445 4.320519

$call
ace(x = body, phy = tree.primates)
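For intuition about maximum likelihood estimation of an ancestral value under Brownian motion, consider the simplest possible case: a star phylogeny in which every tip descends directly from the root. Assuming the exponent of the likelihood is a sum of squared changes divided by branch lengths, the root estimate is then a mean of the tip values weighted by the inverse branch lengths. The sketch below checks this numerically; the tip values, branch lengths, and object names are hypothetical, and ace itself handles general trees.

```r
# Hypothetical star phylogeny: each tip is attached directly to the root.
x_tip <- c(1.1, 2.4, 3.0, 3.6)    # illustrative tip values
t_tip <- c(1.0, 0.5, 2.0, 1.5)    # illustrative branch lengths

# Sum of squared changes scaled by branch length, as a function of the
# candidate root value xa
Q <- function(xa) sum((x_tip - xa)^2 / t_tip)

# Closed form: the minimiser of Q is a mean weighted by 1 / branch length
w <- 1 / t_tip
xa_hat <- sum(w * x_tip) / sum(w)

# Numerical check with a one-dimensional optimiser
xa_num <- optimize(Q, interval = range(x_tip))$minimum

# ML estimate of sigma^2 given xa_hat (n branches in a star tree with n tips),
# then a 95% interval for the root value: estimate +/- 1.96 * standard error
n <- length(x_tip)
sigma2_hat <- Q(xa_hat) / n
se_xa <- sqrt(sigma2_hat / sum(w))
ci95 <- xa_hat + c(-1.96, 1.96) * se_xa
```

Short branches receive more weight because a tip that has had little time to diverge carries more information about the ancestor.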
The results are returned as a list with the ancestral estimates (ace) and their 95% confidence intervals in a matrix (CI95); these values are indexed with the numbers of the node (see Section 3.1.1). With the default method, the function returns additionally the log-likelihood (loglik) and the estimated variance of the Brownian motion model with its standard error in a vector of length two (sigma2).

The option CI, whose default is TRUE, allows us to compute the 95% confidence intervals of the ancestral estimates. We now use the least squares method to fit the same model:

> ace(body, tree.primates, method = "pic")
$ace
      -1       -2       -3       -4
1.183725 2.780824 3.200378 3.852630

$CI95
          [,1]     [,2]
[1,] -1.296931 3.664381
[2,]  0.854866 4.706781
[3,]  1.367000 5.033757
[4,]  2.582428 5.122832

$call
ace(x = body, phy = tree.primates, method = "pic")

The least squares estimates are slightly larger than the maximum likelihood ones, particularly for the oldest nodes. Furthermore, the confidence intervals computed by maximum likelihood are usually narrower than those by least squares.

6.2.2 Discrete Characters

Markovian models provide a useful and practical tool for modeling the evolution of discrete characters [109]. We already have seen this framework with the substitution models of DNA sequences (Section 5.2.1). Because Markovian models have a probabilistic formulation, they can be fit by maximum likelihood and compared, for a given data set, with standard statistical methods.

ace allows the user to set a variety of models in a flexible way. Discrete characters are given as vectors or factors, and the option type = "discrete" must be specified. The option model is used to parameterize the transition rates among the states. The number of states is taken from the data (this can be seen with unique(x)).

The model is specified with a matrix of integers representing the indices of the parameters: 1 represents the first parameter, 2 the second one, and so on. The same number may appear several times in the matrix, meaning that the
rates have the same values. For instance, with a two-state character, model = matrix(c(0, 1, 1, 0), nrow = 2) specifies that the transitions among both states occur at equal rates, and so there is only one parameter to be estimated from the data. This is best visualized by printing the matrix (the diagonal is always ignored here):

> matrix(c(0, 1, 1, 0), nrow = 2)
     [,1] [,2]
[1,]    0    1
[2,]    1    0

If instead we use the following matrix,

> matrix(c(0, 1, 2, 0), nrow = 2)
     [,1] [,2]
[1,]    0    2
[2,]    1    0

then different rates are assumed for both changes, and there are two parameters. We may recall that in the rate matrix, the rows represent the initial states and the columns the final states.

If there are three states, some possible models could have the following rate matrices.

> matrix(c(0, 1, 1, 1, 0, 1, 1, 1, 0), nrow = 3)
     [,1] [,2] [,3]
[1,]    0    1    1
[2,]    1    0    1
[3,]    1    1    0
> matrix(c(0, 1, 2, 1, 0, 3, 2, 3, 0), nrow = 3)
     [,1] [,2] [,3]
[1,]    0    1    2
[2,]    1    0    3
[3,]    2    3    0
> matrix(c(0, 1:3, 0, 4:6, 0), nrow = 3)
     [,1] [,2] [,3]
[1,]    0    3    5
[2,]    1    0    6
[3,]    2    4    0

To indicate that a transition is impossible, a zero must be given in the appropriate cell of the matrix. For instance, a “cyclical” change model could be specified by:

> matrix(c(0, 0, 3, 1, 0, 0, 0, 2, 0), nrow = 3)
     [,1] [,2] [,3]
[1,]    0    1    0
[2,]    0    0    2
[3,]    3    0    0

where, if the three states are denoted A, B, C, the permitted changes are the following: A → B → C → A.

The number of possible models is very large, even with three states. The point is to let the user define the models that may be sensible for a particular study and test whether they are appropriate.

There are short-cuts with character strings that can be used instead of a numeric matrix. The possible short-cuts are:

• model = "ER" for the equal-rates model,
• model = "SYM" for the symmetrical model,
• model = "ARD" for the all-rates-different model.

For a three-state character, these short-cuts result in exactly the same rate matrices shown above, respectively. If the user sets type = "discrete", then the default model is "ER".

If the option CI = TRUE is used, then the likelihood of each ancestral state is returned for each node in a matrix called lik.anc. They are computed with a formula similar to (5.7), and scaled so that they sum to one for each node.

With the primate data, consider a character that sets Galago apart from the other genera (say “big eyes”). We first fit the default model (equal rates):

> x <- c(2, 2, 2, 2, 1)
> ace(x, tree.primates, type = "discrete")
$loglik
[1] -1.768921

$rates
[1] 0.3775508

$se
[1] 0.3058119

$lik.anc
             1         2
-1 0.304788504 0.6952115
-2 0.015697605 0.9843024
-3 0.019989199 0.9800108
-4 0.006221023 0.9937790

$call
ace(x = x, phy = tree.primates, type = "discrete")

The likelihoods of the states “big eyes” and “small eyes” at the root are 0.3 and 0.7, respectively. Under this model, it is highly likely that the three other nodes of the tree were “small eyes”.
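The index matrices behind the three short-cuts can be generated for any number of states. The sketch below is a hypothetical helper written in base R (the names rate_index and n_par are illustrative and are not functions of ape); for three states it produces the same three matrices as printed in this section.

```r
# Build the parameter-index matrix of a k-state model.
# rate_index() and n_par() are illustrative helpers, not part of ape.
rate_index <- function(k, model = c("ER", "SYM", "ARD")) {
    model <- match.arg(model)
    m <- matrix(0L, k, k)
    off <- row(m) != col(m)               # off-diagonal cells
    if (model == "ER") {
        m[off] <- 1L                      # one rate shared by all changes
    } else if (model == "ARD") {
        m[off] <- seq_len(k * (k - 1))    # one rate per cell, column by column
    } else {                              # "SYM": forward and backward rates equal
        m[lower.tri(m)] <- seq_len(k * (k - 1) / 2)
        m <- m + t(m)                     # mirror onto the upper triangle
    }
    m
}

er3  <- rate_index(3, "ER")
sym3 <- rate_index(3, "SYM")
ard3 <- rate_index(3, "ARD")

# Number of free parameters = number of distinct non-zero indices
n_par <- function(m) length(unique(m[m > 0]))
```

Counting the distinct indices gives the number of rates to estimate (1, 3, and 6 here), which is also the difference in degrees of freedom needed when nested models are compared with a likelihood ratio test.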
We now fit the all-rates-different model:

> ace(x, tree.primates, type = "discrete", model = "ARD")
$loglik
[1] -1.602901

$rates
[1] 0.3059753 1.0892927

$se
[1] 0.3864119 1.2243605

$lik.anc
            1         2
-1 0.52689038 0.4731096
-2 0.11586096 0.8841390
-3 0.11724120 0.8827588
-4 0.04222992 0.9577701

$call
ace(x = x, phy = tree.primates, type = "discrete", model = "ARD")

Interestingly, the likelihoods on the root are quite affected by the model: the state of the root is now much less certain. For the other nodes, the likely state is still “small-eyes”. The increase in likelihood with the additional parameter is not significant:

> 1 - pchisq(2*(1.768921 - 1.602901), 1)
[1] 0.5644603

The genus Homo is sufficiently different from the other primate genera that it is not hard to find a discrete character that separates them. So we consider a character taking the value 1 in Homo,⁵ and 2 in the four other genera. We fit the above two models and examine how their assumptions affect the likelihood of ancestral estimates.

Footnote 5: This could be standing and moving upright, speaking complex languages, complex social structures, cooking food, writing poems, using computers to analyze phylogenies, and so on.

> y <- c(1, 2, 2, 2, 2)
> ace(y, tree.primates, type = "discrete")
$loglik
[1] -2.772593

$rates
[1] 14.59506

$se
[1] 386.9471

$lik.anc
           1         2
-1 0.5000000 0.5000000
-2 0.5000000 0.5000000
-3 0.4999997 0.5000003
-4 0.5000000 0.5000000

$call
ace(x = y, phy = tree.primates, type = "discrete")

> ace(y, tree.primates, type = "discrete", model = "ARD")
$loglik
[1] -1.808865

$rates
[1]  8.030159 32.120695

$se
[1] NaN NaN

$lik.anc
           1         2
-1 0.5000000 0.5000000
-2 0.5000000 0.5000000
-3 0.5000000 0.5000000
-4 0.5002041 0.4997959

$call
ace(x = y, phy = tree.primates, type = "discrete", model = "ARD")

The distribution of y leads to much uncertainty in the ancestral likelihoods, a fact well-known to the users of the parsimony-based methods. A more concrete application of ace with discrete characters is presented below with the Sylvia data.

6.3 Analysis of Diversification

The increasing availability of estimated phylogenies has led to a renewed interest in the study of macroevolution processes. For a long time, this issue was in the territory of paleontology. The fact that complete phylogenies become more numerous for more and more taxonomic groups has brought the biologists into the party.

The analysis of diversification is based on ultrametric trees with dated nodes. Most methods are based on a probabilistic model of speciation and extinction called the “birth–death” model [79]. This model assumes that there is an instantaneous speciation probability (denoted λ) and an instantaneous extinction probability (µ). There are variations and extensions to this basic model. The best known is the case µ = 0 (i.e., no extinction), which is called the Yule model. Nee et al. [104] suggested a generalization of the birth–death model where λ and µ vary through time. I suggested a model, called the Yule model with covariates, where λ varies with respect to species traits [115].

Because the birth–death model and its variants are probabilistic models, they can be fit to data by maximum likelihood. These models can be used both for parameter estimation and hypothesis testing. From a biological point of view, the main interest is the possibility of testing a variety of biological hypotheses depending on the fitted models.

Other approaches consider a graphical or statistical analysis of the distribution of the branching times without assuming an explicit model. These methods focus on hypothesis testing.

6.3.1 Graphical Methods

Phylogenetic trees can be used to depict changes in the number of species through time. This idea has been explored by Nee et al. [105] and Harvey et al. [63]. The lineages-through-time plot is very simple in its principle: it plots the number of lineages observed on a tree with respect to time. With a phylogeny estimated from recent species, this number is obviously increasing because no extinction can be observed. If diversification has been constant through time, and the numbers of lineages are plotted on a logarithmic scale, then a straight line is expected. If diversification rates decreased through time, then the observed plot is expected to lie above the straight line, whereas the opposite result is expected if diversification rates increased through time.

The interpretation of lineages-through-time plots is actually not straightforward because in applications with real data the shape of the observed curve rarely conforms to one of the three scenarios sketched above [33]. This graphical method is of limited value for testing hypotheses; in particular, its behavior is not known in the presence of heterogeneity in diversification parameters. However, it is an interesting exploratory tool given its very low computational cost.

There are three functions in ape for performing lineages-through-time plots: ltt.plot, ltt.lines, and mltt.plot. The first one does a simple plot taking a phylogeny as argument. By default, the x- and y-axes are labeled “Time” and “N”, but this can be changed with the options xlab and ylab,
respectively. This function has also a “dot-dot-dot” (...) argument (see p. 71 for an explanation of this argument) that can be used to format the plot (e.g., to alter the appearance of the line).

As an illustration, let us come back to the rodent tree displayed in Fig. 4.18. It is ultrametric and so can be analyzed with the present method. We simply display the plot twice, with the default options, and set the y-axis on a logarithmic scale (Fig. 6.5):

layout(matrix(1:2, 1, 2))
ltt.plot(trk)
ltt.plot(trk, log = "y")

Fig. 6.5. Lineages-through-time plot of the clock tree of Michaux et al. [100] (Fig. 4.18) with a logarithmic scale on the right-hand side

ltt.lines can be used to add a lineages-through-time plot on an existing graph (it is a low-level plotting command). It has only two arguments: an object of class "phylo" and the “dot-dot-dot” argument to specify the formatting of the new line (because by default, it is likely to look like the line already plotted). For instance, if we want to draw the lineages-through-time plots of both trees on Fig. 4.18, we could do:

ltt.plot(trk)
ltt.lines(trc, lty = 2)

mltt.plot is more sophisticated for plotting several lineages-through-time plots on the same graph. Its interface is:

mltt.plot(phy, ..., dcol = TRUE, dlty = FALSE,
          legend = TRUE, xlab = "Time", ylab = "N")
Note that the “dot-dot-dot” argument is not the last one; thus it does not have the same meaning as in the first two functions. Here, ‘...’ means “a series of objects of class "phylo"”. The options dcol and dlty specify whether the lines should be distinguished by their colors and/or their types (solid, dashed, dotted, etc.). To produce a graph without colors, one will need to invert the default values of these two options. The option legend indicates whether to draw a legend (the default). To compare the lineages-through-time plots of our two trees, we could do (Fig. 6.6):

mltt.plot(trk, trc, dcol = FALSE, dlty = TRUE)

Fig. 6.6. Multiple lineages-through-time plot of the clock tree of Michaux et al. [100] and the tree estimated from the nonparametric rate smoothing method (Fig. 4.18)

Note that the axes are set to represent both lines correctly, which may not be the case when using ltt.lines (although the axes may be set with xlim and ylim passed to ltt.plot with the “dot-dot-dot”). The advantage of the latter is that the lines may be customized at will, whereas this is done automatically by mltt.plot.

6.3.2 Birth–Death Models

Birth–death processes provide a simple way to model diversification. There are reasons to believe that these models do not correctly depict macroevolutionary processes [93], but they are useful for data analysis because there has been considerable work to derive probability functions related to these processes, making likelihood-based inference possible [79, 80].
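The coordinates of a lineages-through-time plot (Section 6.3.1) are easy to compute by hand from the branching times, which also makes the link with a constant-rate model visible. The sketch below uses a hypothetical vector of node ages rather than the rodent tree, and the function name ltt_coords is an illustrative choice, not an ape function.

```r
# Hypothetical node ages (time before present) of a 6-tip ultrametric,
# fully dichotomous tree
bt <- c(10, 7.5, 4, 3.2, 1)

ltt_coords <- function(bt) {
    bt <- sort(bt, decreasing = TRUE)      # oldest split first
    data.frame(time = -c(bt, 0),           # present = 0, the past is negative
               N = c(seq_along(bt) + 1, length(bt) + 1))
}

ltt <- ltt_coords(bt)   # two lineages after the root split, six at present

# Step plot on a log scale; drawn only in an interactive session
if (interactive())
    plot(ltt$time, ltt$N, type = "s", log = "y", xlab = "Time", ylab = "N")

# Slope of log(N) against time: a crude summary of the diversification rate
slope <- coef(lm(log(N) ~ time, data = ltt))[["time"]]
```

Under a constant-rate process the points fall roughly on a straight line on the log scale, and the slope gives a rough idea of the net diversification rate; this regression is only a descriptive summary, not a substitute for the likelihood-based estimates discussed next.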
The Simple Birth–Death Model

The estimation of speciation and extinction probabilities when all speciation and extinction events are observed through time is not problematic [78]. Some difficulties arise when only the recent species are observed. Nee et al. [104] derived maximum likelihood estimates of these parameters in this case. They used the following reparameterization: r = λ − µ, a = µ/λ. The estimates $\hat\lambda$ and $\hat\mu$ are then obtained by back-transformation. The function birthdeath implements this method: it takes as its single argument an object of class "phylo". Note that this tree must be dichotomous. If this is not the case, it could be transformed with multi2di (Section 3.4.3): this assumes that a series of speciation events occurred very rapidly. The results are returned as an object of class "birthdeath".

As an example, we come back to the 14-species rodent tree examined above with lineages-through-time plots:

> bd.trk <- birthdeath(trk)
> bd.trk

Estimation of Speciation and Extinction Rates
            With Birth-Death Models

     Phylogenetic tree: trk
        Number of tips: 14
              Deviance: 25.42547
        Log-likelihood: -12.71274
   Parameter estimates:
      d / b = 0   StdErr = 0
      b - d = 0.1438844   StdErr = 0.02939069
   (b: speciation rate, d: extinction rate)
   Profile likelihood 95% confidence intervals:
      d / b: [0, 0.5178809]
      b - d: [0.07706837, 0.2412832]

The standard errors of the parameter estimates are computed using the usual method based on the second derivatives of the likelihood function at its maximum. In addition, 95% confidence intervals of both parameters are computed using profile likelihood: they are particularly useful if the estimate of a is at the boundary of the parameter space (i.e., 0, which is often the case [117]).

birthdeath returns a list that allows us to extract the results if necessary. As an illustration of this, let us examine the question of how sensitive the above result could be to removing one species from the tree. The idea is simple: we drop one tip from the tree successively, and look at the estimated parameters (returned in the element para of the list). Instead of displaying the results directly we store them in a matrix called res. Each row of this matrix receives the result of one analysis:
> res <- matrix(NA, 14, 2)
> for (i in 1:14)
+     res[i, ] <- birthdeath(drop.tip(trk, i))$para
> res
      [,1]      [,2]
 [1,]    0 0.1354675
 [2,]    0 0.1354675
 [3,]    0 0.1361381
 [4,]    0 0.1369858
 [5,]    0 0.1376716
 [6,]    0 0.1439786
 [7,]    0 0.1410251
 [8,]    0 0.1410251
 [9,]    0 0.1421184
[10,]    0 0.1490515
[11,]    0 0.1318945
[12,]    0 0.1318945
[13,]    0 0.1361381
[14,]    0 0.1361381

This shows that the analysis is only slightly affected by the deletion of one species from the tree. With a larger tree, one could examine these results graphically, for instance, with a histogram (i.e., hist(res[, 2])).

Combining Phylogenetic and Taxonomic Data

It often occurs that a phylogeny is not complete in the sense that not all living species are included. This leads to some difficulties in the analysis of diversification because there are some obvious missing data. Pybus et al. [126] have approached this problem using simulations and randomization procedures. A more formal and general approach has been developed independently by Bokma [11] and myself [113]. The idea is to combine the information from phylogenetic data (branching times) and taxonomic data (species diversity). Formulae can be derived to calculate the probabilities of both kinds of observations, and because they depend on the same parameters (λ and µ) they can be combined into a single likelihood function.

The approach developed in [113] is implemented in the function bd.ext. Let us consider the phylogeny of bird orders (Fig. 4.13). The number of species in each order can be found in Sibley and Monroe [141]. These are entered by hand:

> data(bird.orders)
> S <- c(10, 47, 69, 214, 161, 17, 355, 51, 56, 10, 39, 152,
+        6, 143, 358, 103, 319, 23, 291, 313, 196, 1027, 5712)
> bd.ext(bird.orders, S)
Extended Version of the Birth-Death Models to
    Estimate Speciation and Extinction Rates

    Data: phylogenetic: bird.orders
             taxonomic: S
        Number of tips: 23
              Deviance: 289.1639
        Log-likelihood: -144.5820
   Parameter estimates:
      d / b = 0   StdErr = 0
      b - d = 0.2866789   StdErr = 0.007215592
   (b: speciation rate, d: extinction rate)

The output is fairly similar to the one from birthdeath. Note that it is possible to plot the log-likelihood function with respect to different values of a and r in order to derive profile likelihood confidence intervals of the parameter estimates (see [113] for examples).

The Yule Model with Covariates

The two applications of birth–death models above assume that speciation and extinction rates were constant through time. It is obvious that, biologically, this assumption must be relaxed because diversification has clearly fluctuated over time [142]. Nee et al. [104] suggested extending the simple birth–death model to include time-varying speciation and extinction rates, but this does not seem to have been implemented or further developed.

Another approach to this problem is to assume that these rates vary with respect to one or several species traits. This is appealing biologically because a major issue in biology is to identify the biological traits that lead to higher speciation and/or extinction rates [42, 73]. I proposed [115] to model speciation rates using a linear model written as

$$\ln\frac{\lambda_i}{1 - \lambda_i} = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \alpha, \qquad (6.15)$$

where $\lambda_i$ is the speciation rate for species i, $x_{i1}, x_{i2}, \ldots, x_{ip}$ are variables measured on species i, and $\beta_1, \beta_2, \ldots, \beta_p, \alpha$ are the parameters of the model. The function ln(x/(1 − x)) is called the logit function (it is used in logistic regression and GLMs): it allows the term on the left-hand side to vary between −∞ and +∞. The terms on the right-hand side must be interpreted in the same way as a usual linear regression model. Let us rewrite (6.15) in matrix form as $\mathrm{logit}(\lambda_i) = x_i^T \beta$. Giving some values of the vector β and of the traits $x_i$ it is possible to predict the value of the speciation rate with the inverse logit function:
$$\lambda_i = \frac{1}{1 + e^{-x_i^{T}\beta}} \qquad (6.16)$$

If we make the assumption that there is no extinction (µ = 0), then it is possible to derive a likelihood function to estimate the parameters of (6.15) given an observed phylogeny and values of x [115]. Because this uses a regression approach, different models can be compared with likelihood ratio tests in the usual way.

A critical assumption of this model is that the extinction rate is equal to zero. This is clearly unrealistic but it appeared that including extinction rates in the model made it too complex to permit parameter estimation [115]. Some simulations showed that the test of the hypothesis β = 0 is affected by the presence of extinctions but it keeps some statistical power (it can detect an effect when it is present; see [115] for details).

The function yule.cov fits the Yule model with covariates. It takes as arguments a phylogenetic tree, and a one-sided formula giving the predictors of the linear model (e.g., ~ a + b). The variables in the latter can be located in a data frame, in which case the option data must be used. They can be numeric vectors and/or factors: they are treated as continuous and discrete variables, respectively. The predictors must be provided for the tips and the nodes of the tree; for the latter they can be estimated with ace (Section 6.2). The results are simply displayed on the console.

To fit the null model (i.e., with constant speciation rate), one can use the function yule which fits the simple Yule model. It returns an object of class "yule". An application of these functions with the Felidae data is detailed below (Section 6.5.2).

6.3.3 Survival Models

The problem of missing species in phylogenies motivated some initial works on how to deal with this problem in the analysis of diversification. I suggested the use of continuous-time survival models for this purpose because they can handle missing data in the form of censored data [111]. Typical survival data are times to failure of individuals or objects [20]. It often occurs that some individuals are known to have been living until a certain time, but their exact failure times are unknown for various reasons (e.g., they left the study area, or the study ended before they failed or died). This is called censorship. The idea is to use this concept for missing species in phylogenies inasmuch as it is often possible to establish a minimum time of occurrence for them [111].

Using survival models to analyze diversification implies that speciation and extinction rates cannot be estimated separately. The estimated survival (or hazard) rate must be interpreted as a diversification rate [111]. It is denoted δ (= λ − µ). In theory a variety of models could be used, but only three are implemented in ape (see [111] for details):

• Model A assumes a constant diversification rate through time;
• Model B assumes that diversification changed through time according to a Weibull distribution with a parameter denoted β. If β > 1, then the diversification rate decreased through time; if β < 1, then the rate increased through time. If β = 1, then Model B reduces to Model A;
• Model C assumes that diversification changed with a breakpoint at time Tc.

These three models can be fit with the function diversi.time. This function takes as main arguments the values of the branching times (which can be computed beforehand, for instance, with branching.times).

As a simple example, we take the data on Ramphocelus analyzed in [111]. This genus of passerine birds includes eight species: six of them were studied by Hackett [59] who resolved their phylogenetic relationships. For the two remaining species, some approximate dates of branching could be inferred from data reported in [59]. We enter the data by hand in R:

> x <- c(0.8, 1, 1.15, 1.55, 2.3, 0.8, 0.8)
> indicator <- c(rep(1, 5), rep(0, 2))
> diversi.time(x, indicator)

Analysis of Diversification with Survival Models

Data: x
Number of branching times: 7
    accurately known: 5
    censored: 2

Model A: constant diversification
    log-likelihood = -7.594   AIC = 17.188
    delta = 0.595238   StdErr = 0.266199

Model B: diversification follows a Weibull law
    log-likelihood = -4.048   AIC = 12.096
    alpha = 0.631836   StdErr = 0.095854
    beta = 2.947881   StdErr = 0.927013

Model C: diversification changes with a breakpoint at time=1
    log-likelihood = -7.321   AIC = 18.643
    delta1 = 0.15625   StdErr = 0.15625
    delta2 = 0.4   StdErr = 0.2

Likelihood ratio tests:
    Model A vs. Model B: chi^2 = 7.092   df = 1,   P = 0.0077
    Model A vs. Model C: chi^2 = 0.545   df = 1,   P = 0.4604
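The AIC values and the likelihood ratio tests printed above can be reproduced by hand from the reported log-likelihoods. The sketch below is only an illustration in base R: it assumes 1, 2, and 2 free parameters for Models A, B, and C (read off the printed estimates), and the helper names aic and lrt are hypothetical, not part of ape. Because the inputs are rounded to three decimals, the last digit may differ from the output of diversi.time.

```r
# Log-likelihoods as printed by diversi.time() above (rounded values).
loglik <- c(A = -7.594, B = -4.048, C = -7.321)

# Number of free parameters, assumed from the printed estimates:
# A: delta; B: alpha and beta; C: delta1 and delta2.
npar <- c(A = 1, B = 2, C = 2)

# AIC = -2 log L + 2 k (hypothetical helper).
aic <- function(ll, k) -2 * ll + 2 * k

# Likelihood ratio test of a null model against a richer alternative
# (hypothetical helper): twice the difference in log-likelihoods.
lrt <- function(ll0, ll1, df) {
    chi2 <- 2 * (ll1 - ll0)
    c(chi2 = chi2, df = df, P = pchisq(chi2, df, lower.tail = FALSE))
}

AIC.values <- aic(loglik, npar)
AvsB <- lrt(loglik[["A"]], loglik[["B"]], npar[["B"]] - npar[["A"]])
AvsC <- lrt(loglik[["A"]], loglik[["C"]], npar[["C"]] - npar[["A"]])

print(round(AIC.values, 3))
print(round(AvsB, 4))
print(round(AvsC, 4))
```

Only Model B improves significantly on the constant-rate Model A, which agrees with its lower AIC.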
The results are simply printed on the screen. Note that here the branching times are scaled in million years ago (Ma), and thus the estimated parameters δ̂ (delta), α̂ (alpha), δ̂1 (delta1, value of δ after Tc in model C), and δ̂2 (delta2, value of δ before Tc in model C) must be interpreted with respect to this time scale. However, the estimate of β̂ (beta) and the values of the LRTs are scale independent.

6.3.4 Goodness-of-Fit Tests

As pointed out earlier in this chapter, the estimation of extinction rates is difficult with phylogenies of recent species because extinctions are not observed [114]. However, it is clear that extinctions affect the distribution of branching times of a given tree [63, 103]. An alternative approach to parametric models is to focus on this distribution and compare it to a theoretical one with statistical goodness-of-fit tests based on the empirical cumulative distribution function (ECDF) [146, 149]. These tests compare the ECDF of branching times to the distribution predicted under a given model. The null hypothesis is that the observed distribution comes from this theoretical one. A difficulty of these tests is that their distribution depends on the null hypothesis, and thus the critical values must be determined on a case-by-case basis.

The function diversi.gof implements the goodness-of-fit tests as applied to testing a model of diversification [112]. It takes as main argument a vector of branching times in the same way as diversi.time. The second argument (null) specifies the distribution under the null hypothesis: by default null = "exponential" meaning that it tests whether the branching times follow an exponential distribution. The other possible choice is null = "user" in which case the user must supply a theoretical distribution for the branching times in a third argument (z).

As an application we consider the same data on Ramphocelus as in the previous section:

> diversi.gof(x)

Tests of Constant Diversification Rates

Data: x
Number of branching times: 7
Null model: exponential

Cramer-von Mises test: W2 = 0.841   P < 0.01
Anderson-Darling test: A2 = 4.81   P < 0.01

Two tests are computed: the Cramér–von Mises test which considers all data points equally, and the Anderson–Darling test which gives more emphasis to the tails of the distribution [147]. The critical values
of both tests have been determined by Stephens [146]. If we want to consider only the five accurately known data points, the results are not changed:

> diversi.gof(x[indicator == 1])

Tests of Constant Diversification Rates

Data: x[indicator == 1]
Number of branching times: 5
Null model: exponential

Cramer-von Mises test: W2 = 0.578   P < 0.01
Anderson-Darling test: A2 = 3.433   P < 0.01

The results of these tests are scale independent.

Another goodness-of-fit test is the γ-statistic [125]. It is based on the internode intervals of a phylogeny: under the assumption that the clade diversified at constant rates, it follows a normal distribution with mean zero and standard deviation one. The γ-statistic can be calculated with the function gammaStat which takes as its only argument an object of class "phylo". The null hypothesis can be tested with a two-sided test:

2 * (1 - pnorm(abs(gammaStat(tr))))

6.3.5 Tree Shape and Indices of Diversification

The methods for analyzing diversification we have seen until now require knowledge of the branch lengths of the tree. Some researchers have investigated whether it is possible to get some information on diversification using only the topology of a phylogenetic tree (see [1, 3, 83] for reviews). Intuitively, we may expect unbalanced phylogenetic trees to result from differential diversification rates. On the other hand, different models of speciation predict different distributions of tree shapes.

apTreeshape implements statistical tests for two indices of tree shape: Sackin's and Colless's. Their formulae are:

$$I_S = \sum_{i=1}^{n} d_i, \qquad (6.17)$$

$$I_C = \sum_{j=1}^{n-1} |L_j - R_j|, \qquad (6.18)$$

where d_i is the number of branches between tip i and the root, and L_j and R_j are the number of tips descendant of the two subclades originating from node j. These indices have large values for unbalanced trees, and small values for
fully balanced trees. They can be calculated for a given tree with the functions sackin and colless. Two other functions, sackin.test and colless.test, compute the indices and test, using a Monte Carlo method, the hypothesis that the tree was generated under a specified model. Both functions have the same options:

colless.test(tree, model = "yule", alternative = "less", n.mc = 500)
sackin.test(tree, model = "yule", alternative = "less", n.mc = 500)

where tree is an object of class "treeshape", model gives the null model, alternative specifies whether to reject the null hypothesis for small (default) or large values (alternative = "greater") of the index, and n.mc gives the number of simulated trees to generate the null distribution. The two possible null models are the Yule model (the default), and the PDA (model = "pda"). These models are described on p. 45.

A more powerful test than those based on the above indices is the shape statistic, which is the likelihood ratio under both Yule and PDA models. This statistic has distinct distributions under both models, so it is possible to define a most powerful test (i.e., one with optimal probabilities of rejecting either hypothesis when it is false). This is implemented in the function likelihood.test. We generate a random tree with the Yule model, and then try the function:

> trs <- rtreeshape(1, model = "yule")
> likelihood.test(trs)
Test of the Yule hypothesis:
statistic = -1.207237
p.value = 0.2273407
alternative hypothesis: the tree does not fit the Yule model
Note: the p.value was computed according to a normal approximation
> likelihood.test(trs, model = "pda")
Test of the PDA hypothesis:
statistic = -3.280261
p.value = 0.001037112
alternative hypothesis: the tree does not fit the PDA model
Note: the p.value was computed according to a normal approximation

Aldous [2, 3] introduced a graphical method where, for each node, the number of descendants of both subclades from this node are plotted one versus the other with the largest one on the x-axis. The expected distribution of these points is different under the Yule and PDA models. The function aldous.test
makes this graphical analysis. Together with the points, the expected lines are drawn under these two models. The option xmin = 20 controls the scale of the x-axis: by default, the smallest clades are not represented which may be suitable for large trees. We do the Aldous test with the small tree simulated above (Fig. 6.7):

aldous.test(trs, xmin = 1)

The expected lines are labeled with the null models just above them. The leftmost line (by default in red) is a quantile regression on the points.

[Figure 6.7: x-axis "Size of parent clade (log scale)", y-axis "Size of smaller daughter clade (log scale)"; lines labeled "Yule model" and "PDA model".]
Fig. 6.7. Plot of the number of descendants of both subclades for each node of a tree with 50 tips simulated under a Yule model. The labeled lines are the expected distribution under these models, and the leftmost line is a quantile regression on the points

6.4 Perspectives

There is certainly much to expect from the study of evolutionary processes using phylogenies of recent species. Phylogenetic data are accumulating at a rapid pace, and we can hope that more focus on macroevolutionary issues will lead to insights into the mechanisms of biological evolution. The methods already implemented in R cover a wide range of issues. It is likely that developments will continue in the same direction to offer biologists a complete environment for data analysis. Future developments could also include methods not yet available in R such as biogeographical models [90].
6.5 Case Studies

6.5.1 Sylvia Warblers

We begin by reading in the Sylvia data if necessary. We first drop the outgroup species (Chamaea fasciata) for which we have no ecological data:

load("sylvia.RData")
tr <- read.tree("sylvia_nj_k80.tre")
tr <- drop.tip(tr, "Chamaea_fasciata")

We also sort the data frame of ecological data so that its rows are in the same order as the tip labels of the tree (see footnote 6):

DF <- sylvia.eco[tr$tip.label, ]

We focus on an analysis of the geographical range by trying to reconstruct the evolution of this character. Migratory behavior is tightly linked with geographical range:

> table(DF$geo.range, DF$mig.behav)

           long resid short
  temp        0     4     0
  temptrop    9     0     4
  trop        0     7     0

We can assume in a first step that evolutionary changes among the three states occur at the same rate. We fit a model with ace using the option type = "discrete" (which may be abbreviated with "d") and the default model (equal rates):

> syl.er <- ace(DF$geo.range, tr, type = "d")
> syl.er
$loglik
[1] -25.26805

$rates
[1] 136.5994

$se
[1] NaN
....

Footnote 6. Most functions in ape and ade4 do not need this because the tip labels and the row names are matched, but because here there are extra species in sylvia.eco, we do both operations at once.
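The reordering idiom discussed in footnote 6 relies on indexing a data frame by its row names. A minimal base R sketch, using a made-up data frame eco and a made-up label vector tips (both hypothetical stand-ins for sylvia.eco and tr$tip.label), suggests how one indexing operation both sorts the rows and drops the extra species:

```r
# Hypothetical ecological data: row names are species, in arbitrary order,
# with one species ("sp_out") that is absent from the tree.
eco <- data.frame(geo.range = c("trop", "temp", "temptrop", "temp"),
                  row.names = c("sp_C", "sp_A", "sp_out", "sp_B"))

# Hypothetical tip labels, in the order used by the tree.
tips <- c("sp_A", "sp_B", "sp_C")

# Indexing by name sorts the rows and drops "sp_out" at once;
# drop = FALSE keeps a data frame even with a single column.
DF <- eco[tips, , drop = FALSE]
print(DF)

# Sanity check: the rows are now in the same order as the tip labels.
stopifnot(identical(rownames(DF), tips))
```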
The fact that no standard error has been computed for the rate parameter indicates that the likelihood surface of this model is flat, and the latter poorly fits the data. We fit the symmetrical model where transition rates differ from one state to another but transitions between two given states have equal rates in both directions. We use the short-cut model = "SYM":

> syl.sym <- ace(DF$geo.range, tr, type = "d", model = "SYM")
> syl.sym
$loglik
[1] -21.71442

$rates
[1] 28.20588 -18.23412 97.49406

$se
[1] 21.39655 22.74213 98.51708
....

This model clearly fits better: this is not surprising because we added two parameters. We can compute the likelihood ratio test comparing the two models to test whether the increase in fit is significant (see footnote 7):

> 1 - pchisq(2*(syl.sym$loglik - syl.er$loglik), 2)
[1] 0.0286206

Footnote 7. A likelihood ratio test is computed as twice the difference in log-likelihoods, and follows a χ² distribution with the number of degrees of freedom given by the difference in number of parameters. The function pchisq gives the cumulative density function of the χ² distribution (i.e., Pr(x ≤ χ²)).

This is significant, but we may want to try a more parsimonious “custom” model where only the transitions temp ↔ temptrop ↔ trop are permitted. We define a symmetric matrix mod that is used as a model in ace:

> mod <- matrix(0, 3, 3)
> mod[2, 1] <- mod[1, 2] <- 1
> mod[2, 3] <- mod[3, 2] <- 2
> mod
     [,1] [,2] [,3]
[1,]    0    1    0
[2,]    1    0    2
[3,]    0    2    0

The rate matrix mod has two parameters: the first one for the transitions temp ↔ temptrop, and the second one for the transitions temptrop ↔ trop.

> syl.mod <- ace(DF$geo.range, tr, type = "d", model = mod)
> syl.mod
$loglik
[1] -24.29444

$rates
[1] 32.79765 98.11600

$se
[1] NaN NaN
....

This model does not fit better, so we stick to the symmetrical model.

How do we interpret the rates estimated by ace? We use the methodology described for substitution models to calculate a probability matrix from the rate matrix (Section 5.2.1). We first build the latter with the estimated rates that are arranged columnwise in the matrix:

> Q <- matrix(0, 3, 3)
> Q[1, 2] <- Q[2, 1] <- syl.sym$rates[1]
> Q[1, 3] <- Q[3, 1] <- syl.sym$rates[2]
> Q[2, 3] <- Q[3, 2] <- syl.sym$rates[3]
> Q
          [,1]     [,2]      [,3]
[1,]   0.00000 28.20588 -18.23411
[2,]  28.20588  0.00000  97.49406
[3,] -18.23411 97.49406   0.00000

We set the diagonal of the matrix so that the rows sum to zero (the command below will work if this diagonal is initially filled with zeros):

> diag(Q) <- -rowSums(Q)
> Q
           [,1]       [,2]      [,3]
[1,]  -9.971765   28.20588 -18.23411
[2,]  28.205880 -125.69994  97.49406
[3,] -18.234115   97.49406 -79.25994

The rate matrix is now ready and we can compute the probabilities for a given time. The latter must be relevant with respect to the estimated parameters (i.e., on the same scale as the original branch lengths); here we take t = 0.05:

> library(rmutil)
> P <- mexp(0.05*Q)
> rownames(P) <- c("temp", "temptrop", "trop")
> colnames(P) <- c("temp", "temptrop", "trop")
> round(P, 3)
          temp temptrop  trop
temp     0.792    0.187 0.022
temptrop 0.187    0.380 0.433
trop     0.022    0.433 0.545
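For a symmetric rate matrix such as Q, the matrix exponential can also be obtained from an eigendecomposition, without an additional package. The function expm.sym below is a hypothetical helper written for this illustration (it is not part of ape or rmutil), and it is valid only for symmetric matrices; the rates are those printed above.

```r
# Rates as printed for the symmetrical model above.
rates <- c(28.20588, -18.23412, 97.49406)

# Build the symmetric rate matrix and set the diagonal so rows sum to zero.
Q <- matrix(0, 3, 3)
Q[1, 2] <- Q[2, 1] <- rates[1]
Q[1, 3] <- Q[3, 1] <- rates[2]
Q[2, 3] <- Q[3, 2] <- rates[3]
diag(Q) <- -rowSums(Q)

# Hypothetical helper: exp(tm * Q) for a symmetric Q, using the
# decomposition Q = V D V' so that exp(tm * Q) = V exp(tm * D) V'.
expm.sym <- function(Q, tm) {
    e <- eigen(Q, symmetric = TRUE)
    e$vectors %*% diag(exp(tm * e$values)) %*% t(e$vectors)
}

P <- expm.sym(Q, 0.05)
dimnames(P) <- list(c("temp", "temptrop", "trop"),
                    c("temp", "temptrop", "trop"))
print(round(P, 3))
print(rowSums(P))  # each row of a probability matrix sums to one
```

Checking that the rows sum to one, and that a time of zero returns the identity matrix, is a cheap way to catch mistakes when building Q.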
These probabilities suggest that temperate-tropical is the most “unstable” state, and that most transitions occur between this state and the tropical one. Temperate species seem to evolve only from temperate-tropical ones.

We now plot the likelihoods of the ancestral characters on the tree together with the values observed for the species. We first create a vector of mode character to store the colors used for the symbols on the tips: black for temperate, white for tropical, and grey for temperate-tropical.

co <- rep("grey", 24)
co[DF$geo.range == "temp"] <- "black"
co[DF$geo.range == "trop"] <- "white"

We plot the tree as a cladogram to better display the information; the option label.offset is used to leave some space for the symbols. The latter are drawn with tiplabels: the symbols are colored with the vector co prepared above, and adj = 1 avoids the symbols overlapping with the tips of the tree. Finally, the likelihoods of the ancestral characters are added with nodelabels using the option thermo (Fig. 6.8):

plot(tr, "c", FALSE, no.margin = TRUE, label.offset = 1)
tiplabels(pch = 22, bg = co, cex = 2, adj = 1)
nodelabels(thermo = syl.sym$lik.anc,
           bg = c("black", "grey", "white"), cex = 0.8)

From this analysis we can infer that the ancestor of the genus Sylvia was, probably, a tropical bird. Because all tropical Sylvia are also resident, this genus probably evolved from a tropical resident species.

6.5.2 Phylogeny of the Felidae

We continue the analysis of the Felidae phylogeny by first reading back the tree in R:

tr <- read.tree("felid.chrono.tre")

We are interested here in the diversification parameters of this group. We first estimate the global speciation rate of this phylogeny by fitting a Yule model:

> yule(tr)
$lambda
[1] 0.2318725

$se
[1] 0.04036382

$loglik
[1] 7.349097
attr(,"class")
[1] "yule"

[Figure 6.8: cladogram with tip labels Sylvia atricapilla, Sylvia borin, Sylvia abyssinica, Sylvia melanothorax, Sylvia rueppelli, Sylvia melanocephala, Sylvia mystacea, Sylvia cantillans, Sylvia deserticola, Sylvia undata, Sylvia balearica, Sylvia conspicillata, Sylvia communis, Sylvia nisoria, Sylvia nana, Sylvia layardi, Sylvia boehmi, Sylvia buryi, Sylvia lugens, Sylvia hortensis, Sylvia leucomelaena, Sylvia crassirostris, Sylvia curruca, Sylvia subcaeruleum.]
Fig. 6.8. Ancestral estimates of geographical range for 24 species of Sylvia. The thermometers on the nodes show the relative likelihoods of the three states: temperate (black), temperate-tropical (grey), tropical (white). The states of the recent species are shown on the tips of the tree

The estimated speciation probability is quite high (λ̂ = 0.23 ± 0.08). We now try to fit the simple birth–death model:

> birthdeath(tr)

Estimation of Speciation and Extinction Rates
          with Birth-Death Models

     Phylogenetic tree: tr
        Number of tips: 35
              Deviance: -16.81068
        Log-likelihood: 8.405339
   Parameter estimates:
      d / b = 0.6280969   StdErr = 0.2103922
      b - d = 0.1340512   StdErr = 0.05509144
   (b: speciation rate, d: extinction rate)
   Profile likelihood 95% confidence intervals:
      d / b: [0.3227608, 0.7970532]
      b - d: [0.084925, 0.2029478]
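The two quantities reported by birthdeath, a = d/b and r = b − d, can be converted into separate speciation and extinction rates with the relations λ = r/(1 − a) and µ = λa used in the next paragraph. A minimal sketch with the estimates printed above (the variable names are chosen here for illustration only):

```r
a <- 0.6280969   # d / b, as printed by birthdeath()
r <- 0.1340512   # b - d, as printed by birthdeath()

lambda <- r / (1 - a)   # speciation rate
mu <- lambda * a        # extinction rate
print(round(c(lambda = lambda, mu = mu), 2))

# The back-transformed rates must reproduce the two estimates.
stopifnot(isTRUE(all.equal(mu / lambda, a)),
          isTRUE(all.equal(lambda - mu, r)))
```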
This is an interesting result because in most applications of the birth–death model without fossils the estimated extinction probability is usually zero, even when there are speciations [114]. The estimated parameters are â = 0.63 and r̂ = 0.13. By back-substitution using λ = r/(1 − a) and µ = λa, we obtain λ̂ = 0.36 and µ̂ = 0.23. We can compare the Yule model with the birth–death model with a likelihood ratio test because the latter has one additional parameter (µ):

> 1 - pchisq(2*(8.405339 - 7.349097), 1)
[1] 0.146102

This is not significant at the 0.05 level, so we cannot reject the null hypothesis that µ = 0, but we need to be very cautious about this result because the estimation of extinction rates is particularly difficult with phylogenies of recent species [114].

We now explore the possible impact of body mass on speciation rate. We first load the previously saved workspace with the data on body mass:

load("felid.RData")

We check that each species in our tree has data on body mass:

> IN <- tr$tip.label %in% names(felid.body.mass)
> tr$tip.label[!IN]
[1] "Prionailurus_rubiginosa" "Felis_catus"
[3] "Felis_libyca"

This is not the case as three species appear to have no data on body mass. An examination of the latter data shows that a mismatch is due to a different termination of the species name of the rusty-spotted cat:

> names(felid.body.mass)[36]
[1] "Prionailurus_rubiginosus"

Thus we simply change the name of this species, and give a body mass of 3500 g to both species of cats:

names(felid.body.mass)[36] <- "Prionailurus_rubiginosa"
x <- rep(3500, 2)
names(x) <- c("Felis_catus", "Felis_libyca")
felid.body.mass <- c(felid.body.mass, x)

As a final check before proceeding, we verify that all species in the tree have a body mass in our data:

> all(tr$tip.label %in% names(felid.body.mass))
[1] TRUE
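The check and the repair above follow a general pattern: find the labels without data, fix or add the corresponding entries, then check again. A self-contained sketch with made-up labels and masses (labels and mass are hypothetical, standing in for tr$tip.label and felid.body.mass) illustrates it in base R:

```r
# Hypothetical tip labels and a named vector of body masses (g).
labels <- c("Genus_alpha", "Genus_beta", "Genus_gamma")
mass <- c(Genus_alpha = 4200, Genus_betus = 3100)

# Which labels have no data? (same logic as the %in% test above)
missing.sp <- setdiff(labels, names(mass))
print(missing.sp)

# Repair a misspelled name, then add a value for the species still missing.
names(mass)[names(mass) == "Genus_betus"] <- "Genus_beta"
mass <- c(mass, Genus_gamma = 3500)

# Final check, then reorder the data in the order of the labels.
stopifnot(all(labels %in% names(mass)))
mass <- mass[labels]
print(mass)
```

Matching the names by value rather than by position (index 36 in the text) is a design choice made here so that the sketch does not depend on the order of the data.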
We can now assess the effect of body mass on speciation rate of Felidae. We must first estimate the ancestral values of this variable using ace. The function yule.cov is sensitive to the distribution of the predictors: if they are too skewed the fitting procedure is likely to fail [115]. Consequently, we log-transform body mass and center the variable:

> range(felid.body.mass)
[1]   1300 433200
> X <- scale(log(felid.body.mass[tr$tip.label]), scale = FALSE)
> range(X)
[1] -1.930336  2.898372

The option scale = FALSE prevents data scaling (only centering is done). We also have sorted the data in the same order as in the tree (which is required by yule.cov). We then estimate the ancestral body mass with ace using the default maximum likelihood method:

X.node <- ace(X, tr)$ace

These values must be sorted according to the node numbers of the tree, which is done by ace. We can now feed the data to yule.cov:

> yule.cov(tr, ~ c(X, X.node))
---- Yule Model with Covariates ----

    Phylogenetic tree: tr
       Number of tips: 35
      Number of nodes: 34
             Deviance: -15.25978
       Log-likelihood: 7.629888

  Parameter estimates:
                Estimate    StdErr
(Intercept)   -1.1870642 0.1612194
c(X, X.node)  -0.1615685 0.1539165

The increase in log-likelihood is very small compared to the Yule model so it is not necessary to compute the P-value. In spite of this, we find a slight negative effect of body mass on speciation rate meaning that the smaller species tend to speciate more rapidly. An easier way to interpret this result is to use the inverse logit-transformation (6.16), and plot the calculated values of λ with respect to the predictor. In the present case, the predictor varies between −1.93 and 2.90, so we create a sequence between −2 and 3 (with a reasonable increment to smooth the plot) to cover the observed variation:

> x <- seq(-2, 3, 0.05)
We compute the corresponding predicted value of λ:

lambda <- 1/(1 + exp(-(-0.1615685 * x + -1.1870642)))

We could simply make the plot with plot(x, lambda), but we can make it more informative by transforming the scale of the x-axis so that it is similar to the scale of the original body mass data: this implies adding the mean of the log-transformed body mass (the inverse of centering), and then taking the exponential (the inverse of the logarithmic transformation). We do the plot with type = "l" to draw a curve, and we use rug to plot on the x-axis the observed values of body mass (Fig. 6.9):

ox <- exp(x + mean(log(felid.body.mass[tr$tip.label])))
plot(ox, lambda, type = "l", xlab = "Body mass (g)",
     ylab = expression("Predicted " * lambda))
rug(felid.body.mass[tr$tip.label])

[Figure 6.9: x-axis "Body mass (g)", y-axis "Predicted λ".]
Fig. 6.9. Predicted variation in speciation rate (λ) with respect to body mass for the Felidae

The function expression allows us to write special characters on a plot. We should keep in mind that the depicted relationship is not statistically significant (see [115] for an example of significant effects with primates).

6.6 Exercises

1. Simulate for 99 time-steps two independent Brownian motion models with the same initial values. These variables should be taken as two species
that have diverged after t = 1, and they should be stored in a two-column matrix.
   (a) Simulate the divergence of each species in two daughter-species at t = 100 under the same model for 100 time-steps: the results should be stored in a four-column matrix. Plot the whole evolution for the 200 time-steps on a single graph.
   (b) Repeat (a) but using an Ornstein–Uhlenbeck model with α = 0.2, θ1 = −1 for the first pair of species, and θ2 = 1 for the second one.
   (c) Repeat (b) with θ1 = −20 and θ2 = 20. Compare the results.

2. Calculate the expected values of the Brownian motion and the Ornstein–Uhlenbeck models after 100 time-steps. Compare with the observed values from the simulations above.

3. Implement Desdevises et al.'s [24] method in R (see p. 144).

4. Consider the phylogeny estimated for the Felidae (Section 5.5.2). Compute the phylogenetically independent contrasts for body mass using the following branch lengths:
   • The maximum likelihood estimates from PHYML (Fig. 5.5);
   • From the chronogram estimated by NPRS (Fig. 5.6);
   • Setting the node heights so that they are equal to the number of descendants (see compute.brlen);
   • All equal to one.
   Compare the results and comment on the assumptions underlying the use of each set of branch lengths.

5. Consider the neighbor-joining tree estimated for the genus Sylvia and the associated bootstrap values.
   (a) Compute the phylogenetically independent contrasts for the continuous variable (migratory distance, mig.dist) in the ecological dataset.
   (b) We want to give more importance in the analysis to the contrasts associated with the nodes that are well supported by the bootstrap analysis. Propose a solution.
   (c) Compare the two sets of contrasts.

6. Analyze the diversification pattern from the phylogeny estimated in Exercise 5 of Chapter 5.
7 Developing and Implementing Phylogenetic Methods in R

We have seen several times in this book that it is not necessary to know R in depth to use it for data analysis, even to tackle complex analyses. On the other hand, we need to know more of the language and R's features to develop and implement methods with it.

The materials in this chapter are not a formal introduction to R, but highlight some useful points in the present context. The primary references are the manuals distributed with R (located in the directory RHOME/doc/manual/) and available on CRAN (http://cran.r-project.org/manuals.html). This chapter essentially uses materials from Writing R Extensions [129] and the R Language Definition [128].

7.1 Features of R

R is a language that is qualified as a dialect of S, a language for statistics [8]. The syntax of both languages is essentially identical, but their implementations differ. This implies that programs written in S will not necessarily run under R, but compatibility is very large. For a brief comparison of R and S, one can see the R-FAQ available both on CRAN (http://cran.r-project.org/faqs.html), and distributed with R (RHOME/FAQ).

R is an interpreted language: all commands are read by a parser, then interpreted, and, if syntactically correct, executed. There are different ways to enter commands in R: they can be typed directly at R's prompt (in a console or a terminal), or read from a file with the function source.

7.1.1 Object-Orientation

R is an object-oriented language. Object-orientation is often seen as a complex mechanism in computer programming (e.g., C++ is often cited as being more
complex than C). In R, however, this feature is not as complex as in Java or in C++, and considerably simplifies things.

We have seen the use of generic functions several times in the previous chapters. Let us now see some details. A generic function is named after its main use: print, summary, plot, and so on. All these functions have similar content, for instance (see footnote 3):

> print
function (x, ...)
UseMethod("print")

Footnote 3. It may be useful to recall that typing the name of an object results in printing its content; thus typing the name of a function, without the parentheses, prints its content.

Consider an object x of class "cls", then print(x) is equivalent to print.cls(x). The function print.cls (as well as any function print.*) is called a method. If the method of a particular class does not exist, then the generic uses the default method (for instance, if print.cls does not exist, print(x) uses print.default(x)).

A nice example of the use of generics/methods is when plotting an object. Suppose x is a numeric vector (say, 1, 2, 3, ...), then the command plot(x) will do a simple plot of the values of x. But if x is a phylogenetic tree (e.g., an object of class "phylo"), we do not want this! Because the function plot.phylo is defined in the package ape, plot(x) will correctly plot the tree (Chapter 4).

A method is written in exactly the same way as another function: only its name must follow the rule generic.class where generic is the name of the generic, and class is the name of the class. A method must have, at least, all the arguments of the generic, with the same names and in the same order. If the generic function has a “dot-dot-dot” argument (which is often the case), this is almost always the last one. For instance, consider the function all.equal that compares two objects taking some approximations into account. The generic is:

> all.equal
function (target, current, ...)
UseMethod("all.equal")

The method that does this comparison for two objects of class "phylo" is, of course, called all.equal.phylo, and its first few lines are:

> all.equal.phylo
function (target, current, ...)
{
    ### commands to compare two objects of class "phylo"
    ...

A method is used practically as its generic is, but it is possible to force the use of a particular method. For instance, because an object of class "phylo" is a list, it is possible to compare two of these objects with all.equal.list(tr1, tr2) (which is done internally by all.equal.phylo).

7.1.2 Variable Definition and Scope

In R, it is not necessary to declare the variables and objects used within a function (in contrast to languages such as C or Fortran). For instance, an expression like x <- 1 creates the vector x and sets its attributes accordingly; if x already exists then it is erased beforehand. On the other hand, for an expression like y <- x, x must already exist.

When writing a computer program (whatever the language), it is often necessary to decide whether a variable is local (used only within a function) or global (can be used by several functions in the program). In R, because the declaration of variables is implicit, a rule is needed. This rule is called lexical scoping. To understand this mechanism, let us consider the very simple function:

> foo <- function() print(x)
> x <- 1
> foo()
[1] 1

Because no variable named x has been created within foo, R will seek in the enclosing environment if there is an object called x, and will print its value (otherwise, an error message is displayed, and the execution is stopped).

If an object x is created within our function, the value of x in the global environment is not changed.

> x <- 1
> foo2 <- function() {
+     x <- 2
+     print(x)
+ }
> foo2()
[1] 2
> print(x)
[1] 1

Now print(x) uses the object x that is defined within its environment, that is, the environment of foo2.

The word enclosing above is important. In our two example functions, there are two environments: the global one and the one of the function foo
or foo2. If there are three or more nested environments, the search for the objects is made progressively from a given environment to the enclosing one, and so on, up to the global one.

7.1.3 How R Works

All the actions of R are done on objects stored in the active memory of the computer: no temporary files are used (Fig. 7.1). Files on the disk are read and written for input and output of data and results (graphics, etc.). The user executes the functions via some commands. The results are displayed directly on the screen, stored in an object, or written on the disk (particularly for graphics). Because the results are themselves objects, they can be considered as data and analyzed as such. Data files can be read on the local disk or on a remote server through the Internet.

[Figure 7.1: diagram with the elements "Hard disk", "Active memory (RAM)", "Internet", ""Data" objects (vectors, lists, ...)", "Functions and operators", "commands", "../library/..", and "EPS".]
Fig. 7.1. A schematic view of how R works

The functions available to the user are stored in a directory called RHOME/library (RHOME is the directory where R is installed). This directory contains packages of functions, which are themselves structured in
directories. The package named base is in a way the core of R and contains the basic functions of the language for reading, manipulating, and writing data.

7.2 Writing Functions in R

How to write functions follows to a large extent from what has been said in the previous sections. Quite logically, a function is defined with the function function which takes as arguments the variable(s) that will be used locally within the function when it is called. R functions are objects, and the result of the function function can be assigned in the same way as other objects (the examples below are purely didactic):

> f <- function(x) print(mode(x))
> f
function(x) print(mode(x))
> f(1)
[1] "numeric"
> f(TRUE)
[1] "logical"
> f("a")
[1] "character"

In this example, the object x is local to the function and if an object called x exists in the workspace, it will not be used:

> x <- FALSE
> print(mode(x))
[1] "logical"
> f(x = 1)
[1] "numeric"

Note that we used the tagged argument in the last call to emphasize this point.

Default arguments (often called options) are set by preassigning them in the function definition:

> fb <- function(x, prefix = "Mode:")
+     print(paste(prefix, mode(x)))
> fb(1)
[1] "Mode: numeric"
> fb(1, "")
[1] " numeric"
> fb(1, "The mode is")
[1] "The mode is numeric"
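Because functions are ordinary objects, they can also be stored in a list or passed as arguments to other functions. The following base R sketch (the names describe and funs are chosen for this illustration only) combines this with a default argument:

```r
# A function with a default argument; it returns a string instead of
# printing it, so that the result can be reused.
describe <- function(x, prefix = "Mode:") paste(prefix, mode(x))

# Functions are objects: they can be stored in a list ...
funs <- list(mode = mode, length = length, describe = describe)

# ... and passed as arguments to other functions such as sapply().
res <- sapply(list(1, TRUE, "a"), describe)
print(res)

# Calling every stored function on the same object.
out <- lapply(funs, function(f) f(c(2.5, 3)))
print(out)
```

Returning a value rather than calling print inside the function is a design choice made here: it lets the caller decide what to do with the result.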
Quite often, default arguments are logicals to control what is computed by the function. For instance, if we want a function that calculates the mean of a sample with the possibility of removing all negative values, we can control this with a logical argument whose default value will be FALSE:

> foo <- function(x, rm.negative = FALSE) {
+     if (rm.negative) print(mean(x[x >= 0]))
+     else print(mean(x))
+ }
> y <- rnorm(100)
> foo(y)
[1] 0.04609175
> foo(y, TRUE)
[1] 0.751289

To be executed, a function must be loaded in memory, and this can be done in several ways. The commands of a function can be typed directly on the keyboard, as with any other command, or copied and pasted from an editor. If the function has been written in a text file, it can be loaded with source like another program; a single file can contain several functions. Similarly, functions can be saved in an ‘.RData’ file, as with any R objects, and loaded in memory with load. Finally, it is possible to create a package: this is discussed in Section 7.4.

To load some functions, packages, or data in memory when R is started, the best option is to configure the file ‘.Rprofile’. This file, if it exists, is read by R at start-up: it must be located in the HOME directory of the user. This file is user dependent, so that if a computer is shared by several users, they may have different ‘.Rprofile’ files. The path to the HOME directory can be printed in R with the command:

> Sys.getenv("HOME")
           HOME
"/home/paradis"

This directory should not be confused with the RHOME directory which is the place where R is installed, and is unique to a computer. Here is an example on a Linux system:

> Sys.getenv("R_HOME")
     R_HOME
"/usr/lib/R"

The contents of ‘.Rprofile’ are normal R commands, and comments can be included as well. This is normally the place where you will customize R by modifying the options. The list and meanings of these options are explained in ?options. Here is an example:

options(width = 60) # narrower output on the screen
options(editor = "emacs") # the default on Linux is vi...
options(show.signif.stars = FALSE) # avoid the Milky Way
library(ade4)
library(ape)
library(seqinr)
load("/home/paradis/data/always_load_this.RData")
source("/home/paradis/data/always_source_this.R")

7.3 Interfacing R with Other Languages

Phylogenetic methods are often computationally intensive, and thus phylogenetic programs are mostly written in low-level languages (mainly C or C++). These programs need to be compiled (in contrast to programs in interpreted languages such as R) to be used. However, and this is completely transparent to the user, R uses compiled programs too: most computational tasks in R are made by compiled C or Fortran programs.

R has several mechanisms to interface compiled programs with its interpreter (the CLI we have seen through this book). At least three benefits can be found in using these interfaces when implementing a phylogenetic method in R.

• The performance of an R program can be greatly improved when the computationally demanding part is done with compiled codes (see an example below);
• The R application programmer interface (API) can be used making available many C functions useful in computational statistics (mathematical, matrix calculus, probability distribution, optimization functions, and so on);
• Existing programs in C or C++ can be ported to R.

The cost is that one has to learn these interfaces, but this is relatively easy, and outlined in this section.

7.3.1 Simple Interfaces

The R function .C gives the way to call a C function from R using a simple interface that matches the arguments in C. The latter must be pointers. An example could be:

void fcn(int *arg1, double *arg2, char **arg3)
{
    ...
}
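The argument types in the C prototype above (int * and double *) are the reason for explicit coercion on the R side: R reports a single numeric mode, but the underlying storage type can be integer or double. A short base R check, using only standard functions, makes the difference visible:

```r
x <- 1      # a numeric constant is stored as double by default
i <- 1L     # the suffix L requests integer storage

print(c(mode(x), mode(i)))       # both are of mode "numeric"
print(c(typeof(x), typeof(i)))   # but the storage types differ

# as.integer() and as.double() return the storage type expected by C.
xi <- as.integer(c(1, 2, 3))
xd <- as.double(1:3)
print(c(typeof(xi), typeof(xd)))
```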
The code in this function can be any C code, and can call other functions. fcn can be called from R with:

.C("fcn", as.integer(i), as.double(x), as.character(b),
   PACKAGE = "pkg")

It is necessary that the data types be checked before passing the variables to the C code: this explains the distinction between integers and doubles here. R does not distinguish these two data types, so there is a single numeric mode (Section 2.2.1). On the other hand, C has different data types for integers and reals, hence the conversion when passing data from R to C. "pkg" is the name of the R package where fcn can be found.

To be able to use fcn from R, this C function must be compiled and loaded into R. The compilation is done so as to produce a library file (‘*.dll’ under Windows, or ‘*.so’ for the other operating systems). The library is loaded with the function library.dynam. Usually, it is easier to build a small package where the needed codes are included (Section 7.4).

In practice, .C is not called directly by the user but it is included in an R function, for example,

fcn <- function(i, x, b)
{
    .C("fcn", as.integer(i), as.double(x), as.character(b),
       PACKAGE = "pkg")
}

so that the user does not see whether the function calls a compiled code:

fcn(i, x, b)

Programs written in C++ are called in a way similar to C from R, but in the C++ code a wrapper must be written:

// X_main.cc:
#include ...
extern "C" {
void X_main()
{
    ...
}
} // extern "C"

Such a program must be compiled with a C++ compiler.

7.3.2 Complex Interfaces

We have seen that with .C, only simple data types can be passed to the C code. This may be problematic if one wants to manipulate R objects that have
a complex structure, such as lists, and for which the number of elements is not known a priori. In this situation, the function .Call can be used. Its use, from the R side, is simpler than .C:

    .Call("fcn", a, b)

There is no data type checking here: this is done in the C program. The structure of the latter is more complex, and makes use of the data type SEXP (S expression):

    SEXP fcn(SEXP a, SEXP b) {...}

All the details on how to handle SEXP data in C are explained in [129].

There is an even more complex mechanism with the function .External, which can be used with an a priori unknown number of arguments. It is used in a similar way in R:

    .External("fcn", a, b)

But in C there is only one argument:

    SEXP fcn(SEXP args) {...}

The elements passed with args may be extracted sequentially with special functions:

    ...
    first = CADR(args);
    second = CADDR(args);
    third = CADDDR(args);
    fourth = CAD4R(args);
    ...

The sources of ape and ade4 provide some examples of the use of .C and .Call with phylogenetic data, and those of seqinr of the use of .Call with sequence data.

7.4 Writing R Packages

All the details of writing an R package are explained in a clear way in [129]. We show here only how to make a minimal package that could be used to port some C code to R.

A good way to write an R package is to first compile and install the R and C code so that it can be tested. If this is successful and the developer wants to publish the package, then the next stage is to write the documentation.

7.4.1 A Minimalist Package

A package may contain only R code, in which case it is straightforward to make and install. We consider cases where some code needs to be compiled. Suppose we have written the R and C functions, and they are collected in files named accordingly (‘*.R’ and ‘*.c’). Then we need to create two other files: ‘DESCRIPTION’ and ‘zzz.R’. The files must be arranged in the following directories.

    /pkg/DESCRIPTION
    /pkg/R/*.R
    /pkg/src/*.c

The file ‘DESCRIPTION’ contains some general information on the package. It must contain at least the following fields.

    Package: pkg
    Version: 0.1
    Date: 2005-12-25
    Title: PKG
    Author: John Marillion
    Maintainer: John Marillion
    Description: This is a minimalist install for pkg.
    License: GPL version 2 or newer

This file must eventually be more detailed if there are dependencies on other packages or libraries.

The file ‘zzz.R’ is necessary if there is compiled code. Its content is:

    .First.lib <- function(lib, pkg) {
        library.dynam("pkg", pkg, lib)
    }

where "pkg" should be replaced by the quoted name of the package, but pkg should be left unchanged; for instance, for ape this is library.dynam("ape", pkg, lib). The function .First.lib is executed when the package is loaded with library(pkg).

Once the files and directories have been prepared, pkg can be installed with the command (from a shell):

    R CMD INSTALL pkg

The package may then be used in R.

7.4.2 The Documentation System

Every function written in R must be documented when it is distributed in a package. This is not necessary for the installation. There is a single documentation format called Rd that is processed during the installation to create help pages in simple text (read with ?), HTML, and PDF. Once the help pages have been prepared and put in a directory /pkg/man, it is possible to check the package with:

    R CMD check pkg

7.5 Performance Issues and Strategies

From all we have seen in this book, it appears that we often have a choice among several possibilities for the same task. This is common in computer programming, where different algorithms can be used to do the same operation. Here, we also have a choice among different computer languages that can be interfaced with one another. Roughly, there are three strategies when implementing a method in R: use only R code; interface C and/or C++ code with R using the simple interface function .C; or do the same but with the complex interface functions .Call and/or .External. These three strategies are detailed in Table 7.1 with their gains and costs. Although more costs are listed for the “R + C” strategies, this actually reveals a contrast between simplicity and performance. Interfacing C programs with R will almost always result in a significant increase in performance at the cost of more complex programming.

Table 7.1. Comparative gains and costs of different strategies when implementing a computational method in R

    Pure R
        Gains: Easily programmed. Programs can be tested directly. Programs can be shared directly among operating systems. Performance can be very good. Bugs are easily fixed.
        Costs: Performance can be poor if vectorization cannot be achieved.
    .C
        Gains: C and C++ programs can be ported to R. C functions already programmed in R can be used. Performance is generally greatly improved.
        Costs: Programs need to be compiled to be tested. Compilation is system dependent. Bugs are more difficult to find than in R. Only simple R data types (vectors) can be passed to C.
    .Call
        Gains: Same as .C. Complex R objects (e.g., lists) can be passed to C.
        Costs: Same as .C, except the last point. Need to learn the R macros to manipulate R objects in C.
    .External
        Gains: Same as .Call. The number of objects passed to C may vary.
        Costs: Same as .Call.

To give an idea of the gain in performance that could result from transferring a computation done in R to C, we can consider a concrete example from ape. When plotting a tree, the function plot.phylo computes the coordinates of the nodes and tips in the graph, and then draws the appropriate lines. Originally, all computations were done only in R code. One of the functions involved returned the distance from the root to each node and tip using edge lengths:

    node.depth.edgelength <- function(x, el)
    ### Input: the matrix ‘edge’ of an object of class
    ### "phylo", and the corresponding vector ‘edge.length’.
    {
        tmp <- as.numeric(x)
        nb.tip <- max(tmp)
        nb.node <- -min(tmp)
        xx <- as.numeric(rep(NA, nb.tip + nb.node))
        names(xx) <- as.character(c(-(1:nb.node), 1:nb.tip))
        xx["-1"] <- 0
        for (i in 2:length(xx)) {
            nod <- names(xx[i])
            ind <- which(x[, 2] == nod)
            base <- x[ind, 1]
            xx[i] <- xx[base] + el[ind]
        }
        xx
    }

From version 1.4 of ape, this function has been replaced by a small C program called from R:

    void node_depth_edgelength(int *ntip, int *nnode, int *edge1,
                               int *edge2, int *nms,
                               double *edge_length, double *xx)
    {
        int i, j, k;
        for (i = 1; i < *ntip + *nnode; i++) {
            j = 0;
            /* find the edge whose descendant is the i-th node or tip */
            while (edge2[j] != nms[i]) j++;
            /* k is the position in xx of the ancestor of that edge */
            if (edge1[j] < 0) k = -edge1[j] - 1;
            else k = *nnode + edge1[j] - 1; /* nnode is a pointer and must be dereferenced */
            xx[i] = xx[k] + edge_length[j];
        }
    }

which is called from R with:

    .C("node_depth_edgelength", as.integer(nb.tip),
       as.integer(nb.node), as.integer(x$edge[, 1]),
       as.integer(x$edge[, 2]), as.integer(nms),
       as.double(x$edge.length),
       as.double(numeric(nb.tip + nb.node)),
       DUP = FALSE, PACKAGE = "ape")[[7]]

Although the C program is slightly shorter than its R version, the way arguments are passed is more complex and needs more caution. It is possible to compare the performance of both approaches (Table 7.2).

Table 7.2. Comparative speed (in seconds) of two programs performing the same task on phylogenetic trees with n tips (times measured with the function system.time)

    n         Pure R    R + C
    100         0.04    <0.01
    1000        2.19    <0.01
    2000        6.62     0.01
    5000       38.63     0.04
    10,000    185.13     0.15

Two comments arise from this comparison. First, a program written in pure R can be very fast with small data sets: 0.04 s is actually negligible. In practice, a tree with more than 500 tips is not readable when plotted directly on the screen. The second comment is that with large data sets the gain in speed is critical, and this should be considered when developing computationally intensive methods.

A critical issue in R programming is vectorization. This means that repeated calls to compiled code by the interpreter are avoided. For instance, when generating random variables, the number of independent replicates, say 100, is passed as an argument, so the compiled code is called only once, which is more efficient than calling it 100 times. To fix ideas, we can use a trivial example consisting of the sum of many numbers. Say we generate 1,000,000
normal random variables with mean zero and variance unity, and we want to compute their sum. Ignoring the (vectorized) function sum, a possible solution could be:

    x <- rnorm(1e6)
    s <- 0
    for (i in 1:1e6) s <- s + x[i]

The time needed to perform the for loop that does the summation is 2.5 s. Of course, a beginner with R quickly learns that there is the function sum and will never do the above: sum(x) actually takes 0.01 s. The use of vectorization may be less obvious. Suppose we want to sum only the negative values of x; the most intuitive approach may be to use an if statement such as:

    s <- 0
    for (i in 1:1e6) if (x[i] < 0) s <- s + x[i]

This takes 3.5 s to complete. A vectorized version is possible with logical indexing:

    sum(x[x < 0])

The computation time is now 0.12 s. To do the same task with dedicated compiled C code, we can write the following function,

    #include ...
    void sum_neg(double *x, int *n, double *sum)
    {
        int i;

        *sum = 0;
        for (i = 0; i < *n; i++) {
            if (x[i] < 0) *sum += x[i];
        }
    }

and call it (after compilation) from R with the function:

    sumneg <- function(x)
    {
        sumneg <- 0
        ans <- .C("sum_neg", as.double(x), as.integer(length(x)),
                  as.double(sumneg), PACKAGE = "apex")
        ans[[3]]
    }
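The two pure-R versions timed above are easy to compare on one's own machine. The following is only a sketch: the function names loop_sum_neg and vec_sum_neg and the seed are illustrative choices, not part of the original text, and the timings will differ from those quoted here depending on the machine and the version of R, so no figures are given.

```r
set.seed(1)                 # arbitrary seed, only for reproducibility
x <- rnorm(1e6)

## Explicit loop with an if statement (one interpreted step per element)
loop_sum_neg <- function(x) {
    s <- 0
    for (i in seq_along(x))
        if (x[i] < 0) s <- s + x[i]
    s
}

## Vectorized version with logical indexing
vec_sum_neg <- function(x) sum(x[x < 0])

t.loop <- system.time(s.loop <- loop_sum_neg(x))
t.vec  <- system.time(s.vec  <- vec_sum_neg(x))

cat("loop:      ", t.loop[["elapsed"]], "s\n")
cat("vectorized:", t.vec[["elapsed"]], "s\n")

## Both approaches must agree up to floating-point rounding
stopifnot(isTRUE(all.equal(s.loop, s.vec)))
```

Checking that the two results agree before comparing speeds is a useful habit: a faster version is only worth keeping if it returns the same answer.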
The time needed to complete sumneg(x) is 0.09 s. The gain will obviously be even smaller with a smaller data set. This shows clearly that writing compiled code may not always be advantageous with R.

The crucial point, in terms of performance, is thus whether vectorization can be achieved in an R program. We have seen above an example where C code was used to manipulate objects of class "phylo". This is a case where vectorization cannot be done easily because we need to manipulate the elements in a complex way, so that we need repeated loops and if statements.

However, vectorization can be achieved in some cases with objects of class "phylo". The functions birthdeath, yule, or yule.cov provide some examples. For instance, the speciation rate estimator under the Yule model is $\hat{\lambda} = B_T / X_T$, where $B_T$ is the number of observed branching events during time $T$, and $X_T$ is the sum of all branch lengths during the same time [78]. This estimator can be computed for a tree, say tr, relatively easily:

    -min(as.numeric(tr$edge)) / sum(tr$edge.length)

This assumes that the nodes are numbered with negative numbers, so that the smallest one gives, once its sign is reversed, the number of nodes. The branch lengths are stored in a single numeric vector, thus the second term is easily computed.

A strategy often used by R developers is to first develop the program in pure R. When it is stable and some “computational bottlenecks” have been identified, some tasks can be transferred to C programs. A mixed strategy is to keep the most complex data manipulation (e.g., involving lists, names, etc.) in R, and to use compiled code to do computations on vectors: this is the strategy used in plot.phylo.
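To make the Yule estimator above concrete, here is a minimal sketch on a hand-built tree with four tips. The tree, its branch lengths, and the names n.nodes and lambda.hat are illustrative assumptions, not taken from the text. The edge matrix follows the negative node numbering that the expression above relies on (root numbered -1, tips numbered with positive integers), which may not match the numbering used by current versions of ape.

```r
## Hypothetical 4-tip tree: internal nodes -1 (root), -2, -3; tips 1..4.
## Each row of 'edge' is (ancestor, descendant).
edge <- matrix(c(-1, -2,
                 -2,  1,
                 -2,  2,
                 -1, -3,
                 -3,  3,
                 -3,  4), ncol = 2, byrow = TRUE)
edge.length <- c(1, 1, 1, 0.5, 1.5, 1.5)   # one length per row of 'edge'
tr <- list(edge = edge, edge.length = edge.length)

## Number of nodes: the smallest (most negative) label, with its sign reversed
n.nodes <- -min(as.numeric(tr$edge))

## Yule estimator: number of nodes divided by the sum of branch lengths
lambda.hat <- n.nodes / sum(tr$edge.length)

n.nodes       # 3
lambda.hat    # 3 / 6.5
```

Both terms are computed on whole vectors, so no explicit loop over the edges is needed.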
References

[1] Agapow P.-M. & Purvis A. 2002. Power of eight tree shape statistics to detect nonrandom diversification: A comparison by simulation of two models of cladogenesis. Systematic Biology 51: 866–872.
[2] Aldous D. 1996. Probability distributions on cladograms. In: Random Discrete Structures, Aldous D. & Pemantle R., editors, pages 1–18. IMA,.
[3] Aldous D. J. 2001. Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Statistical Science 16: 23–34.
[4] Baldauf S. L. 2003. Phylogeny for the faint of heart: A tutorial. Trends in Genetics 19: 345–351.
[5] Baldauf S. L., Bhattacharya D., Cockrill J., Hugenholtz P., Pawlowski J. & Simpson A. G. B. 2004. The tree of life: An overview. In: Assembling the tree of life, Cracraft J. & Donoghue M. J., editors, pages 43–75. Oxford University Press, Oxford.
[6] Barhen J., Protopopescu V. & Reister D. 1997. TRUST: A deterministic algorithm for global optimization. Science 276: 1094–1097.
[7] Barndorff-Nielsen O. E. & Shephard N. 2001. Non-Gaussian Ornstein–Uhlenbeck-based models and some of their uses in financial economics (with discussion). Journal of the Royal Statistical Society. Series B. Methodological 63: 167–241.
[8] Becker R. A., Chambers J. M. & Wilks A. R. 1988. The New S Language. Chapman & Hall, London.
[9] Billera L. J., Holmes S. P. & Vogtmann K. 2001. Geometry of the space of phylogenetic trees. Advances in Applied Mathematics 27: 733–767.
[10] Böhning-Gaese K., Schuda M. D. & Helbig A. J. 2003. Weak phylogenetic effects on ecological niches of Sylvia warblers. Journal of Evolutionary Biology 16: 956–965.
[11] Bokma F. 2003. Testing for equal rates of cladogenesis in diverse taxa. Ecology 57: 2469–2474.
[12] Brocchieri L. 2001. Phylogenetic inferences from molecular sequences: Review and critique. Theoretical Population Biology 59: 27–40.
[13] Buckland S. T., Burnham K. P. & Augustin N. H. 1997. Model selection: An integral part of inference. Biometrics 53: 603–618.
[14] Burnham K. P. & Anderson D. R. 2002. Model Selection and Multimodel Inference. A Practical Information-Theoretic Approach (Second Edition). Springer, New York.
[15] Burnham K. P. & White G. C. 2002. Evaluation of some random effects methodology applicable to bird ringing data. Journal of Applied Statistics 29: 245–264.
[16] Butler M. A. & King A. A. 2004. Phylogenetic comparative analysis: A modeling approach for adaptive evolution. American Naturalist 164: 683–695.
[17] Chenna R., Sugawara H., Koike T., Lopez R., Gibson T. J., Higgins D. G. & Thompson J. D. 2003. Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Research 31: 3497–3500.
[18] Cheverud J. M., Dow M. M. & Leutenegger W. 1985. The quantitative assessment of phylogenetic constraints in comparative analyses: Sexual dimorphism in body weight among primates. Evolution 39: 1335–1351.
[19] Chor B. & Tuller T. 2005. Maximum likelihood of evolutionary trees: Hardness and approximation. Bioinformatics 21: i97–i106.
[20] Cox D. R. & Oakes D. 1984. Analysis of Survival Data. Monographs on statistics and applied probability. Chapman and Hall, London.
[21] Crosbie S. F. & Manly B. F. J. 1985. Parsimonious modelling of capture-mark-recapture studies. Biometrics 41: 385–398.
[22] Darwin C. 1859. On the Origin of Species by Means of Natural Selection. John Murray, London.
[23] Dempster A. P., Laird N. M. & Rubin D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society. Series B. Methodological 39: 1–38.
[24] Desdevises Y., Legendre P., Azouzi L. & Morand S. 2003. Quantifying phylogenetically structured environmental variation. Evolution 57: 2647–2652.
[25] Diaconis P. W. & Holmes S. P. 1998. Matchings and phylogenetic trees. Proceedings of the National Academy of Sciences USA 95: 14600–14602.
[26] Diniz-Filho J. A. F., de Sant’Ana C. E. R. & Bini L. M. 1998. An eigenvector method for estimating phylogenetic inertia. Evolution 52: 1247–1262.
[27] Edwards A. W. F. 1992. Likelihood (Expanded Edition). Johns Hopkins University Press, Baltimore.
[28] Edwards A. W. F. 1998. History and Philosophy of Phylogeny Methods. Talk at the EC Summer School Methods for Molecular Phylogenies, Newton Institute, Cambridge, UK.
[29] Efron B. 1981. Nonparametric estimates of standard error: the jackknife, the bootstrap and other methods. Biometrika 68: 589–599.
[30] Efron B. 1998. R. A. Fisher in the 21st century (with discussion). Statistical Science 13: 95–114.
[31] Efron B., Halloran E. & Holmes S. 1996. Bootstrap confidence levels for phylogenetic trees. Proceedings of the National Academy of Sciences USA 93: 13429–13434.
[32] Efron B. & Tibshirani R. 1991. Statistical analysis in the computer age. Science 253: 390–395.
[33] Emerson B., Paradis E. & Thébaud C. 2001. Revealing the demographic histories of species using DNA sequences. Trends in Ecology & Evolution 16: 707–716.
[34] Felsenstein J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution 17: 368–376.
[35] Felsenstein J. 1985. Confidence limits on phylogenies: An approach using the bootstrap. Evolution 39: 783–791.
[36] Felsenstein J. 1985. Phylogenies and the comparative method. American Naturalist 125: 1–15.
[37] Felsenstein J. 1988. Phylogenies and quantitative characters. Annual Review of Ecology and Systematics 19: 445–471.
[38] Felsenstein J. 1993. Phylip (Phylogeny Inference Package) Version 3.5c. http://evolution.genetics.washington.edu/phylip/phylip.html. Department of Genetics, University of Washington, Seattle.
[39] Felsenstein J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, MA.
[40] Felsenstein J. & Churchill G. A. 1996. A Hidden Markov model approach to variation among sites in rate of evolution. Molecular Biology and Evolution 13: 93–104.
[41] Fry B. G. 2005. From genome to “venome”: molecular origin and evolution of the snake venom proteome inferred from phylogenetic analysis of toxin sequences and related body proteins. Genome Research 15: 403–420.
[42] Futuyma D. J. 1998. Evolutionary Biology (Third Edition). Sinauer Associates, Sunderland, MA.
[43] Galtier N. & Gouy M. 1995. Inferring phylogenies from DNA sequences of unequal base compositions. Proceedings of the National Academy of Sciences USA 92: 11317–11321.
[44] Galtier N. & Gouy M. 1998. Inferring pattern and process: Maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Molecular Biology and Evolution 15: 871–879.
[45] Garland, Jr. T. & Adolph S. C. 1991. Physiological differentiation of vertebrate populations. Annual Review of Ecology and Systematics 22: 193–228.
[46] Garland, Jr. T. & Carter P. A. 1994. Evolutionary physiology. Annual Review of Physiology 56: 579–621.
[47] Garland, Jr. T., Dickerman A. W., Janis C. M. & Jones J. A. 1993. Phylogenetic analysis of covariance by computer simulation. Systematic Biology 42: 265–292.
[48] Garland, Jr. T., Harvey P. H. & Ives A. R. 1992. Procedures for the analysis of comparative data using phylogenetically independent contrasts. Systematic Biology 41: 18–32.
[49] Gentleman R. 2004. Some perspectives on statistical computing. Canadian Journal of Statistics 32: 209–226.
[50] Giannini N. P. 2003. Canonical phylogenetic ordination. Systematic Biology 52: 684–695.
[51] Gibson A., Gowri-Shankar V., Higgs P. G. & Rattray M. 2005. A comprehensive analysis of mammalian mitochondrial genome base composition and improved phylogenetic methods. Molecular Biology and Evolution 22: 251–264.
[52] Gittleman J. L. 1986. Carnivore life history patterns: Allometric, phylogenetic and ecological associations. American Naturalist 127: 744–771.
[53] Gittleman J. L. & Kot M. 1990. Adaptation: Statistics and a null model for estimating phylogenetic effects. Systematic Zoology 39: 227–241.
[54] Grafen A. 1989. The phylogenetic regression. Philosophical Transactions of the Royal Society of London. Series B. Biological Sciences 326: 119–157.
[55] Grafen A. & Ridley M. 1996. Statistical tests for discrete cross-species data. Journal of Theoretical Biology 183: 255–267.
[56] Grafen A. & Ridley M. 1997. A new model for discrete character evolution. Journal of Theoretical Biology 184: 7–14.
[57] Grafen A. & Ridley M. 1997. Non-independence in statistical tests for discrete cross-species data. Journal of Theoretical Biology 188: 507–514.
[58] Guindon S. & Gascuel O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology 52: 696–704.
[59] Hackett S. J. 1996. Molecular phylogenetics and biogeography of tanagers in the genus Ramphocelus (Aves). Molecular Phylogenetics and Evolution 5: 368–382.
[60] Hall B. G. 2004. Phylogenetic Trees Made Easy: A how-to Manual (Second Edition). Sinauer Associates, Sunderland, MA.
[61] Hansen T. F. 1997. Stabilizing selection and the comparative analysis of adaptation. Evolution 51: 1341–1351.
[62] Hansen T. F. & Martins E. P. 1996. Translating between microevolutionary process and macroevolutionary patterns: The correlation structure of interspecific data. Evolution 50: 1404–1417.
[63] Harvey P. H., May R. M. & Nee S. 1994. Phylogenies without fossils. Evolution 48: 523–529.
[64] Harvey P. H. & Pagel M. D. 1991. The Comparative Method in Evolutionary Biology. Oxford University Press, Oxford.
[65] Harvey P. H. & Purvis A. 1991. Comparative methods for explaining adaptations. Nature 351: 619–624.
[66] Hasegawa M., Kishino H. & Yano T.-a. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 22: 160–174.
[67] Hebert P. D. N., Penton E. H., Burns J. M., Janzen D. H. & Hallwachs W. 2004. Ten species in one: DNA barcoding reveals cryptic species in the neotropical skipper butterfly Astraptes fulgerator. Proceedings of the National Academy of Sciences USA 101: 14812–14817.
[68] Holder M. & Lewis P. O. 2003. Phylogeny estimation: Traditional and Bayesian approaches. Nature Reviews Genetics 4: 275–284.
[69] Holmes S. 2003. Statistics for phylogenetic trees. Theoretical Population Biology 63: 17–32.
[70] Housworth E. A., Martins E. P. & Lynch M. 2004. The phylogenetic mixed model. American Naturalist 163: 84–96.
[71] Huelsenbeck J. P. & Rannala B. 1997. Phylogenetic methods come of age: testing hypotheses in an evolutionary context. Science 276: 227–232.
[72] Huelsenbeck J. P., Rannala B. & Masly J. P. 2000. Accommodating phylogenetic uncertainty in evolutionary studies. Science 288: 2349–2350.
[73] Hunter J. P. 1998. Key innovations and the ecology of macroevolution. Trends in Ecology & Evolution 13: 31–36.
[74] Ihaka R. & Gentleman R. 1996. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics 5: 299–314.
[75] Johnson W. E. & O’Brien S. J. 1997. Phylogenetic reconstruction of the Felidae using 16S rRNA and NADH-5 mitochondrial genes. Journal of Molecular Evolution 44: S98–S116.
[76] Jones K. E., Purvis A., MacLarnon A., Bininda-Emonds O. R. P. & Simmons N. B. 2002. A phylogenetic supertree of the bats (Mammalia: Chiroptera). Biological Reviews of the Cambridge Philosophical Society 77: 223–259.
[77] Jukes T. H. & Cantor C. R. 1969. Evolution of protein molecules. In: Mammalian Protein Metabolism, Munro H. N., editor, pages 21–132. Academic Press, New York.
[78] Keiding N. 1975. Maximum likelihood estimation in the birth-and-death process. Annals of Statistics 3: 363–372.
[79] Kendall D. G. 1948. On the generalized “birth-and-death” process. Annals of Mathematical Statistics 19: 1–15.
[80] Kendall D. G. 1949. Stochastic processes and population growth. Journal of the Royal Statistical Society. Series B. Methodological 11: 230–264.
[81] Kimura M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16: 111–120.
[82] Kimura M. 1981. Estimation of evolutionary distances between homologous nucleotide sequences. Proceedings of the National Academy of Sciences USA 78: 454–458.
[83] Kirkpatrick M. & Slatkin M. 1993. Searching for evolutionary patterns in the shape of a phylogenetic tree. Evolution 47: 1171–1181.
[84] Kosakovsky Pond S. L. & Muse S. V. 2004. Column sorting: rapid calculation of the phylogenetic likelihood function. Systematic Biology 53: 685–692.
[85] Kosiol C. & Goldman N. 2005. Different versions of the Dayhoff rate matrix. Molecular Biology and Evolution 22: 193–199.
[86] Lachaud B. 2005. Cut-off and hitting times of a sample of Ornstein–Uhlenbeck processes and its average. Journal of Applied Probability 42: 1069–1080.
[87] Lanave C., Preparata G., Saconne C. & Serio G. 1984. A new method for calculating evolutionary substitution rates. Journal of Molecular Evolution 20: 86–93.
[88] Larget B., Simon D. L. & Kadane J. B. 2002. Bayesian phylogenetic inference from animal mitochondrial genome arrangements. Journal of the Royal Statistical Society. Series B. Methodological 64: 681–693.
[89] Lecompte E., Granjon L., Peterhans J. K. & Denys C. 2002. Cytochrome b-based phylogeny of the Praomys group (Rodentia, Murinae): A new African radiation? Comptes Rendus Biologies 325: 827–840.
[90] Legendre P. & Makarenkov V. 2002. Reconstruction of biogeographic and evolutionary networks using reticulograms. Systematic Biology 51: 199–216.
[91] Leisch F. 2002. Dynamic generation of statistical reports using literate data analysis. In: Compstat 2002 - Proceedings in Computational Statistics, Haerdle W. & Roenz B., editors, pages 575–580. Physika Verlag, Heidelberg.
[92] Liang K.-Y. & Zeger S. L. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73: 13–22.
[93] Losos J. B. & Adler F. R. 1995. Stumped by trees? A generalized null model for patterns of organismal diversity. American Naturalist 145: 329–342.
[94] Lynch M. 1991. Methods for the analysis of comparative data in evolutionary biology. Evolution 45: 1065–1080.
[95] Maddison D. R., Swofford D. L. & Maddison W. P. 1997. NEXUS: An extensible file format for systematic information. Systematic Biology 46: 590–621.
[96] Martins E. P. & Hansen T. F. 1997. Phylogenies and the comparative method: A general approach to incorporating phylogenetic information into the analysis of interspecific data [erratum in vol. 153, no. 4, p. 488]. American Naturalist 149: 646–667.
[97] McCullough B. D. 1999. Assessing the reliability of statistical software: Part II. American Statistician 53: 149–159.
[98] McCullough B. D. & Vinod H. D. 1999. The numerical reliability of econometric software. Journal of Economic Literature 37: 633–665.
[99] McLeod A. I. 1993. Parsimony, model adequacy and periodic correlation in time-series forecasting. International Statistical Review 61: 387–393.
[100] Michaux J., Chevret P., Filipucci M.-G. & Macholan M. 2002. Phylogeny of the genus Apodemus with a special emphasis on the subgenus Sylvaemus using the nuclear IRBP gene and two mitochondrial markers: cytochrome b and 12S rRNA. Molecular Phylogenetics and Evolution 23: 123–136.
[101] Minin V., Abdo Z., Joyce P. & Sullivan J. 2003. Performance-based selection of likelihood models for phylogeny estimation. Systematic Biology 52: 674–683.
[102] Moran P. A. P. 1950. Notes on continuous stochastic phenomena. Biometrika 37: 17–23.
[103] Nee S., Holmes E. C., Rambaut A. & Harvey P. H. 1995. Inferring population history from molecular phylogenies. Philosophical Transactions of the Royal Society of London. Series B. Biological Sciences 349: 25–31.
[104] Nee S., May R. M. & Harvey P. H. 1994. The reconstructed evolutionary process. Philosophical Transactions of the Royal Society of London. Series B. Biological Sciences 344: 305–311.
[105] Nee S., Mooers A. Ø. & Harvey P. H. 1992. Tempo and mode of evolution revealed from molecular phylogenies. Proceedings of the National Academy of Sciences USA 89: 8322–8326.
[106] Nei M. & Kumar S. 2000. Molecular Evolution and Phylogenetics. Oxford University Press, Oxford.
[107] Oakley T. H. 2003. Maximum likelihood models of trait evolution. Comments on Theoretical Biology 8: 1–17.
[108] Ollier S., Couteron P. & Chessel D. 2005. Orthonormal transform to decompose the variance of a life-history trait across a phylogenetic tree. Biometrics doi:10.1111/j.1541-0420.2005.00497.x.
[109] Pagel M. 1994. Detecting correlated evolution on phylogenies: A general method for the comparative analysis of discrete characters. Proceedings of the Royal Society of London. Series B. Biological Sciences 255: 37–445.
[110] Pagel M. & Meade A. 2004. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Systematic Biology 53: 571–581.
[111] Paradis E. 1997. Assessing temporal variations in diversification rates from phylogenies: Estimation and hypothesis testing. Proceedings of the Royal Society of London. Series B. Biological Sciences 264: 1141–1147.
[112] Paradis E. 1998. Testing for constant diversification rates using molecular phylogenies: A general approach based on statistical tests for goodness of fit. Molecular Biology and Evolution 15: 476–479.
[113] Paradis E. 2003. Analysis of diversification: Combining phylogenetic and taxonomic data. Proceedings of the Royal Society of London. Series B. Biological Sciences 270: 2499–2505.
[114] Paradis E. 2004. Can extinction rates be estimated without fossils? Journal of Theoretical Biology 229: 19–30.
[115] Paradis E. 2005. Statistical analysis of diversification with species traits. Evolution 59: 1–12.
[116] Paradis E. & Claude J. 2002. Analysis of comparative data using generalized estimating equations. Journal of Theoretical Biology 218: 175–185.
[117] Paradis E., Claude J. & Strimmer K. 2004. APE: Analyses of phylogenetics and evolution in R language. Bioinformatics 20: 289–290.
[118] Penny D. & Hendy M. D. 1985. The use of tree comparison metrics. Systematic Zoology 34: 75–82.
[119] Pinheiro J. C. & Bates D. M. 2000. Mixed-Effects Models in S and S-PLUS. Springer, New York.
[120] Posada D. & Buckley T. R. 2004. Model selection and model averaging in phylogenetics: Advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Systematic Biology 53: 793–808.
[121] Posada D. & Crandall K. A. 1998. MODELTEST: Testing the model of DNA substitution. Bioinformatics 14: 817–818.
[122] Posada D. & Crandall K. A. 2001. Selecting the best-fit model of nucleotide substitution. Systematic Biology 50: 580–601.
[123] Pupko T., Huchon D., Cao Y., Okada N. & Hasegawa M. 2002. Combining multiple data sets in a likelihood analysis: Which models are the best? Molecular Biology and Evolution 19: 2294–2307.
[124] Purvis A. & Garland, Jr. T. 1993. Polytomies in comparative analyses of continuous characters. Systematic Biology 42: 569–575.
[125] Pybus O. G. & Harvey P. H. 2000. Testing macro-evolutionary models using incomplete molecular phylogenies. Proceedings of the Royal Society of London. Series B. Biological Sciences 267: 2267–2272.
[126] Pybus O. G., Rambaut A., Holmes E. C. & Harvey P. H. 2002. New inferences from tree shape: Numbers of missing taxa and population growth rates. Systematic Biology 51: 881–888.
[127] Quader S., Isvaran K., Hale R. E., Miner B. G. & Seavy N. E. 2004. Non-linear relationships and phylogenetically independent contrasts. Journal of Evolutionary Biology 17: 709–715.
[128] R Development Core Team. 2005. R Language Definition. Version 2.2.0. R Foundation for Statistical Computing, Vienna.
[129] R Development Core Team. 2005. Writing R Extensions. Version 2.2.0. R Foundation for Statistical Computing, Vienna.
[130] Read A. F. & Nee S. 1995. Inference from binary comparative data. Journal of Theoretical Biology 173: 99–108.
[131] Ridley M. 1992. Darwin sound on comparative method. Trends in Ecology & Evolution 7: 37.
[132] Rohlf F. J. 2001. Comparative methods for the analysis of continuous variables: Geometric interpretations. Evolution 55: 2143–2160.
[133] Rzhetsky A. & Nei M. 1992. A simple method for estimating and testing minimum-evolution trees. Molecular Biology and Evolution 9: 945–967.
[134] Saitou N. & Nei M. 1987. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4: 406–425.
[135] Sanderson M. J. 1997. A nonparametric approach to estimating divergence times in the absence of rate constancy. Molecular Biology and Evolution 14: 1218–1231.
[136] Sanderson M. J. 2002. Estimating absolute rates of molecular evolution and divergence times: A penalized likelihood approach. Molecular Biology and Evolution 19: 101–109.
[137] Sanderson M. J., Purvis A. & Henze C. 1998. Phylogenetic supertrees: Assembling the trees of life. Trends in Ecology & Evolution 13: 105–109.
[138] Schluter D., Price T., Mooers A. Ø. & Ludwig D. 1997. Likelihood of ancestor states in adaptive radiation. Evolution 51: 1699–1711.
[139] Schnabel R. B., Koontz J. E. & Weiss B. E. 1985. A modular system of algorithms for unconstrained minimization. ACM Transactions on Mathematical Software 11: 419–440.
[140] Sibley C. G. & Ahlquist J. E. 1990. Phylogeny and Classification of Birds: A Study in Molecular Evolution. Yale University Press, New Haven, CT.
[141] Sibley C. G. & Monroe, Jr. B. L. 1990. Distribution and Taxonomy of Birds of the World. Yale University Press, New Haven, CT.
[142] Skelton P., editor. 1993. Evolution: A Biological and Palaeontological Approach. Addison-Wesley and The Open University, Harlow, UK.
[143] Smith F. A., Lyons S. K., Ernest S. K. M., Jones K. E., Kaufman D. M., Dayan T., Marquet P. A., Brown J. H. & Haskell J. P. 2003. Body mass of late quaternary mammals. Ecology 84: 3403.
[144] Stamatakis A., Ludwig T. & Meier H. 2005. RAxML-III: A fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 21: 456–463.
[145] Stauffer R. L., Walker A., Ryder O. A., Lyons-Weiler M. & Hedges S. B. 2001. Human and ape molecular clocks and constraints on paleontological hypotheses. Journal of Heredity 92: 469–474.
[146] Stephens M. A. 1974. EDF statistics for goodness of fit and some comparisons. Journal of American Statistical Association 69: 730–737.
[147] Stephens M. A. 1982. Anderson-Darling test for goodness of fit. In: Encyclopedia of Statistical Science. Volume 1, Kotz S. & Johnson N. L., editors, pages 81–85. John Wiley & Sons, New York.
[148] Suzuki Y., Glazko G. V. & Nei M. 2002. Overcredibility of molecular phylogenies obtained by Bayesian phylogenetics. Proceedings of the National Academy of Sciences USA 99: 16138–16143.
[149] Tallis G. M. 1983. Goodness of fit. In: Encyclopedia of Statistical Science. Volume 3, Kotz S. & Johnson N. L., editors, pages 451–461. John Wiley & Sons, New York.
[150] Tamura K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Molecular Biology and Evolution 9: 678–687.
[151] Tamura K. & Nei M. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution 10: 512–526.
[152] Tamura K., Nei M. & Kumar S. 2004. Prospects for inferring very large phylogenies by using the neighbor-joining method. Proceedings of the National Academy of Sciences USA 101: 11030–11035.
[153] Thompson J. D., Gibson T. J., Plewniak F., Jeanmougin F. & Higgins D. G. 1997. The CLUSTALX windows interface: Flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 25: 4876–4882.
[154] Venables W. N. & Ripley B. D. 2002. Modern Applied Statistics with S (Fourth Edition). Springer, New York.
[155] Whelan S., Liò P. & Goldman N. 2001. Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends in Genetics 17: 262–272.
[156] Yang Z. 1994. Estimating the pattern of nucleotide substitution. Journal of Molecular Evolution 39: 105–111.
[157] Yang Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. Journal of Molecular Evolution 39: 306–314.
[158] Yang Z. 1996. Maximum-likelihood models for combined analyses of multiple sequence data. Journal of Molecular Evolution 42: 587–596.
[159] Yang Z. 2000. Maximum likelihood estimation on large phylogenies and analysis of adaptive evolution in human influenza virus A. Journal of Molecular Evolution 51: 423–432.
