Digital signal processor fundamentals and system design

M.E. Angoletta
CERN, Geneva, Switzerland

Abstract

Digital Signal Processors (DSPs) have been used in accelerator systems for more than fifteen years and have largely contributed to the evolution towards digital technology of many accelerator systems, such as machine protection, diagnostics and control of beams, power supplies and motors. This paper aims at familiarising the reader with DSP fundamentals, namely DSP characteristics and processing development. Several DSP examples are given, in particular on Texas Instruments DSPs, as they are used in the DSP laboratory companion of the lectures this paper is based upon. The typical system design flow is described; common difficulties, problems and choices faced by DSP developers are outlined; and hints are given on the best solution.

1 Introduction

1.1 Overview

Digital Signal Processors (DSPs) are microprocessors with the following characteristics:

a) Real-time digital signal processing capabilities. DSPs typically have to process data in real time, i.e., the correctness of the operation depends heavily on the time when the data processing is completed.
b) High throughput. DSPs can sustain processing of high-speed streaming data, such as audio and multimedia data processing.
c) Deterministic operation. The execution time of DSP programs can be foreseen accurately, thus guaranteeing a repeatable, desired performance.
d) Re-programmability by software. Different system behaviour might be obtained by re-coding the algorithm executed by the DSP instead of by hardware modifications.

DSPs appeared on the market in the early 1980s. Over the last 15 years they have been the key enabling technology for many electronics products in fields such as communication systems, multimedia, automotive, instrumentation and military. Table 1 gives an overview of some of these fields and of the corresponding typical DSP applications.

Figure 1 shows a real-life DSP application, namely the use of a Texas Instruments (TI) DSP in an MP3 voice recorder–player. The DSP implements the audio and encode functions. Additional tasks carried out are file management, user interface control, and post-processing algorithms such as equalization and bass management.

Table 1: A short selection of DSP fields of use and specific applications

Field                           Application
Communication (broadband)       Video conferencing/phone; voice/multimedia over IP; digital media gateways (VOD)
Communication (wireless)        Satellite phone; base station
Security                        Biometrics; video surveillance
Consumer (entertainment)        Digital still/video camera; digital radio; portable media player/entertainment console
Consumer (toys)                 Interactive toys; video game console
Medical                         MRI; ultrasound; X-ray
Industrial and entertainment    Scanner; point of sale; vending machine
Industrial                      Factory automation; industrial/machine/motor control; vision system
Military and aerospace          Guidance (radar, sonar); avionics; digital radio; smart munitions, target detection

Fig. 1: Use of Texas Instruments DSP in an MP3 player/recorder system. Picture courtesy of Texas Instruments from www.ti.com.

1.2 Use in accelerators

DSPs have been used in accelerators since the mid-1980s. Typical uses include diagnostics, machine protection and feedforward/feedback control. In diagnostics, DSPs implement beam tune, intensity, emittance and position measurement systems. For machine protection, DSPs are used in beam current and beam loss monitors. For control, DSPs often implement beam controls, a complex task where beam dynamics is an important factor for the control requirements and implementations. Other types of control include motor control, such as collimation, or power converter control and regulation. The reader can find more information on DSP applications to accelerators in Refs. [1–3].

DSPs are located in the system front-end. Figure 2 shows CERN’s hierarchical controls infrastructure, a three-tier distributed model providing a clear separation between Graphical User Interface (GUI), server, and device (front-end) tiers. DSPs are typically hosted on VME boards which can include one or more programmable devices such as Complex Programmable Logic Devices (CPLDs) or Field Programmable Gate Arrays (FPGAs). Daughtercards, indicated in Fig. 2 as dashed boxes, are often used; their aim is to construct a system from building blocks and to customize it through different FPGA/DSP codes and through the type of daughtercards used. DSPs and FPGAs are often connected to other parts of the system via low-latency data links. Digital input/output, timing, and reference signals are also typically available. Data are exchanged between the front-end computer and the DSP over the VME bus via a driver.

Fig. 2: Typical controls infrastructure used at CERN and DSP characteristics location

2 DSP evolution and current scenery

DSPs appeared on the market in the early 1980s. Since then, they have undergone an intense evolution in terms of hardware features, integration, and software development tools. DSPs are now a mature technology. This section gives an overview of the evolution of DSPs over their 25-year life span; specialized terms such as ‘Harvard architecture’, ‘pipelining’, ‘instruction set’ or ‘JTAG’ are used. The reader is referred to the following sections for explanations of their meaning. More detailed information on DSP evolution can be found in Refs. [4], [5].

2.1 DSP evolution: hardware features

In the late 1970s there were many chips aimed at digital signal processing; however, they are not considered to be digital signal processors owing to either their limited programmability or their lack of hardware features such as hardware multipliers. The first marketed chip to qualify as a programmable DSP was NEC’s µPD7720, in 1981: it had a hardware multiplier and adopted the Harvard architecture (more information on this architecture is given in Section 3.1). Another early DSP was the TMS320C10, marketed by TI in 1982. Figure 3 shows a selective chronological list of DSPs that have been marketed from the early 1980s until now.

From a market evolution viewpoint, we can divide the two and a half decades of DSP life span into two phases: a development phase, which lasted until the early 1990s, and a consolidation phase, lasting until now. Figure 3 gives an overview of the evolution of DSP features together with the first year of marketing for some DSP families.

Fig. 3: Evolution of DSP features from their early days until now. The first year of marketing is indicated at the top for some DSP families.

During the market development phase, DSPs were typically based upon the Harvard architecture. The first generation of DSPs included multiply, add, and accumulator units. Examples are TI’s TMS320C10 and Analog Devices’ (ADI) ADSP-2101. The second generation of DSPs retained the architectural structure of the first generation but added features such as pipelining, multiple arithmetic units, special address generator units, and Direct Memory Access (DMA). Examples include TI’s TMS320C20 and Motorola’s DSP56002. While the first DSPs were capable of fixed-point operations only, towards the end of the 1980s DSPs with floating-point capabilities started to appear. Examples are Motorola’s DSP96001 and TI’s TMS320C30. It should be noted that the floating-point format was not always IEEE-compatible. For instance, the TMS320C30 internal calculations were carried out in a proprietary format; a hardware chip converter [6] was available to convert to the standard IEEE format.

DSPs belonging to the development phase were characterized by fixed-width instruction sets, where one instruction was executed per clock cycle. These instructions could be complex, encompassing several operations. The width of the instruction was typically quite short and did not exceed the DSP native word width. As for DSP producers, the market was nearly equally shared between many manufacturers such as AT&T, Fujitsu, Hitachi, IBM, NEC, Toshiba, Texas Instruments and, towards the end of the 1980s, Motorola, Analog Devices and Zoran.

During the market consolidation phase, enhanced DSP architectures such as Very Long Instruction Word (VLIW) and Single Instruction Multiple Data (SIMD) emerged. These architectures increase the DSP performance through parallelism. Examples of DSPs with enhanced architectures are TI’s TMS320C6xxx family, the first DSPs to implement the VLIW architecture, and ADI’s TigerSHARC, which includes both VLIW and SIMD features. The number of on-chip peripherals increased greatly during this phase, as did the hardware features that allow many processors to work together. Technologies that allow real-time data exchange between host processor and DSP started to appear towards the end of the 1990s. This constituted a real sea change in DSP system debugging and helped the developers enormously.

Another phenomenon observed during this phase was the reduction of the number of DSP manufacturers. The number of DSP families was also greatly reduced, in favour of wider families that granted increased code compatibility between DSPs of different generations belonging to the same family. Additionally, many DSP families are not ‘general-purpose’ but are focused on specific digital signal processing applications, such as audio equipment or control loops.

2.2 DSP evolution: device integration

Table 2 shows the evolution over the last 25 years of some key device characteristics and their expected values after the year 2010.

Table 2: Overview of DSP device characteristics as a function of time. The last column refers to expected values.

Characteristic \ Year     1980    1990    2000     >2010
Wafer size [inches]       3       6       12       18
Die size [mm]             50      50      50       5
Feature [µm]              3       0.8     0.1      0.02
RAM [bytes]               256     2000    32000    1 million
Clock frequency [MHz]     20      80      1000     10000
Power [mW/MIPS]           250     12.5    0.1      0.001
Price [USD]               150     15      5        0.15

Wafer, die, and feature sizes are the basic key factors that define a chip technology. The wafer size is the diameter of the wafer used in the semiconductor manufacturing process. The die size is the size of the actual chips carved up in a wafer. The feature size is the size of the smallest circuit component (typically a transistor) that can be etched on a wafer; this is used as an overall indicator of the density of an Integrated Circuit (IC) fabrication process. The trend in industry is to go towards larger wafers and chip dies, so as to increase the number of working chips that can be obtained from the same wafer, also called the yield. For instance, the current typical wafer size is 12 inches (300 mm), and some leading chip maker companies plan to move to 18 inches (450 mm) within the first half of the next decade. (It should be added that the issue is somewhat controversial, as many equipment manufacturers fear that the 18-inch wafer size will lead to scale problems even worse than for the 12-inch one.) Feature size is decreasing, allowing one either to have more functionality on a die or to reduce the die size while keeping the same functionality.

Transistors with smaller sizes require less voltage to drive them; this results in a decrease of the core voltage from 5 V to 1.5 V. The I/O voltage has been lowered as well, with the caveat that it remains compatible with the external devices used and their standard. A lower core voltage has been one of the key factors enabling higher clock frequencies: in fact, the gap between high and low state thresholds is tightened, thus allowing a faster logic level transition. Additionally, the reduced die size and lowered core voltage allow lower power consumption, an important factor for portable or mobile systems. Finally, the global cost of a chip has decreased by at least a factor 30 over the last 25 years.

The trend towards faster switching hardware (including chip over-clocking) and smaller feature size carries the benefit of increased processing power and throughput. There is a downside to it, however, represented by the electromigration phenomenon. Electromigration occurs when some of the momentum of a moving electron is transferred to a nearby activated ion, hence causing the ion to move from its original position. Gaps or, on the contrary, unintended electrical connections can develop with time in the conducting material if a significant number of atoms are moved far from their
original position. The consequence is the electrical failure of the electronic interconnects and the consequent shortened chip lifetime.

2.3 DSP evolution: software tools

The improvement of DSP software tools from the early days until now has been spectacular. Code compilers have evolved greatly to be able to deal with the underlying hardware complexity and the enhanced DSP architectures. At the same time, they allow the developer to program more and more efficiently in high-level languages as opposed to assembly coding. This considerably shortens the code development time and makes the code itself more portable across different platforms. Advanced tools now allow the programming of DSPs graphically, i.e., by interconnecting pre-defined blocks that are then converted to DSP code. Examples of these tools are MATLAB Code Generation and embedded target products and National Instruments' LabVIEW DSP Module. High-performance simulators, emulators and debugging facilities allow the developer to have a high visibility into the DSP with little or no interference on the program execution. Additionally, multiple DSPs can be accessed in the same JTAG chain for both code development and debugging.

2.4 DSP current scenery

The number of DSP vendors is currently somewhat limited: Analog Devices (ADI), Freescale (formerly Motorola), Texas Instruments (TI), Renesas, Microchip and VeriSilicon are the main players. Amongst them, the biggest share of the market is taken by only three vendors, namely ADI, TI and Freescale. In the accelerator sector one can find mostly ADI and TI DSPs, hence most of the examples in this document will be focused on them. Table 3 lists the main DSP families for ADI and TI DSPs, together with their typical use and performance.

Table 3: Main ADI and TI DSP families, together with their typical use and performance

Manufacturer    Family        Typical use and performance
TI              TMS320C2x     Digital signal controllers
TI              TMS320C5x     Power efficient
TI              TMS320C6x     High performance
ADI             SHARC         Medium performance. First ADI family (now three generations)
ADI             TigerSHARC    High performance for multi-processor systems
ADI             Blackfin      High performance and low power

3 DSP core architecture

3.1 Introduction

DSP architecture has been shaped by the requirements of predictable and accurate real-time digital signal processing. An example is the Finite Impulse Response (FIR) filter, with the corresponding mathematical equation (1), where y is the filter output, x is the input data and a is a vector of filter coefficients. Depending on the application, there might be just a few filter coefficients or many hundreds or more.

$$ y(n) = \sum_{k=0}^{M} a_k \cdot x(n-k) \qquad (1) $$

As shown in Eq. (1), the main component of a filter algorithm is the ‘multiply and accumulate’ operation, typically referred to as MAC. Coefficients and data have to be retrieved from the memory and the whole operation must be executed in a predictable and fast way, so as to sustain a high throughput rate. Finally, high accuracy should typically be guaranteed. These requirements are common to many other algorithms performed in digital signal processing, such as Infinite Impulse Response (IIR) filters and Fourier Transforms.

Table 4 shows a selection of processing requirements together with the main DSP hardware features satisfying them. These hardware features are discussed in more detail in Sections 3.2 to 3.5 and a full overview of a typical DSP core will be built step by step (see Figs. 4, 7, 10, 13). More detailed information on DSP architectural features can be found in Refs. [7]–[14].

Table 4: Main requirements and corresponding DSP hardware implementations for predictable and accurate real-time digital signal processing. The numbers in the first column refer to the section treating the topic.

Section    Processing requirement     Hardware implementations satisfying the requirement
3.2        Fast data access           High-bandwidth memory architectures; specialized addressing modes; Direct Memory Access (DMA)
3.3        Fast computation           MAC-centred; pipelining; parallel architectures (VLIW, SIMD)
3.4        Numerical fidelity         Wide accumulator registers, guard bits, etc.
3.5        Fast execution control     Hardware-assisted, zero-overhead loops, shadow registers, etc.

3.2 Fast data access

Fast data access refers to the need to transfer data to/from memory or DSP peripherals, as well as to retrieve instructions from memory. Three hardware implementations are considered for this, namely a) high-bandwidth memory architectures, discussed in Sub-section 3.2.1; b) specialized addressing modes, discussed in Sub-section 3.2.2; c) direct memory access, discussed in Sub-section 3.2.3.

3.2.1 High-bandwidth memory architectures

Traditional general-purpose microprocessors are based upon the Von Neumann architecture, shown in Fig. 4(a). This consists of a single block of memory, containing both data and program instructions, and of a single bus (called data bus) to transfer data and instructions from/to the CPU. The disadvantage of this architecture is that only one memory access per instruction cycle is possible, thus constituting a bottleneck in the algorithm execution.

DSPs are typically based upon the Harvard architecture, shown in Fig. 4(b), or upon modified versions of it, such as the Super-Harvard architecture shown in Fig. 4(c). In the Harvard architecture there are separate memories for data and program instructions, and two separate buses connect them to the DSP core. This allows fetching program instructions and data at the same time, thus providing better performance at the price of an increased hardware complexity and cost. The Harvard architecture can be improved by adding to the DSP core a small bank of fast memory, called ‘instruction cache’, and allowing data to be stored in the program memory. The last-executed program instructions are relocated at run time in the instruction cache. This is advantageous for instance if the DSP is executing a loop small enough so that all its instructions can fit inside the instruction cache: in this case, the instructions are copied to the instruction cache the first time the DSP executes the loop. Further loop iterations are executed directly from the instruction cache, thus allowing data retrieval from program and data memories at the same time.

Fig. 4: (a) Von Neumann architecture, typical of traditional general-purpose microprocessors. (b) Harvard and (c) Super-Harvard architectures, typical of DSPs.

Another more recent improvement of the Harvard architecture is the presence of a ‘data cache’, namely a fast memory located close to the DSP core which is dynamically loaded with data. Having the cache memory very close to the DSP allows clocking it at high speed, as routing wire delays are short. Figure 5 shows the cache architecture for the TI TMS320C67xx DSP, including both program and data cache. There are two levels of cache, called Level 1 (L1) and Level 2 (L2). The L1 cache comprises 8 kbyte of memory divided into 4 kbyte of program cache and 4 kbyte of data cache. The L2 cache comprises 256 kbyte of memory divided into 192 kbyte mapped-SRAM memory and 64 kbyte dual cache memory. The latter can be configured as mapped memory, cache or a combination of the two. The reader can find more information on the TI TMS320C67xx DSP two-level memory architecture and configuration possibilities in Ref. [12].

Fig. 5: TI DSP TMS320C67xx family two-level cache architecture

Figure 6 shows the hierarchical memory architecture to be found in a modern DSP [13]. Typical levels of memory and corresponding access time, hardware implementation, and size are also shown. As remarked above, a hierarchical memory allows one to take advantage of both the speed and the capacity of different memory types. Registers are banks of very fast internal memory, typically with single-cycle access time. They are a precious DSP resource used for temporary storage of coefficients and intermediate processing values. The L1 cache is typically high-speed static RAM whose cells are made of five or six transistors each. The amount of L1 cache available thus depends directly on the available chip space. An L2 cache typically needs a smaller number of transistors, hence it can be present in higher quantities inside the DSPs. Recent years have also seen the integration of DRAM memory blocks into the DSP chip [14], thus guaranteeing larger internal memories with relatively short access times. The Level 3 (L3) memory shown in Fig. 6 is rarely present in DSPs while the external memory is typically available. This is often a large memory with long access times.

Fig. 6: DSP hierarchical memory architecture and typical number of access clock cycles, hardware implementation, and size for different memory types

As shown above, cache memories improve the average system performance. However, there are drawbacks to the presence of a cache in DSP-based systems, owing to the lack of full predictability for cache hits. A cache miss happens when the data or the instructions needed by the DSP are not stored in cache memory, hence they have to be fetched from a slower memory with an execution speed penalty. A situation causing a cache miss is, for instance, the flow change due to branch instructions. The consequence is a difficult worst-case-scenario prediction, which is particularly negative for DSP-based systems where it is important to be able to calculate and predict the system time response. Methods exist, however, to limit these effects, such as the possibility for the user to lock the cache so as to execute time-critical sections in a deterministic way. Advanced cache organizations characterized by a uniform memory addressing are also under study [15].

3.2.2 Specialized addressing modes

DSPs include specialized addressing modes and corresponding hardware support to allow a rapid access to instruction operands through rapid generation of their location in memory. DSPs typically support a wide range of specialized addressing modes, tailored for an efficient implementation of digital signal processing algorithms. Figure 7 adds the address generator units to the basic DSP architecture shown in Fig. 4(c). As in general-purpose processors, DSPs include a Program Sequencer block, which manages program structure and program flow by supplying addresses to memory for instruction fetches. Unlike general-purpose processors, DSPs include address generator blocks, which control the address generation for specialized addressing modes such as indexed addressing, circular buffers, and bit-reversal addressing. The last two addressing modes are discussed below.
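
As a purely illustrative sketch, the address sequences produced by circular-buffer and bit-reversal addressing can be reproduced in software. On a DSP these sequences come from the address generator hardware at no instruction cost; the function names, the zero-based indexing and the choice of Python below are assumptions made for the illustration, not part of any DSP tool chain.

```python
def circular_read_order(length, stride, count, start=0):
    """Indices visited by `count` successive reads of a circular buffer."""
    order = []
    index = start
    for _ in range(count):
        order.append(index)
        # Wrap around: the last location is followed by the first one.
        index = (index + stride) % length
    return order


def bit_reverse(index, bits):
    """Reverse the order of the lowest `bits` bits of `index`."""
    reversed_index = 0
    for _ in range(bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


def bit_reversed_order(bits):
    """Bit-reversed permutation of the indices 0 .. 2**bits - 1."""
    return [bit_reverse(i, bits) for i in range(1 << bits)]
```

For an eleven-element buffer read with a stride of four, `circular_read_order(11, 4, 11)` visits every location exactly once before returning to the start; `bit_reversed_order(3)` gives the re-ordering needed by an eight-point transform.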

Fig. 7: Program sequencer and address generator units location within a generic DSP core architecture

Circular buffers are limited memory regions where data are stored in a First-In First-Out (FIFO) way; these memory regions are managed in a ‘wrap-around’ way, i.e., the last memory location is followed by the first memory location. Two sets of pointers are used, one for reading and one for writing; the length of the step at which successive memory locations are accessed is called ‘stride’. Address generator units allow striding through the circular buffers without requiring dedicated instructions to determine where to access the following memory location, error detection and so on. Circular buffers allow storing bursts or continuous streams of data and processing them in the order in which they have arrived. Circular buffers are used for instance in the implementation of digital filters; strides higher than one are useful in case of multi-rate signal processing. Figure 8 shows the order in which data are accessed for a read operation in case of an eleven-element circular buffer and with a stride equal to four.

Fig. 8: Example of read data access order in a circular buffer composed of 11 elements and with stride equal to 4 elements

Bit-reversal addressing, shown in Fig. 9, is an essential step in the calculation of discrete Fourier transforms. In fact, many implementations of the Fourier transform require a re-ordering of either the input or the output data that corresponds to reversing the order of the bits in the array index. Figure 9 gives an example of the bit-reversal mechanism. Carrying it out by software is very demanding and would result in using many CPU cycles, which are saved thanks to the hardware bit-reversal functionality.

Fig. 9: Bit-reversal mechanism

3.2.3 Direct Memory Access (DMA) controller

The DMA controller is a second processor working in parallel with the DSP core and dedicated to transferring information between two memory areas or between peripherals and memory. In doing so the DMA controller frees the DSP core for other processing tasks. Figure 10 shows an example of the DMA location within a general DSP core architecture.

Fig. 10: An example of DMA controller location within a generic DSP core architecture

A DMA coprocessor can transfer data as well as program instructions, the latter transfer corresponding typically to the case of code overlay, i.e., of code stored in an external memory and moved to an internal memory (for instance L1) when needed. Multiple and independent DMA channels are also available for greater flexibility. Bus arbitration between the DMA and the DSP core is needed to avoid colliding memory accesses when the DMA and the DSP core share the same bus to access peripherals and/or memories. To prevent bottlenecks, recent DSPs typically fit DMA controllers with dedicated buses.

Figure 11 shows the advantages of DMA for an efficient use of the DSP core: the DSP core must set up the DMA but still there is a net gain in the DSP core availability for other processing activities. Nowadays there are two classes of DMA transfer configurations: register-based and RAM-based, the latter also called descriptor-based. In register-based DMA controllers the transfer set-up is done by the DSP core by writing to the controller registers. This method is very efficient but allows mainly simple DMA operations. In RAM-based DMA controllers the set-up parameters are stored in memory. This method is preferred by powerful and recent DSPs as it allows great DMA transfer flexibility.

Fig. 11: (a) Read–process–write data when the DSP core only is present; (b) same activity when the DMA takes care of data transfers

Figure 12 provides two examples of transfer configurations. Plot (a) shows a chained DMA transfer, where the completion of a data transfer triggers a new transfer. This type of data transfer is particularly suited to applications that require a continuous data stream in input. Plot (b) shows a multi-dimensional data transfer, obtained by changing the stride of the DMA transfer. This type of data transfer is particularly useful for video applications.

Fig. 12: Examples of DMA transfer configurations. (a): chained DMA transfer; (b): multi-dimensional data transfer.

DSP external events and interrupts can be used to trigger a DMA data transfer. DMA controllers can also generate interrupts to communicate with the DSP core, for instance to inform it that a data transfer has been completed. An example of a powerful and highly flexible DMA controller is that implemented for TI’s TMS320C6000 family [16].
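
The chained and strided transfers described above can be mimicked in software to make the descriptor-based set-up more concrete. The sketch below is only a model under stated assumptions: the `Descriptor` fields and the `run_dma` function are hypothetical and do not reproduce the register or descriptor layout of any TI or ADI controller, and memories are modelled as plain Python lists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Descriptor:
    """Hypothetical transfer descriptor stored in memory (RAM-based set-up)."""
    src: int                      # index of the first source element
    dst: int                      # index of the first destination element
    count: int                    # number of elements to move
    src_stride: int = 1           # step between successive source elements
    dst_stride: int = 1           # step between successive destination elements
    next_desc: Optional["Descriptor"] = None  # chained descriptor, if any


def run_dma(first, src_mem, dst_mem):
    """Process a chain of descriptors; return how many were completed."""
    completed = 0
    desc = first
    while desc is not None:
        for i in range(desc.count):
            dst_mem[desc.dst + i * desc.dst_stride] = \
                src_mem[desc.src + i * desc.src_stride]
        completed += 1
        # Completion of one transfer triggers the next one in the chain.
        desc = desc.next_desc
    return completed
```

With a 3 x 4 image stored row by row, a source stride of 4 extracts one column, which is the kind of multi-dimensional access useful for video data; chaining a second descriptor then copies part of a row without further intervention from the core.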

3.3 Fast computation

Here we discuss techniques and architectures used in DSPs for a fast computation. The MAC-centred architecture described in Sub-section 3.3.1 has been common to all DSPs since their early days. The techniques and architectures described in Sub-sections 3.3.2 and 3.3.3 were introduced from the 1990s onwards.

3.3.1 MAC-centred

The MAC operation is used by many digital processing algorithms, as discussed at the beginning of Section 3; consequently its execution must be optimized so as to improve the DSP overall performance. The basic DSP arithmetic processing blocks are a) many registers; b) one or more multipliers; c) one or more Arithmetic Logic Units (ALUs); d) one or more shifters. These blocks work in parallel during the same clock cycle, thus optimizing MAC as well as other arithmetic operations. The blocks are shown in Fig. 13 and are briefly described below.

a) Registers: these are banks of very fast memory used to store intermediate processing data. Very often they are wider than the DSP normal word width, so as to provide a higher resolution during the processing.
b) Multiplier: it can carry out single-cycle multiplications and very often it includes very wide accumulator registers to reduce round-off or truncation errors. As a consequence, truncation and round-off errors will happen only at the end of the data processing, when the data are stored to memory. Sometimes an adder is integrated in the multiplier unit.
c) ALU: it carries out arithmetic and logical operations.
d) Shifter: it shifts the input value by one or more bits, left or right. A shifter that can shift by several bits in a single operation is called a barrel shifter and is especially useful in the implementation of floating-point add and subtract operations.

Fig. 13: Basic DSP arithmetic processing blocks. The structure shown is that of ADI SHARC.
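
The role of the MAC operation and of the wide accumulator can be illustrated with the FIR filter of Eq. (1). The following is a minimal software sketch, not a description of any particular DSP: it assumes 16-bit signed fractional samples and coefficients with 15 fractional bits, input samples equal to zero before n = 0, and a single rounding step when the accumulator is stored back to 16 bits. The function names are chosen for this illustration only.

```python
def to_q15(value):
    """Quantize a float to a 16-bit signed fraction with 15 fractional bits."""
    scaled = int(round(value * 32768))
    return max(-32768, min(32767, scaled))  # saturate to the 16-bit range


def fir_q15(coeffs, samples):
    """Evaluate y(n) = sum_k a_k * x(n - k) on 16-bit fractional data."""
    output = []
    for n in range(len(samples)):
        acc = 0  # models a wide accumulator: products are summed at full width
        for k, a in enumerate(coeffs):
            if n - k >= 0:
                acc += a * samples[n - k]  # one multiply-and-accumulate
        # Round once, at the end, when the result goes back to 16 bits.
        y = (acc + (1 << 14)) >> 15
        output.append(max(-32768, min(32767, y)))
    return output
```

Because the products are accumulated at full width, the only rounding error in each output sample is the final one, which is the behaviour the wide accumulator registers are meant to provide.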

3.3.2 Instruction pipelining

Instruction pipelining has become an important element to achieve high DSP performance. It consists of dividing the execution of instructions into different stages and executing different stages of successive instructions in parallel. The net result is an increased throughput of the instruction execution. The whole process can be compared to a factory assembly line, which produces cars for instance: more than one car is in the assembly line at the same moment, at different stages of assembly. This provides a higher production than the case where only one car at a time is produced, where many specialized crews are idle waiting for the next car to require their work.

Table 5 shows the basic pipelining stages into which each instruction is divided:

1. Fetch. The DSP calculates the address of the next instruction to execute and retrieves the op-code, i.e., the binary word containing the operands and the operation to be carried out on them.
2. Decode. The op-code is interpreted and sent to the corresponding functional unit. The instruction is interpreted and the operands are retrieved.
3. Execute. The instruction is executed and the results are written onto the registers.

Table 5: The three basic pipelining stages and corresponding actions

Basic pipelining stage    Actions
Fetch                     Generate program fetch address; read op-code
Decode                    Route op-code to functional unit; decode instruction; read operands
Execute                   Execute instruction; write results back to registers

Figure 14 shows the advantage of a pipelined CPU with respect to a non-pipelined CPU, in terms of processing time gain. In a non-pipelined CPU the different instructions are executed serially, while in a pipelined CPU only the same type of stages (e.g. Fetch, Decode and Execute) are serialized and different instructions are executed in parallel. A pipeline is called fully-loaded if all stages are executed at the same time; this corresponds to the maximum possible instruction throughput.

Fig. 14: Instruction execution and processing time gain of a pipelined CPU (plot b) with respect to a non-pipelined one (plot a)

The depth of the pipeline, i.e., the number of stages into which an instruction is divided, can vary from one processor to another. Generally speaking a deeper pipeline allows the processor to execute faster, hence many processors sub-divide pipeline stages into smaller steps, each one executed at each clock cycle. The smaller the step, the faster the processor clock speed can be. An example of deep pipeline is the TI TMS320C6713 DSP, which includes four fetch stages, two decode stages, and up to ten execution stages.

There are drawbacks and limitations to the pipelining technique. One drawback is the hardware and programming complexity required by it, for instance in terms of capabilities needed in the compiler and the scheduler. This is especially true in the case of deep pipelines. A limitation in the effective instruction execution throughput is given by situations that prevent the pipeline from being fully-loaded. These situations include pipeline flushes due to changes in the program flow, such as code branches or interrupts. In this case, the DSP does not know which instructions it should execute next until the branch instruction is executed. Other situations are data hazards, namely when one instruction needs the result of a previous instruction to be executed. Apart from a reduced throughput, these pipeline limitations cause a more difficult prediction of the worst-case scenario. Techniques not described here are available to provide the DSP programmer with a pipeline control; they include time-stationary pipeline control, data-stationary control, and interlocked pipeline.

3.3.3 Parallel architectures

The DSP performance can be increased by an increased parallelism in the execution of instructions. Parallel-enhanced DSP architectures started to appear on the market in the mid-1990s and were based on instruction-level parallelism, data-level parallelism, or a combination of both. These two approaches are called Very Long Instruction Word (VLIW) and Single Instruction Multiple Data (SIMD), respectively, and are discussed below. The reader is referred to Refs. [17] and [18] for more information on the subject.

VLIW architectures are based upon instruction-level parallelism, i.e., many instructions are issued at the same time and are executed in parallel by multiple execution units. As a consequence, DSPs based on this architecture are also called ‘multi-issue’ DSPs. This is an innovative architecture that was first used in the TI TMS320C62xx DSP family. Figure 15 shows an example of the VLIW architecture: eight 32-bit instructions are packed together in a 256-bit wide instruction which is fed to eight separate execution units. Characteristics of VLIW architectures include simple and regular instruction sets. Instruction scheduling is done at compile time and not at run time so as to guarantee a deterministic behaviour. This means that the decision on which instructions have to be executed in parallel is made when the program is compiled, hence the order does not change during the program execution. A run-time scheduling would instead make the scheduling dependent on data and resources availability, which could change for different program executions.

An important advantage of the VLIW architecture is that it can increase the DSP performance for a wide range of algorithms. Additionally, the architecture is potentially scalable, i.e., more execution units could be added to allow a higher number of instructions to be executed in parallel. There are disadvantages as well, such as the high memory use and power consumption required by this architecture. From a programmer’s viewpoint, writing assembly code for VLIW architecture is very complex and the optimization is often better left to the compiler.
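
As a rough illustration of what compile-time scheduling means, the sketch below packs a list of instructions into fixed-width issue groups before the program runs. It is a deliberately simplified model and not the algorithm of any real compiler: the function `pack_bundles` is hypothetical, every instruction is assumed to complete in one cycle, any execution unit is assumed able to run any instruction, and instructions are kept in program order.

```python
def pack_bundles(program, width=8):
    """Group instructions into issue bundles of at most `width` entries.

    `program` is a list of (name, dependencies) pairs in program order,
    where `dependencies` names the earlier instructions whose results
    are needed. An instruction cannot share a bundle with one of its
    dependencies, since their results are not yet available.
    """
    bundles = []
    current = []
    for name, deps in program:
        conflict = any(dep in current for dep in deps)
        if conflict or len(current) == width:
            bundles.append(current)  # close the bundle and start a new one
            current = []
        current.append(name)
    if current:
        bundles.append(current)
    return bundles
```

The result depends only on the program text, not on run-time data, so the same bundles are issued at every execution. This is the source of the deterministic behaviour mentioned above; real VLIW compilers additionally account for instruction latencies and functional-unit types.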

Fig. 15: TI TMS320C6xxx family VLIW architecture

SIMD architectures are based on data-level parallelism, i.e., only one instruction is issued at a time but the same operation specified by the instruction is performed on multiple data sets. Figure 16 shows the example of a DSP based upon the SIMD architecture: two 32-bit input registers provide four 16-bit data inputs. They are processed in parallel by two separate execution units that carry out the same operation. The two 16-bit data outputs are packed into a 32-bit register. A typical SIMD architecture can support multiple data widths and is most effective on algorithms that require the processing of large data chunks. The SIMD operation mode can be switched ON or OFF, for instance in the ADI SHARC DSP. An advantage of the SIMD architecture is that it can be combined with other architectures; an example is the ADI TigerSHARC DSP that comprises both VLIW and SIMD characteristics. SIMD drawbacks include the fact that SIMD architectures are not useful for algorithms that process data serially or that contain tight feedback loops. It is sometimes possible to convert serial algorithms to parallel ones; however, the cost is in reorganization penalties and in a higher program-memory usage, owing to the need to re-arrange the instructions.

Fig. 16: Simplified schematics for ADI SHARC DSP as an example of SIMD architecture

3.4 Numerical fidelity

Arithmetic operations such as additions and multiplications are the heart of DSP systems. It is thus essential that the numerical fidelity be maximized, i.e., that errors due to the finite number of bits used in the number representation and in the arithmetic operations be minimized. DSPs have many ways to obtain this, ranging from the numeric representation to dedicated hardware features.
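
The effect of a finite number of bits on the number representation can be estimated with a few lines of code. The sketch below is only indicative: it assumes round-to-nearest onto a uniform grid of step 2^(-b), where b is the number of fractional bits, and the function names are chosen for this illustration. Under that assumption the representation error of a single value is at most half a step, i.e., 2^(-(b+1)).

```python
def quantize(value, frac_bits):
    """Round `value` to the nearest multiple of 2**-frac_bits."""
    step = 2.0 ** -frac_bits
    return round(value / step) * step


def quantization_error(value, frac_bits):
    """Absolute error introduced by representing `value` on the grid."""
    return abs(quantize(value, frac_bits) - value)
```

For example, representing 0.1 with 15 fractional bits leaves an error of a few parts per million, while 31 fractional bits reduce it by several orders of magnitude; values that are exact multiples of the step, such as 0.5, are represented without error.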
As far as the number representation is concerned, DSPs can be divided into two categories: fixed point and floating point. Fixed-point DSPs perform integer as well as fractional arithmetic, and can support data widths of 16, 24 or 32 bits. A fixed-point format can represent both signed and unsigned integers and fractions. Signed fractional numbers can take values in the [−1.0, 1.0) range and are often indicated as Qx.y, where ‘x’ indicates the number of bits located before the binary point and ‘y’ the number of bits after it. Figure 17(a) shows how 16-bit signed fractional point numbers are coded. Signed fractional numbers with 24-bit and 32-bit data width are coded in an equivalent way as Q1.23 and Q1.31, respectively. They can take values in the same range; however, their resolution is higher than that of the 16-bit implementation. An example of extended precision fixed-point can be found in Ref. [19].

Floating-point DSPs represent numbers with a mantissa and an exponent, nowadays following the IEEE 754 [20] standard shown in Fig. 17(b) for a 32-bit number. The mantissa dictates the number precision and the exponent controls its dynamic range. Numbers are scaled so as to use the full word-length available, hence maximizing the attainable precision. The reader is referred to Ref. [21] for more information on the subject.

Fig. 17: (a): 16-bit signed fractional point, often indicated as Q1.15. (b): IEEE 754 normalized representation of a single precision floating point number.

Floating-point numbers provide a higher dynamic range, which can be essential when dealing with large data sets and with data sets whose range cannot be easily predicted. The dynamic range for a 32-bit number represented as fixed-point and as floating-point is shown in Fig. 18.

Fig. 18: Dynamic range for 32-bit data, represented as 32-bit signed fractional point and IEEE 754 normalized number

In addition to the different number formats available, DSPs provide hardware ways to improve numerical fidelity. One example is represented by the large accumulator registers, used to hold intermediate and final results of arithmetic operations. These registers are several bits (at least four) wider than the normal registers in order to prevent overflow as much as possible during accumulation operations. The extra bits are called guard bits and allow one to retain a higher precision in intermediate computation steps. Flags to indicate that an overflow/underflow has happened are also available. These flags are often connected to interrupts, thus allowing exception-handling routines to be called. Another means DSPs have to improve numerical fidelity is saturated arithmetic. This means that a result exceeding the representable range is clamped to the maximum (or minimum) value that can be represented, so as to avoid wrap-around phenomena.

3.5 Fast-execution control

Here we show two important examples of how DSPs can fast-execute control instructions. The first example is the zero-overhead hardware loop and refers to the program flow control in loops. The second example refers to how DSPs react to interrupts.

Looping is a critical feature in many digital signal processing algorithms. An important DSP feature is the implementation by hardware of looping constructs, referred to as ‘zero-overhead hardware loop’. This allows DSP programmers to initialize loops by setting a counter and defining the loop bounds, without spending any software overhead to update and test loop counters or to branch back to the beginning of the loop.

The capability to service interrupts very quickly and in a deterministic way is an important DSP characteristic. Interrupts are internal (for instance generated by internal timers) or external (brought to the DSP core via pins) events that change the DSP execution flow when they are serviced. The latency is the time elapsed between when the interrupt event is triggered and when the DSP starts to execute the first instruction of the corresponding Interrupt Service Routine (ISR). When an interrupt is received and if the interrupt has a sufficiently-high priority, the DSP must carry out the following actions:
a) stop its current activity;
b) save the information related to the interrupted activity (called context) into the DSP stack;
c) start servicing the interrupt.
The context corresponding to the interrupted activity can be restored when the ISR has been executed and the previous activity is continued.

Table 6: Interrupt dispatchers available on the ADI ADSP21160M DSP. The instruction cycle is 12.5 ns, hence the number of cycles can easily be converted to time.

Interrupt dispatcher                         Cycles before ISR   Cycles after ISR
Normal                                       183                 109
Fast                                         40                  26
Super-fast (with alternate registers set)    34                  10
Final                                        24                  15

More than one interrupt dispatcher is typically available in a DSP; this means that the user can select the amount of context to be saved, knowing that a higher number of saved registers implies a longer context switching time. An interesting feature available in some DSPs, such as the ADI SHARC AD21160 [22], is the presence of two register sets, called ‘primary’ and ‘alternate’, for all the CPU’s key registers. When an interrupt occurs, the alternate register set can be used, thus allowing a very fast context switch. Table 6 shows the four interrupt dispatchers available on the ADSP21160M DSP and their corresponding latency (‘Cycles before ISR’) and context restore time (‘Cycles after ISR’). The ‘Final’ dispatcher is intended for use with user-written assembly functions or C functions that have been compiled using ‘#pragma interrupt’. In particular, this dispatcher relies on the compiler (or assembly routine) to save and restore all appropriate registers.

3.6 DSP core example: TI TMS320C67x

Figure 19 shows TI’s TMS320C6713 DSP [23] core architecture, as an example of modern VLIW architecture implementing many of the characteristics described in Section 3. This is the DSP used in the laboratory companion of the lectures upon which this paper is based. Boxes inside the yellow square belong to the DSP core architecture, which here is considered to include the cache memory as well as the DMA controller. The white boxes are components common to all C6000 devices; grey boxes are additional features on the TMS320C6713 DSP.

Fig. 19: TI TMS320C6713 DSP core architecture. Picture courtesy of TI [23].

The TMS320C6713 DSP is a floating point DSP with VLIW architecture. The internal program memory is structured so that a total of eight instructions can be fetched at every cycle. To give a numerical example, with a clock rate of 225 MHz the C6713 DSP can fetch eight, 32-bit instructions every 4.4 ns. Features of the C6713 include 264 kBytes of internal memory: 8 kB as L1 cache and 256 kB as L2 memory shared between program and data space. The processing of instructions occurs in each of the two data paths (A and B), each of which contains four functional units (.L, .S, .M, .D). An Enhanced DMA (EDMA) controller supports up to 16 EDMA channels. Four of the sixteen channels (channels 8−11) are reserved for EDMA chaining, leaving twelve EDMA channels available to service peripheral devices.

4 DSP peripherals

4.1 Introduction

The available peripherals are an important factor for the DSP choice. Peripherals are here considered as belonging to two categories:
a) interconnect, discussed in Section 4.2;
b) services, such as timers, PLL and power management, discussed in Section 4.3.
DSP developers must in fact carefully evaluate the needs of their system in terms of interconnect and services required, to avoid bottlenecks and reduced system performance.

Modern DSPs often have several peripherals integrated on-chip, such as UARTs, serial, USB and video ports. There are benefits in using embedded peripherals, such as fast performance and reduced power consumption. There are, however, drawbacks, in that embedded peripherals can be less flexible across applications and their unit cost might be higher.

The evolution of DSP-supported peripherals has been terrific over the last 20 years. From the original few parallel and serial ports, DSPs can now support a wide range of peripherals, including those needed by audio/video streaming applications. Often the DSP chip does not have enough pins to allow using all supported peripherals at the same time. To overcome this limitation, the pins are multiplexed, i.e., the DSP developer must select at boot time which peripherals he/she needs to have available. An example of pin multiplexing referred to TI’s TMS320C6713 DSP is given in Section 4.4.

An overview of interconnect and DSP services is given in Sections 4.2 and 4.3, respectively. Hints on different interfacing possibilities to external memories and data converters are provided in Sections 4.5 and 4.6, respectively. Finally, a brief outline of the DSP booting process is given in Section 4.7.

4.2 Interconnect

The amount of supported interconnect and data I/O is huge, so only a few examples are given below, divided per interconnect type.

Serial interfaces
a) Serial Peripheral Interface (SPI): this is an industry-standard synchronous serial link that supports communication with multiple SPI compatible devices. The SPI peripheral is a synchronous, four-wire interface consisting of two data pins, one device select pin, and a gated clock pin. With the two data pins, it allows for full-duplex operation to other SPI compatible devices. An example of DSP fitted with a SPI port is ADI’s Blackfin ADSP-BF533 [24].
b) Multichannel Buffered Serial Ports (McBSP) [25] on TI’s DSPs: this serial interface is based upon the standard serial port found in TMS320C2x and TMS320C5x DSPs.
c) Multichannel Audio Serial Port (McASP) [26] on TI’s DSPs: this is a serial port optimized for the needs of multichannel audio applications. Each McASP includes transmit and receive sections that can operate synchronized as well as completely independent, i.e., with separate master clocks, bit clocks, and data stream formats.

Parallel interfaces
a) ADI’s link ports [27] are parallel interfaces that allow DSP–DSP as well as DSP–peripheral connection. An example of their use for inter-DSP communication to build multi-DSP systems is given in Sub-section 9.3.1.
b) Parallel Peripheral Interface (PPI) [28] on ADI’s Blackfin DSP: this is a multifunction parallel interface, configurable between 8 and 16 bits in width. It supports bidirectional data flow and it includes three synchronization lines and a clock pin for connection to an externally-supplied clock. The PPI can receive data at clock speeds of up to 65 MHz, while transmit rates can approach 60 MHz.

Other interfaces commonly found, for instance in TI DSPs, are Peripheral Component Interconnect (PCI) [29], Inter-Integrated Circuit (I2C) [30], Host-Port Interface (HPI) [31] and General-Purpose Input/Output (GPIO) [32].
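The full-duplex behaviour of the SPI link listed under the serial interfaces can be pictured as two shift registers exchanging one bit per clock over the two data pins. The sketch below is a purely software model of that exchange, assuming 8-bit words sent most-significant bit first. The function name `spi_exchange8` is hypothetical; it does not drive any real peripheral and is not the API of any ADI or TI driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of one 8-clock SPI transfer.
 * On every clock the master shifts its top bit out on one data line
 * while the slave shifts its top bit out on the other, and each side
 * shifts in the bit sent by its partner. After eight clocks the two
 * registers have swapped contents, which is the full-duplex property.
 * Assumptions: 8-bit words, MSB first, a single selected slave. */
void spi_exchange8(uint8_t *master_reg, uint8_t *slave_reg)
{
    assert(master_reg != NULL && slave_reg != NULL);

    for (int clock = 0; clock < 8; ++clock) {
        uint8_t master_out = (uint8_t)((*master_reg >> 7) & 1u);
        uint8_t slave_out  = (uint8_t)((*slave_reg  >> 7) & 1u);

        *master_reg = (uint8_t)((*master_reg << 1) | slave_out);
        *slave_reg  = (uint8_t)((*slave_reg  << 1) | master_out);
    }
}
```

In this model a master holding 0xA5 and a slave holding 0x3C end the transfer holding 0x3C and 0xA5 respectively, so a command and a reply travel in the same eight clocks.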
4.3 Services

System services provide functionality that is common to embedded systems; the on-chip hardware is generally accompanied by an API that allows one to easily interface to them. A few examples of services are given below.
a) Timers: DSPs are typically fitted with one or more general-purpose timers that are used to time or count events, generate interrupts to the CPU, or send synchronization events to a DMA/EDMA controller. More information on timers for TI’s TMS320C6000 DSPs can be found in Ref. [33].
b) PLL controller: it generates clock pulses for the DSP core and the peripherals from internal or external clock signals. More information on PLL controllers for TI’s TMS320C6000 DSPs can be found in Ref. [34].
c) Power Management: the power-down logic allows the reduction of clocking so as to reduce power consumption. In fact, most of the operating power of CMOS logic dissipates during circuit switching from one logic state to the other. Significant power can be saved by preventing some of these level switches. More information on power management logic of TI’s TMS320C6000 DSPs can be found in Ref. [35].
d) Boot configuration: a variety of boot configurations are often available in DSPs. They are user-programmable and determine what actions the DSP performs after it has been reset to prepare for the initialization. These actions include loading the DSP code from external memory or from an external host. Some boot modes are outlined in Section 4.7. More information on boot modes, device configuration, and available boot processes for TI TMS320C62x/67x is available in Ref. [36].
e) JTAG: this interface implements the IEEE standard 1149.1 and allows emulation and debugging. A detailed description of its use can be found in Section 7.2. Figure 20 shows a typical JTAG connector and corresponding signals [37].

Fig. 20: Fourteen-pin JTAG header and corresponding signals. Picture courtesy of TI [37].
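The remark in item c) that CMOS logic dissipates most of its operating power while switching is commonly summarized by a first-order relation for the dynamic power. The expression below is the usual textbook approximation, not a formula taken from the TI documentation; the symbols are introduced here for illustration only: α is the fraction of nodes switching per clock cycle, C_L the switched capacitance, V_DD the supply voltage and f_clk the clock frequency.

```latex
P_{\mathrm{dyn}} \approx \alpha \, C_L \, V_{DD}^{2} \, f_{\mathrm{clk}}
```

Slowing or gating the clock of an unused block lowers f_clk (or α) for that block, which is why power-down modes act on the clocking.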
4.4 TI C6713 DSP example

The peripherals available on TI’s TMS320C6713 DSP are shown in Fig. 21 as boxes encircled by a yellow shape. The white boxes are components common to all C6000 devices, while grey boxes are additional features on the TMS320C6713 DSP. Many peripherals are available on this DSP; however, there are pins that are shared by more than one peripheral and are internally multiplexed. Most of these pins are configured by software via a configuration register, hence they can be programmed to switch functionality at any time. Others (such as the HPI pins) are configured by external pullup/pulldown resistors at DSP chip reset; as a consequence, only one peripheral has primary control of the function of these pins after reset.

Fig. 21: TI TMS320C6713 DSP available peripherals. Picture courtesy of TI [23].

4.5 Memory interfacing

DSPs often have to interface with external memory, typically shared with host processors or with other DSPs. The two main mechanisms available to implement the memory interfacing are to use hardware interfaces already existing on the DSP chip or to provide external hardware that carries out the memory interfacing. These two methods are briefly mentioned below.

Hardware interfaces are often available on TI as well as on ADI DSPs. An example is TI External Memory Interface (EMIF) [38], which is a glueless interface to memories such as SRAM, EPROM, Flash, Synchronous Burst SRAM (SBSRAM) and Synchronous DRAM (SDRAM). On the TMS320C6713 DSP, for instance, the EMIF provides 512 Mbytes of addressable external memory space. Additionally, the EMIF supports memory widths of 8 bits, 16 bits and 32 bits, including read/write of both big- and little-endian devices.

When no dedicated on-chip hardware is available, the most common solution for interfacing a DSP to an external memory is to add external hardware between memory and DSP, as shown in Fig. 22. Typically this is done by using a CPLD or an FPGA which implements address decoding and access arbitration. Care must be taken when programming the access priority and/or interleaved memory access in the CPLD/FPGA. This is essential to preserve the data integrity. Synchronous mechanisms should be preferred over asynchronous ones to carry out the data interfacing.
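The two jobs assigned above to the external CPLD or FPGA, address decoding and access arbitration, can be sketched as a behavioural model in C. Everything specific in it is an assumption made for illustration: the address ranges, the device names and the rule that the DSP wins when both sides request at once are hypothetical, and a real design would express this logic in a hardware description language rather than in C.

```c
#include <assert.h>
#include <stdint.h>

/* Devices behind the glue logic (hypothetical). */
typedef enum { SEL_NONE, SEL_SRAM, SEL_FLASH, SEL_SHARED } chip_select;

/* Address decoding: map a bus address to one chip-select line.
 * The ranges below are an invented memory map. */
chip_select decode_address(uint32_t addr)
{
    if (addr >= 0x80000000u && addr <= 0x8003FFFFu) return SEL_SRAM;
    if (addr >= 0x90000000u && addr <= 0x9007FFFFu) return SEL_FLASH;
    if (addr >= 0xA0000000u && addr <= 0xA000FFFFu) return SEL_SHARED;
    return SEL_NONE;
}

/* Bus owners competing for the shared memory. */
typedef enum { GRANT_NONE, GRANT_DSP, GRANT_HOST } bus_grant;

/* Access arbitration, evaluated once per clock.
 * The current owner keeps the bus until it drops its request, so an
 * access is never interrupted half-way (this protects data integrity).
 * When the bus is free and both sides request it, the DSP wins
 * (assumed priority). */
bus_grant arbitrate(int dsp_request, int host_request, bus_grant current)
{
    if (current == GRANT_DSP && dsp_request) return GRANT_DSP;
    if (current == GRANT_HOST && host_request) return GRANT_HOST;
    if (dsp_request) return GRANT_DSP;
    if (host_request) return GRANT_HOST;
    return GRANT_NONE;
}
```

Evaluating the grant once per clock, instead of reacting to request edges, mirrors the preference for synchronous mechanisms stated in the text.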
Fig. 22: Generic DSP–external memory interfacing scheme. Very often the h/w interface consists of a CPLD or an FPGA.

4.6 Data converter interfacing

DSPs provide a variety of methods to interface with data converters such as ADCs. On-chip peripherals are a very convenient data transfer mechanism, since data converters are typically much slower than the DSPs they are interfaced with, hence asking the DSP core to directly retrieve data from the converters is a waste of valuable processing time.

Serial interfaces are often available in TI’s DSPs: peripherals such as McBSP and McASP plus the powerful DMA allow an easy interface to many data converter types [39, 40]. Another possible solution for TI DSPs is to use the EMIF in asynchronous mode together with the DMA.

In addition to serial interfaces, ADI Blackfin DSP provides a parallel interface, namely the PPI interface mentioned in Section 4.2, as a convenient way to interact with many converters. This interface typically allows higher sampling rates than the serial interfaces.

A general solution for implementing the DSP–data converter interface is to use an FPGA between DSP and converter, so as to re-buffer the data. An example of this hardware implementation for ADI Blackfin DSPs used in wireless portable terminals is given in Ref. [41]. Additional pre-processing, such as filtering or down-conversion, can also be carried out in the FPGA. This is the case for instance in CERN’s LEIR LLRF system [42], where converters such as ADCs and DACs are hosted on daughter cards. Powerful FPGAs located on the same daughter cards carry out pre-processing and diagnostics actions under full DSP control.

Finally, mixed-signal DSPs, i.e., DSPs with embedded ADCs and/or DACs, are also available. An example of mixed-signal DSP is ADI’s ADSP-21990, containing a pipeline flash converter with eight independent analog inputs and sampling frequency of up to 20 MHz.

4.7 DSP booting

The actions executed by the DSP immediately after a power-up or a reset are called DSP boot and are defined by a certain number of configurable input pins. This paragraph will focus on how the executable file(s) is uploaded to the DSP after a power-up or reset. Two methods are available, which typically correspond to differently built executables. More information on the code building process and on the many file extensions can be found in Section 6.4.

The first method is to use the JTAG connector to directly upload the executable to the DSP. Upon a DSP power-down the code will typically not be retained in the DSP and another code upload will be necessary. This method is used during the system debugging phase, when additional useful information can be gathered via the JTAG.

On operational systems the DSP loads the executable code without a JTAG cable. Many methods are available for doing this, depending on the DSP family and manufacturer; some general ways are described below.
a) No-boot. The DSP fetches instructions directly from a pre-determined memory address, corresponding to EPROM or Flash memory, and executes them. On SHARC DSPs, for instance, the pre-defined start address is typically 0x800004.
b) Host-boot. The DSP is stalled until the host configures the DSP memory. For TI TMS320C6xxx DSPs, for instance, this is done via the HPI interface. When all necessary memory is initialized, the host processor takes the DSP out of the reset state by writing in a HPI register.
c) ROM-boot. A boot kernel is uploaded from ROM to DSP at boot time and starts executing itself. The kernel copies data from an external ROM to the DSP by using the DMA controller and overwrites itself with the last DMA transfer. After the transfer is completed the DSP begins its program execution.

Figure 23 visualizes the TI DSP process of booting from ROM memory: the program (shown in green) has been moved from ROM to L2 and L1 Program (L1P) cache via EMIF and DMA.

Fig. 23: Example of TI TMS320C6x DSP booting from ROM memory. The picture is courtesy of TI [43].

5 Real-time design flow: introduction

Figure 24 shows a time-ordered view of the various activities or phases that a real-time system developer may be required to carry out during a new system development. These activities will be treated in this document in a didactic rather than in a time-related order, to allow even the inexperienced reader to build up the knowledge needed at each step. It should be underlined that the real-time design flow may not be totally forward-directed, and at each step the developer may have to go back to a previous phase to make modifications or carry out additional tests.

Fig. 24: Activities typically required to develop a new, DSP-based system

The ‘system design’ phase may include both hardware and software design. For hardware design, the developer must make choices such as the DSP type to use, the hardware architecture/interfaces, and so on. For software design, choices such as the code structure, the data flow and data exchange interfaces must be made. This phase is treated in Section 9.

The ‘software development’ phase includes creating the DSP project and writing the actual DSP code. Basic and essential information for this phase is given in Section 6.

The ‘debug’ phase is a very critical one, where the developer must verify that the code executes what it was meant to. Some debugging techniques as well as different methodologies available (such as simulation and emulation) are described in Section 7.

The ‘analysis and optimization’ phase allows the developer to optimize the system for different goals, such as speed, memory, input/output bandwidth, or power consumption. Analysis and optimization tools are described in Section 8, together with some optimization guidelines.

Finally, the ‘system integration’ is the essential phase where the system is integrated within the existing infrastructure and is therefore made fully operational. It is not possible to give precise details on this phase owing to the many existing control infrastructures. However, general guidelines and good practices are discussed in Section 10.

6 Real-time design flow: software development

DSPs are programmed by software via a cross-compilation. This means that the executable is created on a platform (such as a Windows- or a SUN-based machine) different from the one that it runs on, i.e., the DSP itself. One reason for this is that DSPs have limited and dedicated resources, hence it would not be convenient or even possible to run a file system with a user-friendly development environment.

The choice of programming languages is vast, including native assembly language as well as high-level languages such as C, C++, C extensions and dialects, Ada and so on. High-level software tools such as MATLAB and those from National Instruments allow one to automatically generate code files from graphical interfaces, thus providing rapid prototyping methods.

The code-building tools are very often provided by the DSP manufacturers themselves. Compilers and Integrated Development Environments (IDEs) are also available from other sources, such as Green Hills Software. The trend is now towards more powerful and user-friendly tools, capable of taming and using in the best possible way the underlying hardware and software complexity.

6.1 Development set-up and environment

DSP executables are developed by using Integrated Development Environments (IDEs) provided by DSP manufacturers; they integrate many functions, such as editing, debugging, project files management, and profiling. Very often the licences are bought on a ‘per-project’ basis, even if ADI also provides floating (i.e., networked) licences. The development environments for TI and ADI DSPs are called ‘Code Composer Studio’ and ‘VisualDSP++’, respectively; they provide very similar functionalities. It should be underlined that TI has recently made available free of charge the compiler, assembler, optimizer and linker to non-commercial users. However, neither the IDE nor a debugger was included, thus the developer must still use the proprietary tools.

Figure 25 gives an example of a typical Code Composer screen. On the left-hand side there is the list of all files included in the software project. At the centre of the screen two windows show the code, as a C file (process.c) and as assembly code (Dis-assembly window). A breakpoint has been set and the execution is stopped there. Below the code windows, two memory windows are also visible, detailing the data present at addresses 0x80000000 and following, and at addresses 0x40000030 and following. Data at address 0x80000002 is of a different colour because its value changed recently. At the bottom of the IDE screen the following items are displayed: a) the Compile/Link window, which details the results from the last code compilation; b) the Watch window, which displays the value assumed by two C-language variables and c) the Register window, which details the contents of all DSP registers. On the right-hand side there are three graphs: the yellow ones show memory regions, while the green one shows the Fast Fourier Transform of data stored in memory as calculated by the
IDE. The reader can find more details about Code Composer Studio and its capabilities in Refs. [44]–[46].

Fig. 25: Screenshot from Code Composer, i.e., the TI DSP IDE. The picture was taken in 1998 from the development of CERN’s AD Schottky system.

Figure 26 shows a typical DSP-based system set-up. On the left-hand side the DSP IDE runs on a PC, which is connected to the DSP via a JTAG emulator and pod. This allows one to edit the code, compile it, download it to the hardware and retrieve debug information. On the right-hand side the system exploitation is shown whereby the DSP runs its program and a PowerPC board, running LynxOS and acting as master VME, controls the DSP actions, downloads the control parameters, and retrieves the resulting data. The example shown is that of CERN’s AD Schottky measurement system [47].

Fig. 26: Typical code development (on the left-hand side) and system exploitation (on the right-hand side) set-ups

6.2 Languages: assembly, C, C++, graphical

The choice of the language(s) to be used for the DSP development is very important and depends mainly on the selected DSP, as different DSPs may support different languages. Often a DSP system will include both assembly and high-level languages; the language choice or the chosen balance between the languages depends also on the required processor workload, i.e., on how much the code should be optimized to satisfy the requirements.

The language choice is nowadays much larger than in the past, mainly thanks to the improvements of compilers. Additionally, the increased complexity of DSP hardware (see Section 3), such as deep pipelining, makes the hand-optimization much more difficult. The main language choices include:
a) assembly language;
b) high-level languages such as C, C dialects/extensions and C++;
c) graphical languages such as MATLAB.
These three choices are discussed below.

6.2.1 Assembly language

The assembly language is very close to the hardware, as it explicitly works with registers and it requires a detailed knowledge of the inner DSP architecture. Writing assembly code typically takes longer than writing in a high-level language; additionally, it is often more difficult to understand other people’s assembly programs than to understand programs written in high-level languages. The assembly grammar/style and the available instruction set/peripherals depend not only on the DSP manufacturer, but also on the DSP family and on the targeted DSP. As a consequence, it might be difficult or even impossible to port assembly programs from one DSP to another. For instance, for DSPs belonging to the TI C6xxx family there is about an 85% assembly code compatibility, i.e., when going from a C62x to a C64x DSP there are no issues but if moving from a C64x to a C62x one might have to introduce some changes in the code owing to the different instruction set.

DSP applications typically have very demanding processing requirements. The need to obtain the maximum processing performance has often led DSP programmers to use assembly programming extensively. Nowadays the improvements in code compilers and the increasing difficulty in hand-optimizing assembly code have prompted DSP developers to use high-level languages more often. However, in some DSPs there are still features available only in assembly, such as the super-fast interrupt dispatcher for ADI’s ADSP21160M DSP shown in Table 6. Very often, the bulk of the DSP code is written in high-level languages and the parts needing a better performance may be written in assembly.

Different manufacturers adopt different assembly styles, which have also evolved over the years. Table 7 shows a comparison between a traditional assembly style, adopted for instance by TI C40 DSPs, and the algebraic assembly, adopted by ADI SHARC DSPs.

Table 7: Comparison of assembly code styles. The traditional assembly style was adopted for instance by TI TMS320C4x DSPs, while the algebraic assembly is used by ADI.
Figure 27 gives an example of how one line of C code is converted to the corresponding assembly code for the TI C6713 DSP. The upper window shows part of the ‘SIN_to_output_RTDX.c’ file, which was included in the DSP laboratory companion of the lectures described in this document; the lower ‘Disassembly’ window shows the resulting assembly code.

Fig. 27: C and assembly language examples for the TI C6713 DSP. Window (a): C source code. Window (b): assembly code resulting from the first C-code line in window (a).

6.2.2 High-level languages: C

The C language was developed in the early 1970s; three main standards exist, referred to as ANSI, ISO, and C99 respectively. There are many reasons why it is convenient to use the C language in DSP-based systems. The C language is very popular and known by engineers and software developers alike; it is typically easier to understand and faster to develop than assembly code. It supports features useful for embedded systems such as control structures and low-level bit manipulations. All DSPs are provided with a C compiler, hence it may be possible to port the C code from one DSP to another.

There are, however, drawbacks to the use of the standard C language in DSP-based systems. First, the executable resulting from a C-language source code is typically slower than that derived from optimized assembly code and has a larger size. The ANSI/ISO C language does not have support for typical DSP hardware features such as circular buffers or non-flat memory spaces. Additionally, the address at which data must be aligned can vary between different DSP architectures: on some DSPs a 4-byte integer can start at any address, but on other DSPs it could start for instance at even addresses only. As a consequence, the data alignment obtained with ANSI/ISO C compilers may be incompatible with the data alignment required by the DSP, thus leading to deadly bus errors. In the standard C language there is no native support for fixed-point fractional variables, a serious drawback for many DSPs and signal processing algorithms. Finally, the standard C compiler data-type sizes are not standardized and may not fit the DSP native data size, leading for instance to the replacement of fast hardware implementations with slower software emulations. For instance, 64-bit double operations are available in ADI’s TigerSHARC as software emulations only; hence the declaration of variables as double and not as float will result in slower execution. Table 8 shows how data-type sizes can vary for different DSPs.

Table 8: Examples of data-type size for different DSPs

Table 9 shows the data-type sizes and number format for the TI C6713 DSP. The 2’s complement and binary formats are used for signed and unsigned numbers, respectively.

Table 9: Data-type sizes and number format for the TI C6713 DSP

There are two main approaches to adapting the C language to specific DSP hardware and to the needs of signal processing applications. The first approach is the definition of ‘intrinsic’ functions, i.e., of functions that map directly to optimized DSP instructions. Table 10 shows some examples of intrinsic functions available in TI C6713 DSPs. The second approach is to ‘extend’ the C language so as to include specialized data types and constructs. Of course, the drawback of the latter approach is a reduced portability of the resulting C code.

Table 10: TI C6713 intrinsic functions – some examples.

6.2.3 High-level languages: C++

The C++ programming language supports object-oriented programming and is the language of choice for many business computer applications. C++ compilers are often available for DSPs; some advantages of using it are the ability to provide a higher abstraction layer and the upwards compatibility with the C language. There are, however, several disadvantages, for instance the increased memory requirements due to the more general constructs. Additionally, many application programs and libraries rely on functions such as malloc() and free(), which need a heap.

While the way to adapt the C language to DSPs is to add features, the C++ language is adapted by trimming its features. C++ characteristics typically removed are multiple inheritance and exception handling; the resulting code is more efficient and the executable is smaller.

6.2.4 Graphical languages

A trend which has developed over the last five to ten years is to use graphical programming to generate DSP code. Examples of programs and tools aimed at this are MATLAB, Hypersignal RIDE (now acquired by National Instruments) and the LabVIEW DSP Module. These methodologies generate DSP executables that often are not highly optimized, and are therefore not suitable for the implementation of demanding DSP-based systems. However, they allow one to quickly move from the design to the implementation phase, thus providing a rapid prototyping methodology.

Fig. 28: MATLAB graphical programming used in the DSP laboratory companion of these notes. The digital filter block can easily be set up by using a user-friendly set-up GUI.

As an example, MATLAB provides tools such as Simulink, Real-Time Workshop, Real-Time Workshop Embedded Coder, Embedded Target for TI C6xxx DSPs and Link for Code Composer that allow generating embedded code for TI DSPs and downloading it directly into a DSP evaluation board. These tools provide interfaces for the DSP peripherals, too. The DSP laboratory companion of these notes was based upon the TI C6713 DSK and MATLAB tools. Figure 28 shows the MATLAB graphical program that constituted one of the laboratory exercises. MATLAB allows one not only to interface immediately with the on-board CODEC by using the ADC and DAC blocks, but also to set up through a user-friendly GUI the digital filter to be implemented.

6.3 Real-time operating system

A Real-Time Operating System (RTOS) is a program that has real-time capabilities, is downloaded to the DSP at boot time, and manages all DSP programs, typically referred to as tasks. The RTOS interfaces tasks with peripherals such as DMA, I/O and memory, via an Application Program Interface (API), as shown in Fig. 29.

Fig. 29: Embedded DSP software components

A RTOS is typically task-based and supports multiple tasks (often referred to as threads) by time-sharing, i.e., by multiplexing the processor time over the set of active tasks. Each task has a priority associated to it and the RTOS schedules which task should run depending on the priority. Very often this is done in a pre-emptive way, meaning that when a high-priority task becomes ready for execution, it pre-empts the execution of a lower-priority task, without having to wait for its turn in the regular re-scheduling. Finally, RTOS have a small memory footprint, so as not to have too negative an impact on the DSP executable size.

There are many advantages when using a RTOS to develop a DSP-based system. For instance, the API and library shown in Fig. 29 provide a device abstraction level between DSP hardware features and task implementation, thus allowing a DSP developer to focus on the task rather than the hardware interface’s design and coding. The DSP developer may have to just call different interfacing functions in case the code should be ported to a different DSP, hence easing code portability. A RTOS manages the tasks’ execution, hence the developer can cleanly structure the code, define appropriate priority levels for each task, and ensure that their execution meets critical real-time deadlines. System debug and optimization can be improved, and memory protection can often be provided.

There are, however, drawbacks to the use of RTOS. As an example, a RTOS uses DSP resources, such as processing time and DSP timers, for its own functioning. Additionally, the RTOS turnover is typically quite high and royalties are often required from developers.

Many RTOS are available at any time, typically targeted to a precise DSP family or processor. Examples are TALON RTOS from Blackhawk, targeted at TI DSPs, and INTEGRITY RTOS from Green Hills Software or NUCLEUS RTOS from Accelerated Technology, targeted at ADI Blackfin DSPs. It is worth mentioning Linux-based OS, such as RT-Linux, RTAI and uLinux. Both RT-Linux and RTAI use a small real-time kernel that runs Linux as a real-time task with lower priority. The last RTOS listed above, uLinux, is a soft real-time OS adapted to ADI Blackfin DSPs. uLinux cannot always guarantee RTOS capabilities such as a deterministic interrupt latency; however, it can typically satisfy the needs of commercial products, where time constraints are often on the millisecond order as dictated by the ability of the user to recognise glitches in audio and video signals.
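The priority-based, pre-emptive scheduling described in this section reduces, at its core, to one decision: among the tasks that are ready, run the one with the highest priority. The sketch below models only that decision. The `task_t` structure, the function name `pick_next_task` and the convention that a larger number means a more urgent task are assumptions made for illustration; they do not reproduce the API of any of the RTOS products named here.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal task descriptor (hypothetical). */
typedef struct {
    int priority;   /* larger value = more urgent (assumed convention) */
    int ready;      /* non-zero when the task is able to run */
} task_t;

/* Return the index of the ready task with the highest priority,
 * or -1 when no task is ready. Between ready tasks of equal priority
 * the one with the lowest index is chosen. */
int pick_next_task(const task_t *tasks, size_t count)
{
    assert(tasks != NULL || count == 0);

    int best = -1;
    for (size_t i = 0; i < count; ++i) {
        if (!tasks[i].ready) {
            continue;
        }
        if (best < 0 || tasks[i].priority > tasks[best].priority) {
            best = (int)i;
        }
    }
    return best;
}
```

A pre-emptive kernel would re-evaluate this choice whenever a task becomes ready, for instance from an interrupt, and switch context as soon as the result differs from the running task; a non-pre-emptive one would re-evaluate it only at its regular re-scheduling points.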
Other RTOS worth mentioning are those provided and maintained by DSP manufacturers. Both TI and ADI provide royalty-free RTOS with similar characteristics, such as a small memory footprint and support for multiple tasks and multiple priority levels. They are called DSP/BIOS [48] for TI and VisualDSP++ Kernel (VDK) for ADI, and can optionally be included in the DSP code. In particular, TI DSP/BIOS provides thirty priority levels and four classes of execution threads. The thread classes, listed in order of decreasing priority, are Hardware Interrupts (HWI), Software Interrupts (SWI), Tasks (TSK) and Background (IDL). Figure 30 shows how the processing time is shared between different threads in TI DSP/BIOS. In the vertical scale the different threads are ordered by priority, the higher up having more priority; in the horizontal scale the time is shown. Software interrupts can be pre-empted by a higher-priority software interrupt or by a hardware interrupt. Same-level interrupts are executed in a first-come, first-served way. Tasks are capable of suspension (see Task TSK2 in Fig. 30) as well as of pre-emption.

Fig. 30: DSP/BIOS prioritized thread execution example. Image courtesy of Texas Instruments [48].

6.4 Code-building process

The DSP code-building process relies on a set of software development tools, typically provided by DSP manufacturers.

Fig. 31: Main elements of the code building process. Typical file extensions for ADI and TI DSPs are shown at the bottom of the picture.
Figure 31 shows the main elements and tools needed for the code-building process. Source files are converted to object files by the compiler and the assembler. Archiver tools allow the creation of libraries from object files; these libraries can then be linked to object files to create an executable. The executable can be directly downloaded from the IDE to the target DSP via a JTAG interface; as an alternative, the executable can be converted to a special form and loaded to a memory external to the DSP, from which the DSP itself will boot. The first approach is typically used during the DSP development phase, while the second approach is more convenient during system exploitation. Finally, the file extensions used at the different code-building process steps for ADI and TI DSPs are shown at the bottom of Fig. 31.

Three tools, namely compiler, assembler, and linker, are used to generate executable code from C/C++ or assembly source code. Figure 32 shows their use in the code-building process on TI DSPs. The tools' main characteristics are summarized in Sub-sections 6.4.1 to 6.4.3.

Fig. 32: Generic code-building process: (a) compiler; (b) assembler; (c) linker. The picture is courtesy of TI [49].

6.4.1 C/C++ compiler for TI C6xxx DSPs [49]

The C/C++ compiler generates C6xxx assembler code (.asm extension) from C, C++ or linear assembly source files. The compiler can perform various levels of optimization: high-level optimization is carried out by the optimizer, while low-level, target-specific optimization occurs in the code generator. Finally, the compiler includes a real-time library which is non-target-specific.

6.4.2 Assembler for TI 'C6xxx DSPs [50]

The assembler generates machine language object files from assembly files; the object file format is the Common Object File Format (COFF). The assembler supports macros both as inline functions and
taken from a library; it also allows segmenting the code into sections, a section being the smallest unit of an object file. The COFF basic sections are
a) .text for the executable code;
b) .data for the initialized data;
c) .bss for the un-initialized variables.

6.4.3 Linker for TI C6xxx DSPs [50]

The linker generates executable modules from COFF files as input. It resolves undefined external references and assigns the final addresses to symbols and to the various sections. A DSP system typically includes many types of memory and it is often up to the programmer to place the most critical program code and data into the on-chip memory. The linker allows allocating sections separately in different memory regions, so as to guarantee an efficient memory access. An example of this is shown in Fig. 33.

Fig. 33: Example of sections allocation into different types of target memory

The linker also allows one to clearly implement a memory map shared between DSP and host processor; this is essential for instance to exchange data between them.

7 Real-time design flow: debugging

The debugging phase is the most critical and least predictable phase in the real-time design flow, especially for large systems. The debugging capabilities of the development environment tools can make the difference between creating a successful system and spiralling into an endless search for elusive bugs. The starting point of this phase is an executable code, i.e., a code without compilation and linker errors; the goal is to ascertain that the code behaves as expected. The debugging tools and techniques have a strong impact on the amount of time and effort needed to validate a DSP code.

There are many types of bugs: they can be repeatable or intermittent, the latter being much tougher to track down than the former. Bugs can be due to the code implementation, such as logical errors in the source code, or can derive from external problems, i.e., hardware misbehaviours.

The approaches and the tools to debug a DSP code include simulation, emulation, and real-time debugging techniques. Simulation tools allow running the DSP code on a software simulator fitted with full visibility into DSP internal registers. Emulation tools embed debug components into the target to allow an information flow between target and host computer. Real-time debugging techniques allow a real-time data exchange between host and target without stopping the DSP. These techniques are described in detail in Sections 7.1 to 7.3.

Fig. 34: Debug steps and their suggested sequencing. The debug tools suited to different steps are also shown.

The developer should not attempt to debug the DSP code as a whole, unless the code itself is relatively short and simple. It is instead recommended to debug the code in several steps: Fig. 34 shows an example of steps and of their sequencing, together with the appropriate debug tools and techniques. First, single tasks such as functions and routines should be validated; this step can be carried out via simulation only. Second, the behaviour of sub-systems or specific parts of the code can be tested with respect to external events, such as ISR triggering. This part can be carried out with the help of traditional emulation techniques. Third, the behaviour of many tasks can be validated with respect to real-time constraints, such as the proper frequency of ISR triggering. Once all system components have been validated, the whole system can be tested. These last two steps profit particularly from real-time debugging techniques.

7.1 Simulation

DSP software simulators have been available for more than fifteen years. They can simulate CPU instruction sets as well as peripherals and interrupts, thus allowing DSP code validation at a reduced cost and even before the hardware the code should run on is available. Simulators provide a high visibility into the simulated target, in that the user can execute the code step by step and look at the intermediate values taken by internal DSP registers. Large amounts of data can be collected and analysed; resource usage can be evaluated and used for an optimized hardware design. Simulators are highly repeatable, since the same algorithm can be run in exactly the same way over and over. The reader should note that this kind of repeatability is difficult to obtain with other techniques, such as emulation, as external events (for instance interrupts) are almost impossible to repeat precisely with hardware. Simulators may also allow measurement of the code execution time, with limitations due to the type of simulator chosen. A useful feature available with the TI C5x and C6x simulators is the 'rewind' [51], which
allows viewing the past history of the application being executed on the simulator.

The main limitation common to DSP simulators is their execution speed, several orders of magnitude slower than the target they simulate; in particular, the more accurate the modelling of the DSP chip and corresponding peripherals, the slower the simulation. DSP tool vendors have overcome this problem by providing different simulators for the same DSP, each with a different level of chip and peripherals modelling. Figure 35 shows some simulators available for TI DSPs. The reader should notice that TI provides up to three simulators for each DSP, namely:
a) CPU Cycle Accurate Simulator: This simulator models the instruction set, timers, and external interrupts, allowing the debugging and optimization of the program for code size and CPU cycles.
b) Device Functional Simulator: This simulator not only models instruction set, timers, and external interrupts, but also allows features such as DMA, Interrupt Selector, caches and McBSP to be programmed and used. However, the true cycles of a DMA data transfer are not simulated.
c) Device Cycle Accurate Simulator: This simulator models all peripherals and caches in a cycle-accurate manner, thus allowing the user to measure the total device and stall cycles used by the application.

More information on TI simulators can be found in Ref. [52].

Fig. 35: Example of DSP simulators available with TI's Code Composer Studio development environment

7.2 Emulation

The integration of processor, memory, and peripherals in a single silicon chip is commonly referred to as System-On-a-Chip (SOC). This approach allows reducing the physical distance between components, hence devices become smaller in size, run faster, cost less to manufacture, and are typically more reliable. From a DSP code developer's viewpoint, the main disadvantage of this approach is the lack of access to embedded signals, often referred to as vanishing visibility. In fact,
many chip packages (e.g., ball grid array) do not allow probing the chip pins; additionally, internal chip busses are often not even available at the chip pins. Emulation techniques [53] restore the visibility needed for code debugging by embedding debug components into the chip itself. There are three main kinds of emulation, namely:
a) Monitor-based emulation: A supervisor program (called monitor) runs on the DSP and uses one of the processor's input–output interfaces to communicate with the debugger program running on the host. The debugging capabilities of this approach are more limited than those provided by the two other approaches; additionally, the monitor presence changes the state of the processor, for instance regarding the instruction pipeline. The advantage is that it does not require emulation hardware, hence its cost is lower.
b) Pod-based In-Circuit Emulation (ICE): The target processor is replaced by a device that acts like the original device, but is provided with additional pins to make accessible and visible internal structures such as internal busses. This emulation approach has the advantage of providing real-time traces of the program execution. However, replacing the target processor with a different and more complex device may create electrical loading problems. Additionally, this solution is quite costly, the hardware is different from the commercialized product, and the approach becomes quite difficult to implement at high processor speed.
c) Scan-based emulation: Dedicated interfaces and debugging logic are incorporated into commercially-available DSP chips. This on-chip logic is responsible for monitoring the chip's real-time operations, for stopping the processor when for instance a breakpoint is reached, and for passing debugging information to the host computer. An emulation controller controls the flow of information to/from the target and can be located either on the DSP board or on an external pod. Many types of target–host interface exist. On the DSP board one can typically find a JTAG (IEEE standard 1149.1) connector. On the host computer, parallel or USB ports are often available.

The scan-based emulation technique has been widely preferred over the other two since the late 1980s and is nowadays available on the vast majority of DSPs. Figure 36 shows the TI XDS560 emulator, composed of a PC card, a cable with JTAG interface to the target, and an emulation controller pod. Many emulators are available on the market, with different interfaces and characteristics. As an example, it is worth mentioning Spectrum Digital's XDS510 USB galvanic JTAG emulator, which provides voltage isolation.

Fig. 36: TI XDS560 emulator, composed of a card to install on the host computer (PCI interface), a JTAG cable and an emulation controller pod

Capabilities of scan-based emulators include source-level debugging, i.e., the possibility to see the assembly instructions being executed and to access variables and memory locations either by name or by address. Capabilities such as writing to the standard output are available. As an example, the printf() function allows printing DSP information on the debugger GUI; the reader should, however, be aware that this operation can be extremely time-consuming, and optimized functions (such as LOG_printf() for TI DSPs) should be preferred.

Another common capability supported by emulation technology is the breakpoint. A breakpoint freezes the DSP and allows the developer to examine DSP registers, to plot the content of memory regions, and to dump data to files. Two main forms of breakpoint exist, namely software and hardware. A software breakpoint replaces the instruction at the breakpoint location with one creating an exception condition that transfers the DSP control to the emulation controller. A hardware breakpoint is implemented by using custom hardware on the target device. The hardware logic can for instance monitor a set of addresses on the DSP and stop the DSP code execution when a code fetch is performed at a specific location. Breakpoints can be triggered also by a combination of addresses, data, and system status. This allows DSP developers to analyse the system when for instance it hangs, i.e., when the DSP program counter branches into an invalid memory address. Intermittent bugs can also be tracked down.

It is important to underline that the debugging capabilities provided by emulators allow mostly 'stop-mode debugging', in that the DSP is halted and information is sent to the host computer at that moment. This debugging technique is invasive and allows the developer to get isolated, although very useful, snapshots of the halted application. To improve the situation, DSP tool vendors have developed a more advanced debugging technology that allows real-time data exchange between target and host. This technique is described next.

7.3 Real-time techniques

Over the last ten years, DSP vendors have developed techniques for a real-time data exchange between target and host without stopping the DSP and with minimal interference on the DSP run. This provides a continuous visibility into the way the target operates. Additionally, it allows the simulation of data input to the target.

ADI's real-time communication technology is called Background Telemetry Channel (BTC) [54]. This is based upon a shared group of registers accessible by the DSP and by the host for reading and writing. It is currently supported on Blackfin and ADSP-219s DSPs only.

TI's real-time communication technology is called Real Time Data eXchange (RTDX) [55, 56]. Its main software and hardware components are shown in Fig. 37. A collection of channels, through which data is exchanged, is created between target and host. These channels are unidirectional and data can be sent across them asynchronously. TI provides two libraries, the RTDX target library and the RTDX host library, that have to be linked to target and host applications, respectively. As an example, the target application sends data to the host by calling functions in the RTDX target library. These functions buffer the data to be sent and then give the program flow control back to the calling program; after this, the RTDX target library transmits the buffered data to the host without interfering with the target application. RTDX is also supported when running inside a DSP simulator; to that end, the DSP developer should link the target application with the RTDX simulator target library corresponding to the chosen target.

On the host side, data can be visualized and treated from applications interfacing with the RTDX host library. On Windows platforms a Microsoft Component Object Model (COM) interface is available, allowing clients such as Visual Basic, Visual C++, Excel, LabView, MATLAB and others.
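The 'buffer, then return immediately' behaviour described above for the target-side write can be illustrated with a generic ring buffer. The sketch below is not the RTDX API: the type `channel_t`, the functions `chan_write()` and `chan_drain()`, and the 64-byte capacity are hypothetical names and values chosen for illustration, and `chan_drain()` merely stands in for whatever mechanism transmits the data in the background.

```c
#include <assert.h>
#include <stddef.h>

#define CHAN_CAPACITY 64u  /* assumed buffer size, for illustration only */

/* Hypothetical unidirectional channel backed by a ring buffer. */
typedef struct {
    unsigned char buf[CHAN_CAPACITY];
    size_t head;   /* next position to write */
    size_t tail;   /* next position to transmit */
    size_t used;   /* bytes waiting for transmission */
} channel_t;

/* Called by the application: copy the data into the buffer and return
 * at once. Returns 1 on success, 0 if the data does not fit (in which
 * case nothing is copied). */
int chan_write(channel_t *ch, const void *data, size_t len)
{
    const unsigned char *src = data;

    assert(ch != NULL && data != NULL);
    if (len > CHAN_CAPACITY - ch->used)
        return 0;
    for (size_t i = 0; i < len; ++i) {
        ch->buf[ch->head] = src[i];
        ch->head = (ch->head + 1u) % CHAN_CAPACITY;
    }
    ch->used += len;
    return 1;
}

/* Stand-in for the background transmitter: move up to max buffered
 * bytes into out and return how many were moved. */
size_t chan_drain(channel_t *ch, unsigned char *out, size_t max)
{
    size_t n = 0;

    assert(ch != NULL && out != NULL);
    while (n < max && ch->used > 0) {
        out[n++] = ch->buf[ch->tail];
        ch->tail = (ch->tail + 1u) % CHAN_CAPACITY;
        ch->used--;
    }
    return n;
}
```

The point of the design is that the cost seen by the calling program is a memory copy only; the slower transfer to the host is decoupled from the real-time code.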
Fig. 37: TI's RTDX main components. The picture is courtesy of TI [56].

In 1998 TI implemented the original RTDX technology, which runs on XDS510-class emulators. A high-speed RTDX version was developed later that relies on additional DSP chip hardware features and on improved emulators, namely the XDS560 class. These emulators make use of two non-JTAG pins in the standard TI JTAG connector to increase RTDX bandwidth. They are also backwards compatible and can support standard RTDX, thus allowing higher data transfer speed. The high-speed RTDX is supported in TI's highest performance DSPs, such as the TMS320C55x, TMS320C621x, TMS320C671x and TMS320C64x families.

Table 11 shows the data transfer speeds available with different combinations of RTDX and emulators. RTDX offers a bandwidth of 10 to 20 kbytes/s, thus enabling real-time debugging of applications such as CD audio and audio telephony. The high-speed RTDX with XDS560-class emulators provides a data transfer speed higher than 2 Mbytes/s, thus allowing real-time visibility into applications such as ADSL, hard-disk drives and videoconferencing [57].

Table 11: Data transfer speed as a function of the emulator type for TI's RTDX

Emulation type                        Speed
RTDX + XDS510                         10–20 kbytes/s
RTDX + USB (ex: 'C6713 DSK board)     10–20 kbytes/s
RTDX + XDS560                         ≤ 130 kbytes/s
High speed RTDX + XDS560              > 2 Mbytes/s

8 Code analysis and optimization

Most DSP applications are subject to real-time constraints and stress the available CPU and memory resources. As a consequence, code optimization might be required to satisfy the application requirements. DSP code can be optimized according to one or more parameters such as execution speed, memory usage, input/output bandwidth, or power consumption. Different parts of the code can be optimized according to different parameters. A trade-off between code size and higher performance exists, hence some functions can be optimized for execution speed and others for code size.

Code development environments typically allow defining several code configuration releases, each characterized by different optimization levels. Figure 38 shows the project configurations available in TI Code Composer Studio. The 'Release' configuration comprises the higher optimization
level, while the 'Debug' configuration enables debug features, which typically increase code size. Finally, the user can specify a 'Custom' configuration where user-selectable debug and optimization features are enabled.

Fig. 38: Choice of the DSP code project configurations in TI Code Composer

It is important to underline that debug and optimization phases are different and often conflicting. In fact, an optimized code does not include any debug information; additionally, the optimizer can re-arrange the code so that the assembly code is not an immediate transposition of the original source code. The reader is strongly encouraged to avoid using the debug and the optimize options together; it is recommended instead to first debug the code and only then to enable the optimization.

8.1 Switching the code optimizer ON

Compilers are nowadays very efficient at code optimization, allowing DSP developers to write higher level code instead of assembly. To do this, compilers must be highly customized, i.e., tightly targeted to the hardware architecture the code will be running upon. However, current trends in software engineering include retargeting compilers to DSP specialized architectures [58].

As previously mentioned, many kinds of optimization can be required. An example is execution speed vs. executable size. Figure 39 shows how the user can select one or the other in the Code Composer Studio development environment.
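Before switching the optimizer on, it is worth checking how the code reads memory locations that can change outside the normal program flow, for instance a control word written by another processor or execution thread. The sketch below shows a polling loop on such a location, declared with the volatile qualifier so that the compiler has to re-read it at every iteration. The function name `poll_ctrl()`, the constant `CTRL_DONE` and the `max_polls` bound are assumptions introduced for this illustration; the bound exists only so that the sketch terminates when run on its own.

```c
#include <assert.h>
#include <stdint.h>

#define CTRL_DONE 0xFFu  /* value that ends the wait (illustrative) */

/* Poll a control location until it holds CTRL_DONE.
 *
 * 'volatile' tells the compiler that *ctrl may change at any time, so
 * the location is read from memory on every iteration. Without it, an
 * optimizer is allowed to read *ctrl once, keep the value in a
 * register, and never notice the update.
 *
 * Returns the number of unsuccessful polls, or -1 if max_polls is
 * reached. */
long poll_ctrl(volatile const uint8_t *ctrl, long max_polls)
{
    long polls = 0;

    assert(ctrl != 0);
    while (*ctrl != CTRL_DONE) {
        if (++polls >= max_polls)
            return -1;
    }
    return polls;
}
```

In a real system the loop would typically be unbounded or guarded by a timer, and the qualifier would be applied only to the few locations that genuinely need it, since every volatile access restricts what the optimizer can do.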
Fig. 39: Choice of optimization levels in TI Code Composer. The plot highlights execution speed vs. executable code size.

The reader should be aware that the optimizer can rearrange the code, hence the code must be written in a proper way. Failing this, the actions generated by the optimized code might be different from those desired and implemented by a non-optimized code. Figure 40 shows two code snippets where the value assumed by the memory location pointed to by ctrl determines the while() loop behaviour. In particular, the DSP exits the while() loop if the ctrl content takes the value 0xFF; the ctrl content can be modified by another processor or execution thread. Both code snippets will perform equally in case of non-optimization. However, in case of optimization the left-hand side code will not evaluate the ctrl content at every while() iteration, hence the DSP will remain forever in the loop. In the right-hand side snippet, the volatile keyword disables memory optimization locally, thus forcing the DSP to re-evaluate the ctrl content value at every while() loop iteration. This guarantees the desired behaviour even when the code is optimized. The number of volatile variables should be restricted to situations where they are strictly needed, as they limit the compiler's optimization.

Fig. 40: Example of good and bad programming techniques. The left-hand side code would likely result in a programming misbehaviour.

The recommended code development flow is to first write high-level code, such as C or C++. This code can then be debugged and optimized, to comply with the specified performance. In case the code still runs slower than desired, the time-critical areas can be re-coded in linear assembly. If the
code is still too slow, then the DSP developer should turn to hand-optimized assembly code. Figure 41 shows a comparison of the different programming techniques, with corresponding execution efficiency and development effort.

Fig. 41: Comparison of programming techniques with corresponding execution efficiency and estimated development effort. The picture is courtesy of TI [59].

8.2 Analysis tools

DSP code often follows the 20/80 rule, which states that 20% of the software in an application uses 80% of the processing time. As a consequence, the DSP developer should first concentrate efforts on determining where to optimize, i.e., on understanding where the execution cycles are mostly spent. The best way to determine the parts of the code to optimize is to profile the application. Over the last ten years DSP development environments have considerably enlarged their offer of analysis tools. Some examples of TI's CCS analysis and tuning tools are:
a) Compiler consultant. It analyses the DSP code and provides recommendations on how to optimize its performance. This includes compiler optimization switches and pragmas, thus allowing a quick improvement in performance. Figure 42 shows how to enable the compiler consultant in CCS.
b) Cache tune. It provides a graphical visualization of memory reference patterns and memory accesses, thus allowing the identification of problem areas related for instance to memory access conflicts.
c) Code size tune. It profiles the application, collects data on individual functions and determines the best combinations of compiler options to optimize the trade-off between code size and execution speed.
d) Analysis ToolKit (ATK). It runs with DSP simulators only and allows one to analyse the DSP code robustness and efficiency. The reader can find more information on the ATK setup and use in Refs. [60, 61].

The DSP developer should not only know where to optimize, as described previously: he/she should also know when to stop. In fact, there is a law of diminishing returns in the code analysis and optimization process. It is thus important to take advantage of the improvements that come with relatively little effort, and leave as a last resort those that are difficult to implement and provide a low yield. Finally, it is strongly recommended to make only
one optimization change at a time; this will allow the developer to exactly map the optimization to its result.

Fig. 42: How the 'Compiler Consultant Advice' can be enabled in TI's CCS Development Environment

8.3 Programming optimization guidelines

This Section includes some general programming guidelines for writing efficient code; these guidelines are applicable to the vast majority of DSP compilers. DSP developers should, however, refer to the manuals of the development tools they are using for more precise information on how to write efficient code. The reference manual for TI TMS320C6xxx DSP can be found in Ref. [62].

– Guideline 1: Use the DMA when possible and allocate data in memory wisely

DMA controllers (see Sub-section 3.2.3) must be used whenever possible so as to free the DSP core for other tasks. The linker (see Sub-section 6.4.3) should be used for allocating data in memory so as to guarantee an efficient memory access. Additionally, DSP developers should avoid placing arrays at the beginning or at the very end of memory blocks, as this creates problems for software pipelining. Software pipelining is a technique that optimizes tight loops by fetching a data set while the DSP is processing the previous one. However, the last iteration of a loop would attempt to fetch data outside the memory space, in case an array is placed on the memory edge. Compilers must then execute the last iteration in a slower way ('loop epilogue') to prevent this address error from happening. Some compilers, such as the ADI Blackfin one, make available compiler options to specify that it is safe to load additional elements at the end of the array.

– Guideline 2: Choose variable data types carefully

DSP developers should know the internal architecture of the DSP they are working on, so as to be able to use the DSP's native data types as opposed to emulated ones, whenever possible. In fact, operations on native data types are implemented by hardware, hence are typically very fast. On the contrary, operations on emulated data types are carried out by software functions, hence are slower and use more resources. An example of emulated data type is the double floating point format on ADI's TigerSHARC floating point DSPs. Another example is the floating point format on ADI's Blackfin family of fixed-point processors [63]. In these DSPs the floating point format is implemented by software functions that use fixed-point multiply and ALU logic. In this last case a faster version of the same functions is available with non-IEEE-compliant data formats, i.e., formats implementing a 'relaxed' IEEE version so as to reduce the computational complexity. Table 12 shows an execution time comparison of IEEE-compliant and non-IEEE-compliant functions in ADI's Blackfin BF533.

Table 12: Execution time of IEEE-compliant vs. non-IEEE-compliant library functions for ADI's Blackfin BF533

– Guideline 3: Functions and function calls

Functions such as max(), min() and abs() are often single-cycle instructions and should be used whenever possible instead of manually coding them. Figure 43 shows on the right-hand side the max() function and on the left-hand side a manual implementation of the same function. The advantage in terms of code efficiency of using a single-cycle max() function is evident. Often more complex functions such as FFT, IIR, or FIR filters are available in vendor-provided libraries. The reader is strongly encouraged to use them, as their optimization is carried out at algorithm level.

Fig. 43: Example of good and bad programming techniques

As few parameters as possible should be passed to a function. In fact, parameters are typically passed to functions by using registers. However, the stack is used when no more registers are available, thus slowing down the code execution considerably.

– Guideline 4: Avoid data aliasing

Aliasing occurs when multiple variables point to the same data. Examples are two buffers that overlap, two pointers that point to the same software object, or global variables used in a loop. This situation can disrupt optimization, as the compiler will analyse the code to determine when aliasing could occur. If it cannot work out whether two or more pointers point to independent addresses or not, the compiler will typically behave conservatively, hence avoid optimization so as to preserve the program correctness.

– Guideline 5: Write loops code carefully

Loops are found very often in DSP algorithms, hence their coding can strongly influence the program execution performance. Function calls and control statements should be avoided inside a loop, so as to prevent pipeline flushes (see Sub-section 3.3.2). Figure 44 shows an example of good and bad programming techniques with respect to control statements inside a for() loop: by moving the conditional expression if…else outside the loop, as shown in the right-hand side code snippet, one can reduce the number of times the conditional expression is executed.

Fig. 44: Example of good and bad programming techniques

Loop code should be kept small, so as to fit entirely into the DSP cache memory and to allow a local repeat optimization. In case of many nested loops, the reader should be aware that compilers typically focus their optimization efforts on the inner loop. As a consequence, pulling operations from the outer to the inner loop can improve performance. Finally, it is recommended to use int or unsigned int data types for loop counters instead of the larger-sized data type long.

– Guideline 6: Be aware of time-consuming operations

There are operations, such as the division, that do not have hardware support for a single-cycle implementation. They are instead implemented by functions based on iterative approximation algorithms, such as Newton–Raphson. The DSP developer should be aware of that and try to avoid them when possible. For example, the division by a power of two can be converted to the easier right shift on unsigned variables. DSP manufacturers often provide indications on techniques to implement the division instruction more efficiently [64].

Other operations are available from library functions. Examples are sine, cosine and atan functions, very often needed in the accelerator sector for the implementation of rotation matrixes and for rectangular to polar coordinates conversion. If needed, custom implementations can be developed to obtain a favourable ratio between precision and execution time. Table 13 shows the comparison of different implementations of the same functions; in particular, the second column shows a custom implementation used in CERN's LEIR accelerator. In this implementation, the sine, cosine and atan calculation algorithm has been implemented by a polynomial expansion of the seventh order instead of the usual Taylor series expansion.

Table 13: Execution times vs. different implementations of the same functions

Execution time [µs]
Function   CERN single-precision             VisualDSP++ single-precision   VisualDSP++ double-precision
           implementation                    implementation                 implementation
cosine     0.25 (for a sine/cosine couple)   0.59                           5.5
sine       (shared with cosine)              0.59                           5.3
atan       0.4125                            1.4                            5.6

– Guideline 7: Be aware that DSP software heavily influences power optimization

DSP software can have a significant impact on power consumption: software that is efficient in terms of the processor cycles required to carry out a task is often also energy efficient. Software should be written so as to minimize the number of accesses to off-chip memory; in fact, the power required to access off-chip memory is usually much higher than that used for accessing on-chip memory.

Power consumption can be further optimized in DSPs that support selective disabling of unused functional blocks (e.g., on-chip memories, peripherals, clocks, etc.). These 'power down modes' are available in ADI DSPs (such as Blackfin) as well as in TI DSPs (such as the TMS320C6xxx family [35]). Making good use of these modes and features can be difficult; however, APIs and specific software modules are available to help. An example is TI's DSP/BIOS Power Manager (PWRM) module [65], providing a kernel-level API that interfaces directly to the DSP hardware by writing and reading configuration registers. Figure 45 shows how this module is integrated in a generic application architecture for DSPs belonging to TI's TMS320C55x family.

Fig. 45: TI's DSP/BIOS Power Manager (PWRM) module in a general system architecture. Picture courtesy of Texas Instruments [65].

9 Real-time design flow: system design

This section deals with some aspects of digital systems design, particularly with software and hardware architectures. Here the assumption is that the system to be designed is based upon one or more DSPs. The reader should, however, be aware that in the accelerator sector there are currently three main real-time digital signal processing actors: DSPs, FPGAs and front-end computers. The front-end computers are typically implemented by embedded General Purpose Processors (GPPs) running an RTOS. Nowadays, the increase in clock speed allows GPPs to carry out real-time data processing and slow control actions; in addition, there is a tendency to integrate DSP hardware features and specialized instructions into GPPs, yielding GPP hybrids. One example of such processors is given in Fig. 46, showing the PowerPC with Motorola's Altivec extension. The Altivec 128-bit SIMD unit adds up to 16 operations per clock cycle, in parallel to the Integer and Floating Point units, and 162 instructions to the existing RISC architecture.

Fundamental choices to make when designing a new digital system are which digital signal processing actors should be used and how tasks should be shared between them. This choice requires detailed and up-to-date knowledge of the different possibilities.

Fig. 46: Altivec technology: SIMD expansion to Motorola PowerPC (G4 family)

In industry the choice of the DSP to use is often based on the '4P' law: Performance, Power consumption, Price and Peripherals. In the accelerator sector, the power consumption factor is typically negligible. Other factors are instead decisive, such as standardization in the laboratory, synergies with existing systems, and possibilities of evolution to cover different machines. Last but not least, one should consider the existing know-how in terms of tools and of hardware, which can be directly translated to a shorter development time.

In this section three design aspects are considered and briefly discussed, namely:
a) DSP choice in Sections 9.1 and 9.2.
b) System architecture in Sections 9.3 to 9.6.
c) DSP code design in Sections 9.7 and 9.8.

9.1 DSP choice: fixed vs. floating-point DSPs

The reader can find a basic description of fixed- and floating-point number formats in Section 3.4. Fixed-point formats can typically be implemented in hardware in a cheaper way, with better energy efficiency and less silicon than floating-point formats. Very often fixed-point DSPs support a faster clock than floating-point DSPs; as an example, TI fixed-point DSPs can currently be clocked up to 1.2 GHz, while TI floating-point DSPs are clocked up to 300 MHz.

Floating-point formats are easier to use since the DSP programmer can mostly avoid carrying out number scaling prior to each arithmetic operation. In addition, floating-point numbers provide a higher dynamic range, which can be essential when dealing with large data sets and with data sets whose range cannot be easily predicted.
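The price of a wide dynamic range is a resolution that depends on the magnitude of the number. A minimal sketch of how this spacing can be inspected on a host computer is given below. It assumes that `float` is the IEEE 754 32-bit single-precision format and that the argument is positive and finite; the function name `gap_above()` is introduced here for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Gap between x and the next larger single-precision number.
 * Assumes 'float' is IEEE 754 binary32 and x is positive and finite:
 * under these assumptions, adding 1 to the bit pattern of x yields
 * the next representable value. */
float gap_above(float x)
{
    uint32_t bits;
    float next;

    assert(x > 0.0f);
    memcpy(&bits, &x, sizeof bits);    /* reinterpret without aliasing issues */
    bits += 1u;
    memcpy(&next, &bits, sizeof next);
    return next - x;
}
```

Under these assumptions, `gap_above(2.0e8f)` evaluates to 16 and `gap_above(400.78e6f)` to 32, whereas `gap_above(1.0e4f)` is about 0.001. Numbers of a few hundred million are therefore represented in steps of tens of units, while numbers of a few thousand are resolved to better than one hundredth of a unit.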
The reader should be aware that floating-point numbers are not equispaced, i.e., the gap between adjacent numbers depends on their magnitude: large numbers have large gaps between them, and small numbers have small gaps. As an example, the gap between adjacent numbers is higher than 10 for numbers of the order of $2 \cdot 10^{8}$. Additionally, the error due to truncation and rounding during the floating-point number scaling inside the DSP depends on the number magnitude, too. This introduces a noise floor modulation that can be detrimental for high-quality audio signal processing. For this reason, high-quality audio has been traditionally implemented by using fixed-point numbers. However, a migration of high-fidelity audio from fixed- to floating-point implementation is currently taking place, so as to benefit from the greater accuracy provided by floating-point numbers.

The choice between fixed- and floating-point DSP is not always easy and depends on factors such as power consumption, price, and application type. As an example, military radars need floating-point implementations as they rely on finding the maximal absolute value of the cross-correlation between the sent signal and the received echo. This is expressed as the integral of a function against an exponential; the integral can be calculated by using FFT techniques that benefit from the floating-point dynamic range and resolution. For radar systems, the power consumption is not a major issue. The
floating-point DSP additional cost is not an issue either, as the processor represents only a fraction of the global system cost. Another example is the mobile TV. The core of this application is the decoder, which can be MPEG-2, MPEG-4 or JPEG-2000. The decoding algorithms are designed to be performed in fixed-point; the greater precision of floating-point numbers is not useful as the algorithms are in general bit-exact.

It should be underlined that many digital signal processing algorithms are often specified and designed with floating-point numbers, but are subsequently implemented in fixed-point architectures so as to satisfy cost and power efficiency requirements. This requires an algorithm conversion from floating-point to fixed-point and different methodologies are available [66]. Finally, as mentioned in Section 8.3, some fixed-point DSPs make available floating-point numbers and operations by emulating them in software (hence they are slower than in a native floating-point DSP). An example is ADI's Blackfin [63].

The fact that floating-point numbers are not equispaced has already been mentioned. The reader might be interested in looking at some consequences of this with an example from the LHC beam control implementation. Figure 47 shows a zoom onto the beam loops part of the LHC beam control. The 'Low-level Loops Processor' is a board including a TigerSHARC DSP and an FPGA. The FPGA carries out some simple pre-processing and data interfacing, while the DSP implements the low-level loops. In particular, the DSP calculates the frequency to be sent to the cavities from the beam phase, radial position, synchrotron frequency, and programmed frequency; these calculations are carried out in floating-point format. The frequency to be sent to the cavities, referred to as F_out in Fig. 47, must be expressed as an unsigned, 16-bit integer. The desired frequency range to represent is 10 kHz, hence the needed resolution is 0.15 Hz (10 kHz divided over $2^{16}$ values). The LHC cavities work at a frequency of about 400.78 MHz but the spacing of a single-precision, floating-point number with magnitude of approximately $400 \cdot 10^{6}$ is higher than one. To avoid the use of the slower, double-precision, floating-point format, the beam loop calculations are carried out as an offset from 400.7819 MHz.

Fig. 47: LHC beam control
–zoomontothebeamloopspart49215 M.E.ANGOLETTA9.2DSPchoice:benchmarkingBenchmarkingaDSPmeansevaluatingitonanumberofdifferentmetrics.Table14givesanexampleofsomecommonmetricsandcorrespondingunits.Table14:ExamplesofDSPperformancemetricsetsandcorrespondingunitsGoodbenchmarksareimportantforcomparingDSPsandallowcriticalbusinessortechnicaldecisionstobemade.Itshouldbeunderlinedthatbenchmarkscanbemisleading,thusshouldbeconsideredinacriticalway.Asanexample,themaximumclockfrequencyofaDSPcanbedifferentfromtheinstructionrates;hencethisparametermightnotbeindicativeoftherealDSPprocessingpower.AnotherexampleistheexecutionspeedmeasuredinMIPS:thismetriciseasytomeasurebutitisoftentoosimpletoprovideusefulinformationabouthowaprocessorwouldperforminarealapplication.VLIWarchitecturesissueandexecutemultipleinstructionsperinstructioncycle.TheseprocessorsusuallyusesimplerinstructionsthatperformlessworkthantheinstructionstypicalofconventionalDSPs.Asaconsequence,MIPScomparisonbetweenVLIW-basedDSPandconventionalonesismisleading.Morecomplexbenchmarksareavailable;examplesaretheexecutionofapplicationtasks(typicallycalledkernelfunctions)suchasIIRfilters,FIRfilters,orFFTs.KernelfunctionbenchmarkingistypicallymorereliableandisavailablefromDSPmanufacturesaswellasfromindependentcompanies.ItisdifficulttoprovidegeneralguidelinestomeasuretheefficacyofDSPbenchmarksforDSPselection.Twogeneralrulesshouldbefollowed:first,thebenchmarkshouldperformthetypeofworktheDSPwillbeexpectedtocarryoutinthetargetedapplication.Second,thebenchmarkshouldimplementtheworkinawaysimilartowhatwillbeusedinthetargetedapplication.9.3Systemarchitecture:multiprocessorarchitecturesMultiprocessorarchitecturesarethosewheretwoormoreprocessorsinteractinreal-timetocarryoutatask.Rightfromtheirearlydays,manyDSPfamilieshavebeendesignedtobecompatiblewithmultiprocessingoperation;anexampleistheTITMS320C40family.Multiprocessingarchitecturesareparticularlysuitedforapplicationswithahighdegreeofparallelism,suchasvoiceprocessing.Infact,processingtenvoicec
hannelscanbecarriedoutbyimplementingaone-voicechannel,thenrepeatingtheprocesstentimesinparallel.Applicationsrequiringmultiprocessingcomputingtosupportprocessingofgreaterdataflowincludehigh-endaudiotreatment,3Dgraphicsacceleration,andwirelesscommunicationinfrastructure,justtomentionafewofthem.50216 DIGITALSIGNALPROCESSORFUNDAMENTALSANDSYSTEMDESIGNThereisanotherreasontomovetomultiprocessingsystems.FormanyyearsdevelopershavebeentakingadvantageofthesteadyprogressinDSPperformance.Newandfasterprocessorswouldbeavailable,allowingmorepowerfulapplicationstobeimplementedsometimesonlyforthepriceofportingexistingcodetothenewDSP.Thisfavourablesituationwasdrivenbythesteadyprogressofthesemiconductorindustrythatmanagedtopackmoretransistorsintosmallerpackagesandathigherclockfrequencies.Theincreasedperformancewasenabledbyarchitecturalinnovations,suchasVLIW,aswellasaddedresources,suchason-chipmemories.Inrecentyears,however,progressinsingle-chipperformancehasbeenslowingdown.Thesemiconductorindustryhasturnedtoparallelismtoincreaseperformance.ThisistruenotonlyfortheDSPsector,butingeneralforbusinesscomputing.OneexampleistheIntelCoreDuoprocessors,includingtwoexecutioncoresinasingleprocessor,nowtheestablishedplatformforpersonalcomputersandlaptops.Finally,thereadershouldbeawarethatdevelopmentenvironmentshaveevolvedtoprovidesupportfordebuggingmultipleprocessorcoresconnectedinthesameJTAGpath[67].AnexampleisTI’sParallelDebugManager[68],whichisintegratedwithintheCodeComposerStudioIDE.Ofthemanypossiblemultiprocessingforms,themulti-DSPandmulti-coreapproachesareconsideredanddiscussedinSub-sections9.3.1and9.3.2,respectively.Examplesofembeddedmulti-processorsanddifferentapproachescanbefoundinRef.[69].9.3.1Multi-DSParchitectureManyseparateDSPchipscanco-operatetocarryoutataskprovidinganincreasedsystemperformance.Oneadvantageofthisapproachisthescalability,i.e.,theabilitytotunethesystemperformanceandcosttotherequiredfunctionalityandprocessingperformancebyvaryingthenumberofDSPchipsused.Thereadershould,howe
ver,beawarethatmulti-DSPdesignsinvolvedifferentconstraintsthansingle-processingsystems.Threekeyaspectsmustbetakenintoaccount.a)Tasksmustbepartitionedbetweenprocessors.Asanexample,asingleprocessorcanhandleataskfromstarttoend;asanalternative,aprocessorcanperformonlyaportionofthetask,thenpasstheintermediateresultstoanotherprocessor.b)Resourcessuchasmemoryandbusaccessmustbesharedbetweenprocessorssoastoavoidbottlenecks.Asanexample,additionalmemorymaybeaddedtostoreintermediateresults.Organizingmemoryintosegmentsorbanksallowssimultaneousmemoryaccesseswithoutcontentionsifdifferentbanksareaccessed.c)Arobustandfastinter-DSPcommunicationmeansmustbeestablished.Ifthecommunicationistoocomplexortakestoomuchtime,theadvantageofamultiprocessingcanbelost.Twoexamplesofmulti-DSParchitecturesbasedonADIDSPsareshowninFig.48.ThereadercanfindmoredetailedinformationinRefs.[70]and[71].Ontheleft-handside(plota)thepoint-to-pointarchitectureisdepicted,baseduponADIlinkportinterconnectcablestandard[27].Point-to-pointinterconnectprovidesadirectconnectionbetweenprocessorelements.ThisisparticularlyusefulwhenlargeblocksofintermediateresultsmustbepassedbetweentwoDSPswithoutinvolvingtheothers.Read/writetransactionstoexternalmemoryaresavedbypassingdatadirectlybetweentwoDSPs,thusallowingtheuseofslowermemorydevices.Additionally,thepoint-to-pointinterconnectcanbeusedtoscaleadesign:additionallinkscanbeaddedtohavemoreDSPsinteracting.Thiscanbedoneeitherdirectlyorbybridgingacrossseverallinks.51217 
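As a rough illustration of aspects (a) and (c), the following host-side Python sketch models two processors joined by a point-to-point link. It is not DSP code and does not use any ADI link port API: the FIFO `link`, the functions `fir`, `energy` and `run_pipeline`, and the choice of an FIR stage followed by an energy stage are all assumptions made for the example. The first thread plays the role of a DSP that performs only a portion of the task, then passes the intermediate results directly to the second one.

```python
import queue
import threading


def fir(block, taps):
    """Stage 1: direct-form FIR filter applied to one block (zero initial state)."""
    out = []
    for n in range(len(block)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * block[n - k]
        out.append(acc)
    return out


def energy(block):
    """Stage 2: sum of squares of one block of intermediate results."""
    return sum(x * x for x in block)


def run_pipeline(blocks, taps):
    """Run stage 1 and stage 2 on two threads joined by a bounded FIFO."""
    # Hypothetical stand-in for a point-to-point link; the small depth means
    # the producer blocks when the consumer falls behind.
    link = queue.Queue(maxsize=2)
    results = []

    def dsp_a():
        for block in blocks:
            link.put(fir(block, taps))  # pass intermediate results directly
        link.put(None)                  # end-of-stream marker

    def dsp_b():
        while True:
            item = link.get()
            if item is None:
                break
            results.append(energy(item))

    thread_a = threading.Thread(target=dsp_a)
    thread_b = threading.Thread(target=dsp_b)
    thread_a.start()
    thread_b.start()
    thread_a.join()
    thread_b.join()
    return results
```

The bounded queue is a deliberate choice: it makes visible that a slow consumer stalls the producer, which is the kind of communication cost that aspect (c) warns about.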
Fig. 48: Examples of multi-DSP configurations. (a) point-to-point, link port-based and (b) cluster bus

On the right-hand side (plot b) the cluster bus architecture is depicted. A cluster bus maps internal memory resources, such as registers and processor memory addresses, directly onto the bus. This allows DSP code developers to exchange data between DSPs using addresses as if each processor possessed the memory for storing the data. Memory arbitration is managed by the bus master; this avoids the need for complex memory or data sharing schemes managed by software or by an RTOS. The map also includes a common broadcast space for messages that need to reach all DSPs. As an example, Fig. 49 shows the TigerSHARC global memory map. The multiprocessing space maps the internal memory space of each TigerSHARC processor in the cluster into any other TigerSHARC processor. Each TigerSHARC processor in the cluster is identified by its ID; valid processor ID values are 0 to 7.

Fig. 49: ADI TigerSHARC TS101 global memory map. Picture courtesy of Analog Devices [72].

The reader should be aware that the two above-mentioned architectures, namely point-to-point and cluster bus, are not mutually exclusive; on the contrary, they can both be used in the same application as complementary solutions.

9.3.2 Multi-core architecture

In a multi-core architecture, multiple cores are integrated into the same chip. This provides a considerable increase of the performance per chip, even if the performance per core only increases slowly. Additionally, the power efficiency of multi-core implementations is much better than in traditional single-core implementations. This approach is a convenient alternative to DSP farms.

As the performance required by DSP systems keeps increasing, it is nowadays essential for DSP developers to devise a processing extension strategy. Multi-core architectures can provide it, in that the DSP performance is boosted without switching to a different core architecture. This has the advantage that applications can be based upon multiple instances of an already-proven core, rather than be adapted to new architectures.

DSP multi-core architectures have been commercialized only recently; however, the DSP market has relied for many years on co-processor technology (also called on-chip accelerators) to boost performance. Figure 50 shows the evolution of DSP architecture. From the initial single-core architecture (a), the single-core plus co-processor architecture (b) soon emerged. The co-processor often runs at the same frequency as the DSP, therefore 'doubling' the performance for the targeted application. Co-processor examples are Turbo and Viterbi decoders for communication applications. Examples of decoder coprocessors for TI's TMS320C64x can be found in Refs. [73] and [74]. Finally, over the last few years the multi-core architecture shown in plot (c) has emerged, which still includes co-processors.

Fig. 50: Multi-core and co-processor DSP architectures evolution. Single-core DSP (a), single-core DSP plus coprocessor (b) and multi-core DSP plus coprocessor (c).

Multi-core architectures are available in two different flavours, namely Symmetric Multi-Processing (SMP) and Asymmetric Multi-Processing (AMP). SMP architectures include two or more processors which are similar (or identical), connected through a high-speed path and sharing some peripherals as well as memory space. AMP architectures combine two different processors, typically a microcontroller and a DSP, into a hybrid architecture.

It is possible to use a multi-core device in different ways. The different cores can operate independently or they can cooperate for task completion. An efficient inter-core communication may be needed in both cases, but it is particularly important when two or more cores work together to complete a task. As for the multi-DSP case discussed in Sub-section 9.3.1, it is important to decide how to share resources to avoid bottlenecks and deadlocks, and to ensure that one core does not corrupt the operation of another core. The resources must be partitioned not only at board level, like in the single-core case, but at device level, too, thus adding complexity.

Figure 51 shows an example of multi-core bus and memory hierarchy architecture. L1 memories are typically dedicated to their own core and not partitioned between cores, as it may be inefficient to access them from other cores. The L2 memory is an internal memory shared between the different cores, as opposed to the single-core case where the L2 memory can be either internal or external. The multi-core architecture must make sure that each core can access the L2 memory, and the arbitration must be such that cores are not locked out from accessing this resource.

Fig. 51: Multi-core bus and memory hierarchy example

Figure 52 shows the TMS320C5421 DSP as an example of a multi-core, SMP DSP. The TMS320C5421 DSP is composed of two C54x DSP cores and is targeted at carrier-class voice and video end equipment. The cores are 16-bit fixed-point and the chip is provided with an integrated Viterbi accelerator. Four internal buses and dual address generators enable multiple program and data fetches and reduce memory bottlenecks.

Fig. 52: TMS320C5421 multi-core DSP as an SMP example. Picture courtesy of Texas Instruments, DSP selection guide 2007, p. 48.

The programming of multi-core systems is generally more complex than in the single-core case. In particular, the reader should be aware that multi-core code must follow the re-entrance rules, to make sure that one core's processing does not corrupt the data used by another core's processing. This approach is followed by single-core processors, too, when implementing multi-tasking operations. An example of the advantages and challenges a developer faces when moving an audio application from a single- to a double-core architecture is given in Ref. [75].

9.4 System architecture: radiation effects

Single-Event Upset (SEU) events are alterations in the behaviour of electronic circuits induced by radiation. These alterations can be transient disruptions, such as changes of logic states, or permanent IC alterations. The reader is referred to Ref. [76] for more information on the subject.

Techniques to mitigate these effects in ICs can be carried out at different levels, namely:

a) At device level, for instance by adding extra-doping layers to limit the substrate charge collection.

b) At circuit level, for instance by adding decoupling resistors, diodes, or transistors in the SRAM hardening.

c) At system level, with Error Detection And Correction (EDAC) circuitry or with algorithm-based fault tolerance [77]. Examples of the latter approach are the Triple Modular Redundancy (TMR) algorithm and the newer Weighted Checksum Code (WCC). The reader should, however, be aware that there are limitations to what these algorithms can achieve. For instance, the WCC method applied to floating-point systems may fail, as roundoff errors may not be distinguished from functional errors caused by radiation.

Neither ADI nor TI currently provides any radiation-hard DSP. Third-party companies have developed and marketed radiation-hard versions of ADI and TI DSPs. An example is Space Micro Inc., based in San Diego, California. This company devised the Proton200k single-board computer based upon a TI C67xx DSP, fitted with EDAC circuitry and with a total dose tolerance higher than 100 krad.

The LHC power supply controllers [78, 79] are examples of mitigation techniques applied to DSPs. They are based upon non-radiation-hard TI C32 DSPs and microcontrollers. The memory is protected with EDAC circuitry and by avoiding the use of DSP internal memory, which cannot be protected. A watchdog system restarts the power supply controller in the event of a crash. Radiation tests [80] have been carried out to check that the devised protection strategy is sufficient for normal operation.

9.5 System architecture: interfaces

An essential step in the digital system design is to clearly define the interfaces between the different parts of the system. Figure 53 shows some typical building blocks that can be found in a digital system, namely DSP(s), FPGA(s), daughtercards, Master VME, machine timings, and signals. The DSP system designer must define the interfaces between the DSP(s) and the other building blocks.

It is strongly recommended to avoid hard-coding in the DSP code the address of memory regions shared with other processing elements. Instead, the linker should be used to allocate the software structures appropriately in the DSP memory, as mentioned in Sub-section 6.4.3. Additionally, the DSP developer should create data access libraries, so as to obtain a modular, hence more easily upgradeable, approach.
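The idea of a data access library can be sketched on a host PC as follows. This is only a model under stated assumptions: the field names, offsets and formats in `LAYOUT` and the class `SharedRegion` are hypothetical and do not come from any system described here. On a real DSP the region base would be supplied by the linker rather than passed in by hand. The point of the sketch is that callers refer to fields by name, so the memory layout can change in one place without touching the rest of the code.

```python
import struct

# Hypothetical layout of a memory region shared with another processing
# element: field name -> (byte offset from the region base, struct format).
LAYOUT = {
    "version": (0, "<I"),  # unsigned 32-bit code version number
    "gain":    (4, "<f"),  # 32-bit floating-point control parameter
    "status":  (8, "<H"),  # unsigned 16-bit status word
}


class SharedRegion:
    """Named access to a shared region; no address appears in calling code."""

    def __init__(self, memory, base):
        self._memory = memory  # any writable buffer, e.g. a bytearray
        self._base = base      # region start, externally supplied

    def read(self, name):
        offset, fmt = LAYOUT[name]
        return struct.unpack_from(fmt, self._memory, self._base + offset)[0]

    def write(self, name, value):
        offset, fmt = LAYOUT[name]
        struct.pack_into(fmt, self._memory, self._base + offset, value)
```

A caller would write, for instance, `SharedRegion(bytearray(64), base=16).write("gain", 0.5)`; moving the region or reordering its fields then only requires changing `base` or `LAYOUT`.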
Fig. 53: Typical digital system building blocks and corresponding interfaces

9.6 System architecture: general recommendations

Basically all DSP chips present some anomalies in their expected behaviour. This is especially true for the first release of DSP chips, as discovered anomalies are typically solved in later releases. A list of all anomalies for a certain DSP release, which also includes workarounds when possible, is normally available on the manufacturer's website. The reader is strongly encouraged to look at those lists, so as to avoid being delayed by already-known problems.

Fig. 54: TI C6713 DSK evaluation board – picture (a) and board layout (b)

A DSP system designer can gain useful software and hardware experience by using evaluation boards in the early stages of system design. Evaluation boards are typically provided by manufacturers for the most representative DSPs. They are relatively inexpensive and are typically fitted with ADCs and DACs; they come with the standard development environment and JTAG interface, too. The DSP designer can use them to solve technical uncertainties and sometimes can even modify them to quickly build a system prototype [81]. Figure 54 shows TI's C6713 DSK evaluation board (a) and the corresponding board layout (b); this evaluation board was the one used in the DSP laboratory companion of the lectures summarized in this paper.
9.7 DSP code design: interrupt-driven vs. RTOS-based systems

A fundamental choice that the DSP code developer must make is how to trigger the different DSP actions. The two main possibilities are via an RTOS or via interrupts.

An overview of RTOS is given in Section 6.3. An RTOS can define different threads, each one performing a specific action, as well as the corresponding threads' priorities and triggers. RTOS-based systems typically have a clean design and many built-in checks. The disadvantage of using an RTOS is a potentially slower response to external events (interrupts) and the use of DSP resources (such as some hardware timings and interrupts) for the internal RTOS functioning.

Interrupt-driven systems associate actions directly to interrupts. The resource use is therefore optimized. An example of an interrupt-driven system is CERN's LEIR LLRF [42]. Figure 55 shows some of its software components: a background task triggered every millisecond carries out housekeeping actions, while a control task triggered every 12.5 µs implements the beam control actions. Driving a system through interrupts is very efficient with a limited number of interrupts. For a high number of interrupts, the system can become very complex and its behaviour not easily predictable.

Fig. 55: Example of an interrupt-driven system. Control and background tasks are triggered by interrupts and are shown in red and green, respectively.

9.8 DSP code design: good practice

A vast amount of literature is available on code design good practice. Here just a few points are underlined, which are particularly relevant to embedded systems.

First, digital systems must not turn into tightly sealed black boxes. It is essential that designers embed many diagnostics buffers in the DSP code, so as to prevent this from happening. The diagnostics buffers could take many forms, such as post-mortem, circular or linear buffers. They might be user-configurable and must be visible from the application program. An example of a digital system including extensive diagnostics capabilities can be found in Ref. [42].

Second, every new DSP code release should be characterized by a version number, visible from the application level. The functionality and interface map corresponding to a certain version number should be clearly documented, so as to avoid painful misunderstandings between the many system layers.

Source code control is essential for managing complex software development projects, as large projects require more than one DSP code developer working on many source files. Source code control tools make it possible to keep track of the changes made to individual source files and prevent files from being accessed by more than one person at a time. DSP software development environments can often support many source control providers. Code Composer Studio, for example, supports any source control provider that implements the Microsoft SCC Interface.

Finally, DSP developers should also add checks on the execution duration, to make sure the code does not overrun. This is particularly important for interrupt-driven systems (mentioned in Section 9.7), where one or more interrupts may be missed if the actions corresponding to an interrupt are not finished by the time the next interrupt occurs. As an example, the minimum and maximum number of clock cycles needed for executing a piece of code can be constantly measured and monitored by the user at high level. All DSPs provide means to measure the number of clock cycles required to execute a certain amount of code; the number of clock cycles can then be easily converted into absolute time. Figure 56 shows a possible implementation on ADI SHARC DSPs of the execution duration measurement for a piece of code called 'critical action'. SHARC processors have a set of registers called emuclk and emuclk2 which make up a 64-bit counter. This counter is unconditionally incremented during every instruction cycle on the DSP and is not affected by factors such as cache misses or wait states. Every time emuclk wraps to zero, emuclk2 is incremented by one. By determining the difference in the emuclk value between before and after the critical action, the DSP developer can determine the number of clock cycles, hence the time, needed to execute the code.

Fig. 56: Execution duration measurement with emuclk registers in the ADI SHARC DSP

10 Real-time design flow: system integration

The system integration is one of the final parts in the system development process. This phase is extremely important as it can determine the success or the failure of a whole system. In fact, a system which is well integrated can become operational, while a system only partially integrated will often remain a 'machine development' tool, easily forgotten.

During the system integration phase, the system is commissioned with respect to data exchange with the control infrastructure and the application program(s). Two or more groups, such as Instrumentation, Controls and Operation, can be involved in this effort, depending on the laboratory's organization. As a consequence, coordination and specification work is required.

Good system integration practices will depend on the laboratory's organization as well as on the system architecture. There are, however, some guidelines that can be applied to most cases.

– Guideline 1: Work in parallel
All software layers needed in a system should be planned in parallel. Waiting until the low-level part is completed before starting with the specification and/or with the development of the other layers may result in unacceptable delays.

– Guideline 2: About interfaces
Section 9.5 summarized the many interfaces that can exist in a system. For a successful system integration it is essential that the interfaces are specified clearly, are agreed upon with all different parties and are fully documented. Recipes on how to set up different software components of the system or on how to interact with them can be really useful and considerably speed up system development as well as debugging. It is recommended that all documents be kept updated and stored on servers accessible by all parties involved. Remember: good fences make good neighbours!
– Guideline 3: Always include checks on the validity of the DSP inputs
The validity of all control inputs to the DSP should be checked. Alarms or warnings should be raised if a control value falls outside the allowed range. This mechanism will help the system integration part and could even prevent serious malfunctioning from happening.

– Guideline 4: Add spare parameters
It is strongly recommended to map spare parameters between the DSP and the application program; they should have different formats for maximum flexibility. These spare parameters allow adding debugging features or making small updates without modifications to the intermediate software layers.

– Guideline 5: Code release and validation
The source code (and if possible the corresponding executable, too) should be saved together with a description of its features and implemented interfaces. This will allow going back to previous working releases in case of problems. Procedures and data sets should also be defined for code validation.

11 Summary and conclusions

This paper aimed at providing an overview of DSP fundamentals and DSP-based system design. The DSP hardware and software evolution was discussed in Sections 1 and 2, together with typical DSP applications to the accelerator sector. Section 3 showed the main features of DSP core architectures and Section 4 gave an overview of DSP peripherals. The real-time design flow was introduced in Section 5 and its steps were discussed in detail in Section 6 (software development), Section 7 (debugging), Section 8 (analysis and optimization), Section 9 (system design) and Section 10 (system integration). Existing chip examples were often given and referenced to technical manuals or application notes. Examples of DSP use in existing accelerator systems were also given whenever possible.

The DSP field is of course very large and more information, as well as hands-on practice, is required to become proficient in it. However, the author hopes that this document and the references herein can be useful starting points for anyone wishing to work with DSPs.

References

[1] T. Shea, Types of Accelerators and Specific Needs, these CAS proceedings.
[2] M.E. Angoletta, Digital Signal Processing In Beam Instrumentation: Latest Trends And Typical Applications, DIPAC'03, Mainz, Germany, 2003.
[3] M.E. Angoletta, Digital Low-Level RF, EPAC'06, Edinburgh, Scotland, 2006.
[4] J. Eyre and J. Bier, The Evolution of DSP Processors, IEEE Signal Proc. Mag., vol. 17, Issue 2, March 2000, pp. 44–51.
[5] J. Glossner et al., Trends In Compilable DSP Architecture, Proceedings of IEEE Workshop on Signal Processing Systems (SiPS) 2000, November 2000, Lafayette, LA, USA, pp. 181–199, ISBN 0-7803-6488-0.
[6] R. Restle and A. Cron, TMS320C30–IEEE Floating-Point Format Converter, Texas Instruments Application Report SPRA400, 1997.
[7] E.A. Lee, Programmable DSP Architectures: Part I, IEEE ASSP Mag., October 1988, pp. 4–19.
[8] E.A. Lee, Programmable DSP Architectures: Part II, IEEE ASSP Mag., January 1989, pp. 4–14.
[9] J. Eyre, The Digital Signal Processor Derby, IEEE Spectrum, June 2001, pp. 62–68.
[10] L. Geppert, High-Flying DSP Architectures, IEEE Spectrum, Nov. 1998, pp. 53–56.
[11] P. Lapsley, J. Bier, A. Shoham and E.A. Lee, DSP Processor Fundamentals: Architectures and Features, IEEE Press, ISBN 0-7803-3405-1, 1997.
[12] TMS320C621x/C671x DSP Two-Level Internal Memory Reference Guide, Texas Instruments Literature Number SPRU609A, November 2003.
[13] M. Anderson, Advanced Processor Features And Why You Should Care, Part 1 And 2, talks ESC-404 and ESC-424, Embedded Systems Conference, Silicon Valley 2006.
[14] Analog Devices Team, The Memory Inside: TigerSHARC Swallows Its DRAM, Special Feature SHARC Bites Back, COTS Journal, December 2003.
[15] S. Srinivasan, V. Cuppu and B. Jacob, Transparent Data-Memory Organisations For Digital Signal Processors, Proceedings of CASES'01, International Conference on Compilers, Architecture and Synthesis for Embedded Systems, Caches and Memory Systems Session, Atlanta, Georgia, USA, 2001, pp. 44–48.
[16] TMS320C620x/C670x DSP Program and Data Memory Controller/Direct Memory Access (DMA) Controller – Reference Guide, Texas Instruments Literature Number SPRU234, July 2003.
[17] D. Talla, L.K. John, V. Lapinskii and B.L. Evans, Evaluating Signal Processing And Multimedia Applications On SIMD, VLIW And Superscalar Architectures, Proceedings of the International Conference on Computer Design, September 2000, Austin TX, USA, ISBN 0-7695-0801-4.
[18] J.A. Fisher, P. Faraboschi and C. Young, A VLIW Approach To Architecture, Compilers And Tools, Morgan Kaufmann Publisher, December 2004, ISBN-13 978-1558607668.
[19] Extended-Precision Fixed-Point Arithmetic On The Blackfin Processor Platform, Analog Devices Engineer-to-Engineer Note EE-186, May 2003.
[20] IEEE Standard For Radix-Independent Floating-Point Arithmetic, ANSI/IEEE Std 854–1987.
[21] D. Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic, Comput. Surv., March 1991.
[22] ADSP-21160 SHARC DSP – Hardware Reference, Revision 3.0, November 2003, Analog Devices Part Number 82-001966-01.
[23] TMS320C6713, TMS320C6713B Floating-Point Digital Signal Processors, Texas Instruments Manual SPRS186I, December 2001, Revised May 2004.
[24] ADSP-BF533 Blackfin Processor – Hardware Reference, Revision 3.1, June 2005, Analog Devices Part Number 82-002005-01.
[25] TMS320C6000 DSP – Multichannel Buffered Serial Port (McBSP) – Reference Guide, Texas Instruments Literature Number SPRU580C, May 2004.
[26] TMS320C6000 DSP – Multichannel Audio Serial Port (McASP) – Reference Guide, Texas Instruments Literature Number SPRU041C, August 2003.
[27] R. Kilgore, Link Port Open Systems Interconnect Cable Standard, Analog Devices Engineer-to-Engineer Note EE-106, October 1999.
[28] J. Kent and J. Sondermeyer, Interfacing ADSP-BF533/BF561 Blackfin Processors to High-Speed Parallel ADCs, Analog Devices Application Note AN-813.
[29] TMS320C6000 DSP Inter-Integrated Circuit (I2C) Module – Reference Guide, Texas Instruments Literature Number SPRU581A, October 2003.
[30] TMS320C6000 DSP Peripheral Component Interconnect (PCI) – Reference Guide, Texas Instruments Literature Number SPRUA75A, October 2002.
[31] TMS320C6000 DSP Host Port Interface (HPI) – Reference Guide, Texas Instruments Literature Number SPRU578A, September 2003.
[32] TMS320C6000 DSP General-Purpose Input/Output (GPIO) – Reference Guide, Texas Instruments Literature Number SPRU584A, March 2004.
[33] TMS320C6000 DSP 32-Bit Timer – Reference Guide, Texas Instruments Literature Number SPRU582A, March 2004.
[34] TMS320C6000 DSP Software-Programmable Phase-Locked Loop (PLL) Controller – Reference Guide, Texas Instruments Literature Number SPRU233B, March 2004.
[35] TMS320C6000 DSP Power-Down Logic and Modes – Reference Guide, Texas Instruments Literature Number SPRU728, October 2003.
[36] TMS320C620x/C670x DSP Boot Modes and Configuration – Reference Guide, Texas Instruments Literature Number SPRU642, July 2003.
[37] TMS320C6000 Peripherals – Reference Guide, Texas Instruments Literature Number SPRU109D, February 2001.
[38] TMS320C6000 DSP External Memory Interface (EMIF) – Reference Guide, Texas Instruments Literature Number SPRU266A, September 2003.
[39] T. Kugelstadt, A Methodology Of Interfacing Serial A-to-D converters to DSPs, Analog Application Journal, February 2000, pp. 1–10.
[40] Efficiently Interfacing Serial Data Converters To High-Speed DSPs, Analog Application Journal, August 2000, pp. 10–15.
[41] J. Sondermeyer, J. Kent, M. Kessler and R. Gentile, Interfacing The ADSP-BF535 Blackfin Processor To High-Speed Converters (Like Those On The AD9860/2) Over The External Memory Bus, Analog Devices Engineer-to-Engineer Note EE-162, June 2003.
[42] M.E. Angoletta et al., Beam Tests Of A New Digital Beam Control System For The CERN LEIR Accelerator, PAC'05, Knoxville, Tennessee, 2005.
[43] D. Dahnoun, Bootloader, Texas Instruments University Program, Chapter 9, 2004.
[44] Code Composer Studio IDE Getting Started Guide – Users Guide, Texas Instruments Literature Number SPRU509F, May 2005.
[45] V. Wan and K-S. Lee, Automated Regression Tests And Measurements With The CCStudio Scripting Utility, Texas Instruments Application Report SPRAAB7, October 2005.
[46] A. Campbell, K-S. Lee and D. Sale, Creating Device Initialization GEL Files, Texas Instruments Application Report SPRAA74A, December 2004.
[47] M.E. Angoletta et al., The New Digital-Receiver-Based System for Antiproton Beam Diagnostics, PAC 2001, Chicago, Illinois, 2001.
[48] D. Dart, DSP/BIOS Technical Overview, Texas Instruments Application Report SPRA780, August 2001.
[49] TMS320C6000 Optimizing Compiler – User's Guide, Texas Instruments Literature Number SPRU187L, May 2004.
[50] TMS320C6000 Assembly Language Tools – User's Guide, Texas Instruments Literature Number SPRU186N, April 2004.
[51] Rewind User's Guide, Texas Instruments Literature Number SPRU713A, April 2005.
[52] TMS320C6000 Instruction Set Simulator – Technical Reference, Texas Instruments Literature Number SPRU600F, April 2005.
[53] C. Brokish, Emulation Fundamentals for TI's DSP Solutions, Texas Instruments Application Report SPRA439C, October 2005.
[54] VisualDSP++ 4.5 – User's Guide, Revision 2.0, April 2006, Analog Devices Part Number 82-000420-02.
[55] B. Novak, XDS560 Emulation Technology Brings Real-time Debugging Visibility to Next Generation High-Speed Systems, Texas Instruments Application Report SPRA823A, June 2002.
[56] H. Thampi, J. Govindarajan, DSP/BIOS, RTDX and Host-Target Communications, Texas Instruments Application Report SPRA895, February 2003.
[57] X. Fu, Real-Time Digital Video Transfer Via High-Speed RTDX, Texas Instruments Application Report SPRA398, May 2002.
[58] S. Jung, Y. Paek, The Very Portable Optimizer For Digital Signal Processors, Proceedings of CASES'01, International Conference on Compilers, Architecture and Synthesis for Embedded Systems, Compilers and Optimization Session, Atlanta, Georgia, USA, 2001, pp. 84–92.
[59] D. Dahnoun, Linear Assembly, Texas Instruments University Program, Chapter 7, 2004.
[60] Analysis Toolkit v1.3 for Code Composer Studio – User's Guide, Texas Instruments Literature Number SPRU623D, April 2005.
[61] V. Wan and P. Lal, Simulating RF3 to Leverage Code Tuning Capabilities, Texas Instruments Application Report SPRAA73, December 2004.
[62] TMS320C6000 Optimizing Compiler – User's Guide, Texas Instruments Literature Number SPRU197L, May 2004.
[63] Analog Devices Team, Fast Floating-Point Arithmetic Emulation on Blackfin Processors, Analog Devices Engineer-to-Engineer Note EE-185, August 2007.
[64] Y-T. Cheng, TMS320C6000 Integer Division, Texas Instruments Application Report SPRA707, October 2000.
[65] V. Wan and E. Young, Power Management in an RF5 Audio Streaming Application Using DSP/BIOS, Texas Instruments Application Report SPRAA19A, August 2005.
[66] D. Menard, D. Chillet and O. Sentieys, Floating-To-Fixed-Point Conversion For Digital Signal Processors, EURASIP J. Appl. Signal Proc., vol. 2006, Article ID 96421.
[67] F. Culloch, Speeding the Development of Multi-DSP Applications, Embedded Edge, June 2001, pp. 22–29.
[68] G. Cooper and J. Hunter, Configuring Code Composer Studio For Heterogeneous Debugging, Texas Instruments Application Report SPRA752, May 2001.
[69] R.F. Hobson, A.R. Dyck, K.L. Cheung and B. Ressi, Signal Processing With Teams Of Embedded Workhorse Processors, EURASIP J. Embedded Syst., vol. 2006, Article ID 69484.
[70] M. Kokaly-Bannourah, Introduction To TigerSHARC Multiprocessor Systems Using VisualDSP++, Analog Devices Engineer-to-Engineer Note EE-167, April 2003.
[71] M. Kokaly-Bannourah, Using The Expert Linker For Multiprocessor LDFs, Analog Devices Engineer-to-Engineer Note EE-202, May 2005.
[72] ADSP-TS101 TigerSHARC Processor – Hardware Reference, Revision 1.1, May 2004, Analog Devices Part Number 82-001996-01.
[73] TMS320C64x DSP Turbo-Decoder Coprocessor (TCP) – Reference Guide, Texas Instruments Literature Number SPRU534A, November 2003.
[74] TMS320C64x DSP Viterbi-Decoder Coprocessor (VCP) – Reference Guide, Texas Instruments Literature Number SPRU533C, November 2003.
[75] P. Cohrs, W. Powell and E. Williams, Creating a Dual-Processor Architectures for Digital Audio, Embedded Edge, June 2002, pp. 14–19.
[76] P.E. Dodd and W.L. Massengill, Basic Mechanism of Single-Event Upset in Digital Microelectronics, IEEE Trans. Nucl. Sci., vol. 50, No. 3, June 2003, pp. 583–602.
[77] M. Vijay and R. Mittal, Algorithm-Based Fault Tolerance: A Review, Microproc. Microsyst., vol. 21, No. 3, Dec. 1997, pp. 151–161.
[78] Q. King et al., The All-Digital Approach To LHC Power Converter Current Control, CERN SL-2002-002 PO.
[79] H. Schmickler, Usage Of DSP And In Large Scale Power Converter Installations (LHC), these CAS proceedings.
[80] Q. King et al., Radiation Tests On The LHC Power Converter Control Electronics, Université Catholique De Louvain-La Neuve (UCL), CERN AB Note 2003-041 PO.
[81] J. Weber et al., PEP-II Transverse Feedback Electronics Upgrade, PAC05, Knoxville, 2005, p. 3928.
