SYSTEM LEVEL DESIGN OF RECONFIGURABLE SYSTEMS-ON-CHIP

Edited by
NIKOLAOS S. VOROS
INTRACOM S.A., Patra, Greece
and
KONSTANTINOS MASSELOS
Imperial College of Science Technology and Medicine, London, U.K.

A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN-10 0-387-26103-6 (HB)
ISBN-13 978-0-387-26103-4 (HB)
ISBN-10 0-387-26104-4 (e-book)
ISBN-13 978-0-387-26104-1 (e-book)

Published by Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands.
www.springeronline.com
Printed on acid-free paper

All Rights Reserved
© 2005 Springer
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
Printed in the Netherlands.

Contents

Contributing Authors
Preface
Acknowledgments

Part A Reconfigurable Systems
Introduction to Reconfigurable Hardware
  KONSTANTINOS MASSELOS AND NIKOLAOS S. VOROS
Reconfigurable Hardware Exploitation in Wireless Multimedia Communications
  KONSTANTINOS MASSELOS AND NIKOLAOS S. VOROS
Reconfigurable Hardware Technologies
  KONSTANTINOS MASSELOS AND NIKOLAOS S. VOROS

Part B System Level Design Methodology
Design Flow for Reconfigurable Systems-on-Chip
  KONSTANTINOS MASSELOS AND NIKOLAOS S. VOROS
SystemC Based Approach
  YANG QU AND KARI TIENSYRJÄ
OCAPI-XL Based Approach
  MIROSLAV ČUPÁK AND LUC RIJNDERS

Part C Design Cases
MPEG-4 Video Decoder
  MIROSLAV ČUPÁK AND LUC RIJNDERS
Prototyping of a HIPERLAN/2 Reconfigurable System-on-Chip
  KONSTANTINOS MASSELOS AND NIKOLAOS S. VOROS
WCDMA Detector
  YANG QU, MARKO PETTISSALO AND KARI TIENSYRJÄ
Contributing Authors

Miroslav Cupak
IMEC, Kapeldreef 75, B-3001 Leuven, Belgium

Konstantinos Masselos
Imperial College of Science Technology and Medicine, Exhibition Road, London, SW7 2BT, United Kingdom

Marko Pettissalo
Nokia Technology Platforms, P.O. Box 50, FIN-90571 Oulu, Finland

Yang Qu
VTT Electronics, P.O. Box 1100, FIN-90571 Oulu, Finland

Luc Rijnders
IMEC, Kapeldreef 75, B-3001 Leuven, Belgium

Kari Tiensyrjä
VTT Electronics, P.O. Box 1100, FIN-90571 Oulu, Finland

Nikolaos S. Voros
INTRACOM S.A., 254 Panepistimiou str., 26443, Patra, Greece

Preface

This book presents the perspective of the ADRIATIC project for the design of reconfigurable systems-on-chip, as perceived in the course of the research during 2001-2004. The project provided: (a) a high-level hardware/software co-design and co-verification methodology and tools for reconfigurable systems-on-chip, supplemented with back-end design tools for the implementation of the reconfigurable logic blocks of the chip, (b) the definition of the technological requirements for reconfigurable processors for wireless terminals and (c) the implementation of MPEG-4, WCDMA and WLAN design cases to validate the methodology and tools.

Reconfigurability is becoming an important part of System-on-Chip (SoC) design to cope with the increasing demands for simultaneous flexibility and computational power. Current hardware/software co-design methodologies provide little support for dealing with the additional design dimension introduced. Further support at the system-level is needed for the identification and modelling of dynamically re-configurable function blocks, for efficient design space exploration, partitioning and mapping, and for performance evaluation. The overhead effects, e.g. context switching and configuration data, should be included in the modelling already at the system-level in order to produce credible information for decision-making.

This book focuses on hardware/software co-design applied for reconfigurable SoCs. We discuss exploration of additional requirements due to reconfigurability, report extensions to two C++ based languages/methodologies, SystemC and OCAPI-XL, to support those requirements, and present results of three case studies in the wireless and multimedia communication domain that were used for the validation of the approaches.

The book includes nine chapters, divided in three parts: Part A contains Chapters 1-3 and provides an introduction to reconfigurable systems-on-chip; Part B contains Chapters 4-6 and describes in detail the proposed system level design methodology and the associated tools; Part C, which contains Chapters 7-9, provides the details of applying the proposed methodology in practice.

Acknowledgments

The research work that provided the material for this book was carried out during 2001-2004 mainly in the ADRIATIC Project (Advanced Methodology for Designing ReconfIgurable SoC and Application-Targeted IP-entities in wireless Communications) supported partially by the European Commission under the contract IST-2000-30049. Guidance and comments of Mr Ronan Burgess, Dr Lech Jozwiak and Dr Mark Hellyar on research direction are highly appreciated.

In addition to the authors, the contributions of the following project members and partners' personnel are gratefully acknowledged: Antti Anttonen, Spyros Blionas, Kristof Denolf, Klaus Kronlöf, Tarja Leinonen, Dimitris Metafas, Robert Pasko, Antti Pelkonen, Konstantinos Potamianos, Tapio Rautio, Geert Vanmeerbeeck, Serge Vernalde, Peter Vos, Erik Watzeels, Matti Weisssenfelt and Yan Zhang. Of them, the editors express their special thanks to Antti Pelkonen and Yan Zhang for their valuable contributions to Chapter 5 and Chapter 9, Robert Pasko and Geert Vanmeerbeeck for their valuable contributions to Chapter 6, Kristof Denolf and Peter Vos for their substantial contributions to Chapter 7 and Serge Vernalde and Erik Watzeels for management related issues.
PART A
RECONFIGURABLE SYSTEMS

Chapter 1
INTRODUCTION TO RECONFIGURABLE HARDWARE

Konstantinos Masselos (1,2) and Nikolaos S. Voros (1)
(1) INTRACOM S.A., Hellenic Telecommunications and Electronics Industry, Greece
(2) Currently with Imperial College of Science Technology and Medicine, United Kingdom

N.S. Voros and K. Masselos (eds.), System Level Design of Reconfigurable Systems-on-Chips, 15-26. © 2005 Springer. Printed in the Netherlands.

Abstract: This chapter introduces the reader to main concepts of reconfigurable computing and reconfigurable hardware. Different types of reconfiguration are discussed. A detailed classification of reconfigurable architectures with respect to the granularity of their building blocks, the reconfiguration scheme and the system level coupling is also presented.

Keywords: Reconfigurable hardware, reconfigurable architectures, reconfiguration, reconfigurable computing

1. RECONFIGURABLE COMPUTING AND RECONFIGURABLE HARDWARE

Reconfigurable computing refers to systems incorporating some form of hardware programmability, that is, customizing how the hardware is used through a number of physical control points [2]. These control points can then be changed periodically in order to execute different applications using the same hardware.

Reconfigurable hardware offers a good balance between implementation efficiency and flexibility as shown in Figure 1-1. This is because reconfigurable hardware combines post-fabrication programmability with the spatial (parallel) computation style [2] of application specific integrated circuits (ASICs), which is more efficient in comparison to the temporal (sequential) computation style of instruction set processors.

Due to the increasing flexibility requirements (e.g. for adaptation to different evolving standards and operating conditions) that are imposed by computationally intensive applications such as wireless communications,
devices need to be highly adaptable to the running applications. On the other hand, efficient realizations of such applications are required, especially in the resources they use during deployment, where power consumption must be traded against perceived quality of the application. The contradictory requirements for flexibility and implementation efficiency cannot be satisfied by conventional instruction set processors and ASICs. Reconfigurable hardware forms an interesting implementation option in such cases.

[Figure 1-1. Positioning of reconfigurable hardware. Labels recovered from the figure: axes "Flexibility" and "Area/Power"; temporal computation style, limited parallelism; spatial computation style, unlimited parallelism; embedded general purpose instruction set processor (LP ARM); instruction set DSP (TI 320CXX); application specific instruction set processor (ASIP); embedded reconfigurable processor/FPGA; embedded reconfigurable logic/FPGA; post-fabrication programmability; dedicated/direct mapped hardware (ASIC); factor of 100-1000.]

There are also other reasons to use reconfigurable resources in system-on-chip (SoC) design. The increasing non-recurring engineering (NRE) costs push designers to use the same SoC in several applications and products in order to achieve a low cost per chip. The presence of reconfigurable resources allows the fine tuning of the chip for different products or product variations. Also, the increasing complexity of future designs raises the possibility of design flaws, which can require costly and slow redesign of the chip. Reconfigurable elements are often homogeneous arrays, which can be pre-verified to minimize the possibility of design errors. In addition, post-manufacturing programmability allows problems to be corrected or circumvented later than is possible with fixed hardware.
2. TYPES OF RECONFIGURATION

The next paragraphs describe different types of reconfiguration.

2.1 Logic reconfiguration

A typical logic block of a reconfigurable architecture contains a look-up table (LUT), an optional D flip-flop and additional combinational logic. The LUT allows any function of its inputs to be implemented, providing generic logic. The flip-flop can be used for pipelining, registers, state holding functions for finite state machines, or any other situation where clocking is required. The combinational logic is usually the fast carry logic used to speed up carry-based computations such as addition, parity, wide AND operations and other functions. The logic blocks located at the periphery of the device can be of a different architecture dedicated to I/O operations.

The logic blocks are grouped into matrices overlaid with a reconfigurable interconnection network of wires. Interconnection network reconfiguration is controlled by changing the connections between the logic blocks and the wires and by configuring the switch boxes, which connect different wires.

The reconfiguration of both the logic blocks and the interconnection network is achieved by using SRAM memory bits to control the configuration of transistors. The functionality of the logic blocks, I/O blocks and the interconnection network is modified by downloading a bitstream of reconfiguration data onto the hardware.

2.2 Instruction-set reconfiguration

The concept of instruction-set reconfiguration refers to hybrid architectures consisting of a microprocessor and reconfigurable logic. The key benefit is a combination of full software flexibility with high hardware efficiency. One promising approach is the reconfigurable instruction set processor (RISP), which has the capability to adapt its instruction set to the application being executed through a reconfiguration in its hardware. The result is a reconfigurable and extensible processor architecture, which can be tailored closely to the designers' specific needs. Through the adaptation, specialized hardware accelerates the execution of the applications. If shared resources are used in the adaptation, the functional density is also improved. By moving the application-specific data-paths into the processor, a remarkable improvement in performance compared to fixed instruction-set processors can be achieved. At the same time, designing at the level of instruction-set architecture significantly shortens the design cycle and reduces verification effort and risk. On the other hand, new methodologies, tools and processor foundations are required. Automated extension of processor function units and of the associated software environment (compilers, debuggers, instruction simulators etc.) is also a key point to success.

Different systems with different characteristics have been designed. Usually two main design tasks are involved:
1. What is the type of interfaces between the microprocessor and the reconfigurable logic?
2. How to design the reconfigurable logic itself?
Each of them contains many trade-offs.

The common classification of reconfigurable processors can be made according to the coupling levels of the reconfigurable logic. The concept of coupling levels applies also without a reference to reconfigurable processors. As shown in Figure 1-2, there are three types of coupling levels:

[Figure 1-2. Basic coupling levels of reconfigurable logic. Labels recovered from the figure: Processor, RFU, Co-processor, Main Bus, Memory, I/O Bus, Attached processor.]

1. Reconfigurable functional unit (RFU) - the logic is placed inside the processor, and the instruction decoder issues instructions to the reconfigurable unit as if it were one of the standard functional units of the processor. In this way, the communication cost is very small, so the speed can be easily increased. This is also the most promising
approach because it can be used to accelerate almost any application [1].
2. Coprocessor - the logic is next to the processor. Communication is done using a protocol.
3. Attached processor - the logic is placed on some kind of I/O bus.

With the coprocessor and attached processor approaches, the speed improvement using the reconfigurable logic has to compensate for the overhead of transferring the data. This usually happens in applications where a huge amount of data has to be processed using a simple algorithm that fits in the reconfigurable logic.

2.3 Static and dynamic reconfiguration

There are two basic reconfiguration approaches: static and dynamic.

2.3.1 Static reconfiguration

Static reconfiguration (often referred to as compile-time reconfiguration) is the simplest and most common approach for implementing applications with reconfigurable logic. Static reconfiguration involves hardware changes at a relatively slow rate. It is a static implementation strategy where each application consists of one configuration. The main objective is to improve the performance.

[Figure 1-3. Principle of static reconfiguration. Labels recovered from the figure: Configure, Execute.]

The distinctive feature of this configuration is that it consists of a single system-wide configuration. Prior to commencing an operation, the reconfigurable resources are loaded with their respective configurations. Once operation commences, the reconfigurable resources will remain in this configuration throughout the operation of the application. Thus hardware resources remain static for the life of the design (or application). This is depicted in Figure 1-3.

Much higher performance than with pure software implementation (e.g. microprocessor approaches), cost advantage over
ASICs in certain cases and conventional CAD tool support are the main advantages of this technology.

2.3.2 Dynamic reconfiguration

Whereas static reconfiguration allocates logic for the duration of an application, dynamic reconfiguration (often referred to as run-time reconfiguration) uses a dynamic allocation scheme that re-allocates hardware at run-time. This is an advanced technique that some people regard as a flexible realization of the time/space trade-off. It can increase system performance by using highly optimized circuits that are loaded and unloaded dynamically during the operation of the system, as depicted in Figure 1-4. In this way system flexibility is maintained and functional density is increased [9].

[Figure 1-4. Principle of dynamic reconfiguration. Labels recovered from the figure: Configure, Execute.]

Dynamic reconfiguration is based upon the concept of virtual hardware, which is similar to the idea of virtual memory. Here, the physical hardware is much smaller than the sum of the resources required by all of the configurations. Therefore, rather than reducing the number of configurations that are mapped, we swap them in and out of the actual hardware as they are needed.

There are two main design problems for this approach. The first is to divide the algorithm into time-exclusive segments that do not need to (or cannot) run concurrently. This is referred to as temporal partitioning. Because no CAD tools support this step, it requires tedious and error-prone user involvement. The second problem is to co-ordinate the behaviour between different configurations, i.e. to manage the transmission of intermediate results from one configuration to the next [8].
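The virtual hardware idea above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the replacement policy (least recently used), the function name run_schedule and the configuration names in the schedule are hypothetical and do not come from the text. It counts how many configuration downloads are needed when the physical hardware can hold only a few of the configurations at a time.

```python
from collections import OrderedDict


def run_schedule(schedule, slots):
    """Count configuration downloads for a sequence of time-exclusive segments.

    schedule: configuration names in the order they are needed
              (hypothetical names, one per temporal partition).
    slots:    number of configurations the physical hardware holds at once.
    Assumes a least-recently-used replacement policy (an illustration only).
    """
    loaded = OrderedDict()  # configurations currently resident on the hardware
    downloads = 0
    for cfg in schedule:
        if cfg in loaded:
            loaded.move_to_end(cfg)     # already resident: execute directly
            continue
        if len(loaded) == slots:
            loaded.popitem(last=False)  # swap out the least recently used one
        loaded[cfg] = True              # swap in (download) the new configuration
        downloads += 1
    return downloads


schedule = ["fft", "viterbi", "fft", "interleaver", "viterbi", "fft"]
for slots in (1, 2, 3):
    print(slots, "slot(s):", run_schedule(schedule, slots), "downloads")
```

With fewer slots than distinct configurations, the same schedule needs more downloads, which is the reconfiguration overhead that temporal partitioning tries to keep small.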
3. CLASSIFICATION OF RECONFIGURABLE ARCHITECTURES

In this section reconfigurable hardware architectures are classified with respect to several parameters. These parameters are described below:

• Granularity of building blocks
This refers to the levels of manipulation of data. In this chapter we distinguish three types of granularity: fine-grain, which corresponds to bit-level manipulation of data; medium-grain, which manipulates data with a varying number of bits; and coarse-grain, which implies word level operations.

• Reconfiguration scheme
Systems can be reconfigured statically or dynamically. Dynamically reconfigurable systems permit the partial reconfiguration of certain logic blocks while others are performing computations. Statically reconfigurable devices require execution to be interrupted.

• Coupling
This refers to the degree of coupling with a host microprocessor. In a closely coupled system, reconfigurable units are placed on the data path of the processor, acting as execution units. Loosely coupled systems act as a coprocessor. They are connected to a host computer system through channels or some special-purpose hardware.

3.1 Classification with respect to building blocks granularity

The granularity criterion reflects the smallest block of which a reconfigurable device is made.

In fine-grained architectures, the basic programmed building block usually consists of a combinatorial network and a few flip-flops. The logic block can be programmed into a simple logic function, such as a 2-bit adder. These blocks are connected through a reconfigurable interconnection network. More complex operations can be constructed by reconfiguring this network. Commercially available Field Programmable Gate Arrays (FPGAs) are based on fine-grain architectures. Although highly flexible, these systems exhibit a low efficiency when it comes to more specific tasks. For example, although an 8-bit adder can be implemented in a fine-grained circuit, it will be inefficient, compared to a reconfigurable array of 8-bit adders, when performing an addition-intensive task. An 8-bit adder will also occupy more space in the fine-grained implementation.
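To make the 8-bit adder example concrete, the following sketch models a fine-grained implementation in software. It is an illustration under assumptions, not a description of any particular device: each bit position is assumed to use one logic block built from two 3-input look-up tables (sum and carry), and the names make_lut, SUM_LUT, CARRY_LUT and fine_grained_add are hypothetical.

```python
from itertools import product


def make_lut(func, n_inputs):
    """Build a look-up table: one stored output bit per input combination."""
    return {bits: func(*bits) for bits in product((0, 1), repeat=n_inputs)}


# Two 3-input LUTs model one fine-grained block configured as a full adder.
SUM_LUT = make_lut(lambda a, b, c: a ^ b ^ c, 3)
CARRY_LUT = make_lut(lambda a, b, c: (a & b) | (c & (a ^ b)), 3)


def fine_grained_add(x, y, width=8):
    """Add two unsigned integers by chaining `width` LUT-based blocks.

    Returns (result modulo 2**width, carry out).
    """
    carry, result = 0, 0
    for i in range(width):               # one block per bit position
        a, b = (x >> i) & 1, (y >> i) & 1
        result |= SUM_LUT[(a, b, carry)] << i
        carry = CARRY_LUT[(a, b, carry)]  # carry ripples to the next block
    return result, carry


print(fine_grained_add(100, 27))   # (127, 0)
print(fine_grained_add(200, 100))  # (44, 1), i.e. 300 = 256 + 44
```

In this model an 8-bit addition occupies eight blocks and the carry has to ripple through all of them, whereas a coarse-grained array would provide the same operation as a single building block.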
Reconfigurable systems which use logic blocks of larger granularity are categorized as medium-grained [6,7,10,11,17]. For example, Garp [6] is designed to perform a number of different operations on up to four 2-bit inputs. Another medium-grained structure was designed specifically to implement multipliers of a configurable bit-width [7]. The logic block used in the multiplier FPGA is capable of implementing a 4x4 multiplication. The CHESS architecture [11] also operates on 4-bit values, with each of its cells acting as a 4-bit ALU. The major advantage of medium-grained systems when compared to the fine-grained architecture is that they better utilize the chip area, since they are optimized for the specific operations. However, a drawback of this approach is a high overhead when synthesizing operations which are incompatible with the simplest logic block architecture.

Coarse-grained architectures are primarily intended for the implementation of tasks dominated by word-width operations. Because the logic blocks used are optimized for large computations, they will perform these operations much more quickly (and consume less chip area) than a set of smaller cells connected to form the same type of structure. However, because their composition is static, they are unable to leverage optimizations in the size of operands. On the other hand, these coarse-grained architectures can be much more efficient than finer-grained architectures for implementing functions closer to their basic word size. An example of a coarse-grained system is the RaPiD architecture [4].

A very coarse granularity is the case when the simplest logic block is based on an entire microprocessor with or without special accelerators. Examples of such architectures are the REMARC [12] and RAW [13] architectures.

3.2 Classification with respect to reconfiguration scheme

3.2.1 Statically reconfigurable architectures

Traditional reconfigurable architectures are statically reconfigurable, which means that the reconfigurable resources are configured at the start of execution and remain unchanged for the duration of the application. In order to reconfigure a statically reconfigurable architecture, the system has to be halted while the reconfiguration is in progress and then restarted with the new configuration.

Traditional FPGA architectures have primarily been serially programmed single-context devices, allowing only one configuration to be loaded at a time. This type of FPGA is programmed using a serial stream of configuration information, requiring a full reconfiguration if any change is required.

3.2.2 Dynamically reconfigurable architectures

Dynamically reconfigurable (run-time reconfigurable) architectures allow reconfiguration and execution to proceed at the same time. The different reconfigurable styles of dynamic reconfiguration are depicted in Figure 1-5 and discussed in the following paragraphs.

[Figure 1-5. The different basic models of dynamically reconfigurable computing]

Single context dynamically reconfigurable architectures

Although single context architectures can typically be reconfigured only statically, a run-time reconfiguration onto a single context FPGA can also be implemented. Typically, the configurations are grouped into contexts, and each context is swapped as needed. Attention has to be paid to the proper partitioning of the configurations between the contexts in order to minimize the reconfiguration delay.

Multi-context dynamically reconfigurable architectures

A multi-context architecture includes multiple memory bits for each programming bit location. These memory bits can be thought of as multiple planes of configuration information [3,15]. Only one plane of configuration information can be active at a given moment, but the architecture can
quickly switch between different planes, or contexts, of already-programmed configurations. In this manner, the multi-context architecture can be considered a multiplexed set of single-context architectures, which requires that a context be fully reprogrammed to perform any modification to the configuration data. However, this requires a great deal more area than the other structures, given that there must be as many storage units per programming location as there are contexts. This also means that multi-context schemes are mainly used in coarse-grain architectures.

Partially Reconfigurable Architectures

In some cases, configurations do not occupy the full reconfigurable hardware, or only a part of a configuration requires modification. In both of these situations a partial reconfiguration of the reconfigurable resources is desired, rather than the full reconfiguration supported by the serial architectures mentioned above.

In partially reconfigurable architectures, the underlying programming layer operates like a RAM device. Using addresses to specify the target location of the configuration data allows for selective reconfiguration of the reconfigurable resources. Frequently, the undisturbed portions of the reconfigurable resources may continue execution, allowing the overlap of computation with reconfiguration. When configurations do not require the entire area available within the array, a number of different configurations may be loaded into otherwise unused areas of the hardware.

Partially run-time reconfigurable architectures can allow for complete reconfiguration flexibility, such as the Xilinx 6200 [18], or may require a full column of configuration information to be reconfigured at once, as in the Xilinx Virtex FPGA [19].

4. COUPLING

The type of coupling of the Reconfigurable Processing Unit (RPU) to the computing system has a big impact on the communication cost. It can be classified into one of the four groups listed below, which are presented in order of decreasing communication costs and illustrated in Figure 1-6:

• RPUs coupled to the I/O bus of the host (Figure 1-6.a). This group includes many commercial circuit boards. Some of them are connected to the PCI bus of a PC or workstation.
• RPUs coupled to the local bus of the host (Figure 1-6.b).
• RPUs coupled like co-processors (Figure 1-6.c) such as the REMARC - Reconfigurable Multimedia Array Coprocessor [12].
• RPUs acting like an extended data-path of the processor (Figure 1-6.d) such as the OneChip [16], the PRISC - Programmable Reduced Instruction Set Computer [14], and the Chimaera [5].

[Figure 1-6. Coupling of the RPU to the host computer]

REFERENCES

1. Barat F, Lauwereins R (2000) Reconfigurable Instruction Set Processors: A Survey. In: Proceedings of IEEE International Workshop on Rapid System Prototyping, pp 168-173
2. Brodersen B (2002) Wireless Systems-on-a-Chip Design. In: Proceedings of 3rd International Symposium on Quality of Electronic Design, pp 221-222
3. DeHon A (1996) DPGA Utilization and Application. In: Proceedings of ACM/SIGDA International Symposium on FPGAs, pp 115-121
4. Ebeling C, Cronquist DC, Franklin P (1996) RaPiD Reconfigurable Pipelined Datapath. In: Lecture Notes in Computer Science 1142 - Field Programmable Logic: Smart Applications, New Paradigms and Compilers, Springer Verlag, pp 126-135
5. Hauck S, Fry TW, Hosler MM, Kao JP (1997) The Chimaera Reconfigurable Functional Unit. In: Proceedings of the 5th IEEE Symposium on Field Programmable Custom Computing Machines, pp 87-96
6. Hauser JR, Wawrzynek J (1997) Garp: A MIPS Processor with a Reconfigurable Coprocessor. In: Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pp 12-21
7. Haynes SD, Cheung PYK (1998) A reconfigurable multiplier array for video image processing tasks, suitable for embedding in an FPGA structure. In: Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pp 226-235
8. Hutchings BL, Wirthlin MJ (1995) Implementation approaches for reconfigurable logic applications. Brigham Young University, Dept. of Electrical and Computer Engineering
9. Khatib J (2001) Configurable Computing. Available at: http://www.geocities.com/siliconvalley/pines/6639/fpga
10. Lucent Technologies Inc (1998) FPGA Data Book, Allentown, Pennsylvania
11. Marshall A, Stansfield T, Kostarnov I, Vuillemin J, Hutchings B (1999) A Reconfigurable Arithmetic Array for Multimedia Applications. In: Proceedings of ACM/SIGDA International Symposium on FPGAs, pp 135-143
12. Miyamori T, Olukotun K (1998) A quantitative analysis of reconfigurable coprocessors for multimedia applications. In: Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pp 2-11
13. Moritz CA, Yeung D, Agarwal A (1998) Exploring optimal cost performance designs for raw microprocessors. In: Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pp 12-27
14. Razdan R, Brace K, Smith MD (1994) PRISC Software Acceleration Techniques. In: Proceedings of the IEEE International Conference on Computer Design, pp 145-149
15. Trimberger S, Carberry D, Johnson A, Wong J (1997) A Time-Multiplexed FPGA. In: Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pp 22-29
16. Witting RD, Chow P (1996) OneChip: An FPGA Processor with Reconfigurable Logic. In: Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pp 126-135
17. Xilinx Inc. (1994) The Programmable Logic Data Book
18. Xilinx Inc. (1996) XC6200: Advanced product specification v1.0. In: The Programmable Logic Data Book
19. Xilinx Inc. (1999) Virtex(TM): Configuration Architecture Advanced Users Guide

Chapter 2
RECONFIGURABLE HARDWARE EXPLOITATION IN WIRELESS MULTIMEDIA COMMUNICATIONS

Konstantinos Masselos (1,2) and Nikolaos S. Voros (1)
(1) INTRACOM S.A., Hellenic Telecommunications and Electronics Industry, Greece
(2) Currently with Imperial College of Science Technology and Medicine, United Kingdom

N.S. Voros and K. Masselos (eds.), System Level Design of Reconfigurable Systems-on-Chips, 27-42. © 2005 Springer. Printed in the Netherlands.

Abstract: This chapter presents cases where reconfigurable hardware can be exploited for the efficient realization of wireless multimedia communication systems. The various scenarios described refer to (a) the DLC/MAC layer and the baseband part of the physical layer of the HIPERLAN/2 and IEEE 802.11a WLAN protocols, and (b) the application layer of a sophisticated personal device. The goal of this chapter is to provide an insight into the advantages reconfigurable hardware may bring in real life applications.

Keywords: Reconfiguration, WLAN, application layer, wireless multimedia communications

1. RECONFIGURABLE HARDWARE BENEFITS FROM A SYSTEM'S PERSPECTIVE

The presence of reconfigurable hardware resources in a system can be exploited in two major directions:

• To create space for post-fabrication functional modifications, e.g. to upgrade system functionality or for software-like bug fixing. Software realizations allow post-fabrication functional modifications; however, for complex tasks software realizations might be inefficient. This feature may allow important time-to-market improvement.
• To allow sharing of hardware resources among tasks that are not active simultaneously, thus reducing the total area cost of the system. Such tasks may belong to different modes of operation of a given system, to different applications or standards realized on the same platform, or even to time non-overlapping tasks of a single system.

Given an application, tasks that are suitable for realization on reconfigurable hardware are those that may share hardware resources with other tasks over time or are likely to be modified/upgraded in the future, and that also have high computational complexity (which prevents efficient realization on instruction set processors).

In the rest of this chapter, reconfiguration scenarios are discussed from the wireless communications and multimedia domains. Real life complex systems are used for this analysis, namely the HIPERLAN/2 and IEEE 802.11a WLAN systems (covering MAC and physical layers functionality) and the MPEG system (covering the application layer).

2. RECONFIGURATION SCENARIOS FOR HIPERLAN/2 AND IEEE 802.11a WLAN SYSTEMS

In this section reconfiguration scenarios for the HIPERLAN/2 and IEEE 802.11a WLAN systems are discussed. The targeted functionalities of the two systems cover the DLC/MAC layer and the baseband part of the physical layer.

2.1 HIPERLAN/2 and IEEE 802.11a systems

HIPERLAN/2 [1] is a connection-oriented time-division multiple access (TDMA) system. The physical layer is based on a coded OFDM modulation scheme [2]. The physical layer is of multi-rate type, allowing control of the link capability between access point and mobile terminal according to interference situations and distance.

The flow graph of the HIPERLAN/2 transmitter is shown in Figure 2-1. The blocks at the inputs and outputs of the different tasks give the input and output rates of the tasks respectively. The input rate of a given task corresponds to the minimum amount of data required for the task to produce a given output (output rate).

The computational complexity and the type of processing of the transmitter tasks are analytically presented in Table 2-1. The analysis of computational complexity is done by estimating the number of required basic operations per output data item in each function. The basic operations include arithmetic, logic and memory read/write operations. It is assumed that processing of transmitted or received data should be possible at the sustained nominal data rate of each physical layer mode. The input and output operations included in this complexity analysis correspond to data coming from previous tasks and being passed to following tasks (in a real implementation these operations are likely to correspond to accesses to data storage locations).

[Figure 2-1. HIPERLAN/2 transmitter. Blocks recovered from the flow graph: MAC/PHY interface, Tx memory, tail bits appending, 2 to 1 MUX, data scrambler, convolutional encoder, 1 to 2 DEMUX, rate independent puncturing P1, rate dependent puncturing P2, interleaver (N_CBPS bits, 288 bits worst case), constellation mapper (I and Q parts of each sample), pilot insertion (48 to 64 I's and Q's), IFFT (64 real and 64 imaginary samples), cyclic prefix insertion (80 real and 80 imaginary samples), PHY burst formation, preambles memory.]

From the computational complexity analysis it can be seen that there are some algorithms that generate a constant computational complexity in all physical layer modes. The most important is the IFFT, which dominates the overall transmit side complexity in the low bit rate modes. The complexities of the channel coding functions are naturally related to the used bit rate.
Table 2-1. Computational complexity of transmitter tasks in different physical layer modes
(computational complexity in MOPS per PHY mode; PHY modes in Mb/s)

Task | Type of processing | 6 | 9 | 12 | 18 | 27 | 36 | 54
Scrambling | bit level: shift register, XOR | 108 | 162 | 216 | 324 | 486 | 648 | 972
Convolutional encoding | bit level: shift register, XOR | 174 | 261 | 348 | 522 | 783 | 1044 | 1566
Puncturing (rate independent) | bit level: logic operations | 0.31 | 0.31 | 0.31 | 0.31 | 0.31 | 0.31 | 0.31
Puncturing (rate dependent) | bit level: logic operations | 0 | 33 | 0 | 66 | 105 | 132 | 198
Interleaving | group of bits: LUT accesses | 48 | 48 | 96 | 96 | 192 | 192 | 288
Constellation mapping | group of bits: LUT accesses | 30 | 45 | 36 | 54 | 54 | 72 | 90
Pilot insertion | word level: memory accesses | 56 | 56 | 56 | 56 | 56 | 56 | 56
IFFT | word level: multiplications, additions, memory accesses | 922 | 922 | 922 | 922 | 922 | 922 | 922
Cyclic prefix insertion | word level: memory accesses | 72 | 72 | 72 | 72 | 72 | 72 | 72
Sum | | 1410 | 1599 | 1746 | 2112 | 2670 | 3138 | 4164

[Figure 2-2. HIPERLAN/2 receiver. Blocks recovered from the flow graph: timing and frequency synchronization and correction (80 complex samples, 160 words), cyclic prefix extraction (64 complex samples, 128 words), FFT (64 complex samples, 128 words), channel estimation and frequency domain equalization (1 complex sample, 2 words), constellation decoder, de-interleaver (N_CBPS bits, 288 bits worst case), rate dependent depuncturing, rate independent depuncturing, Viterbi decoder, descrambler, MAC/PHY interface.]
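The "Sum" row of Table 2-1 and the remark that the IFFT dominates the transmit side complexity in the low bit rate modes can be cross-checked with a few lines of arithmetic. The sketch below simply transcribes the table values; the dictionary keys and the function name column_sums are illustrative choices, not names used in the book.

```python
PHY_MODES = [6, 9, 12, 18, 27, 36, 54]  # nominal bit rates in Mb/s

# Rows of Table 2-1 (MOPS per PHY mode), transcribed in table order.
TX_MOPS = {
    "scrambling": [108, 162, 216, 324, 486, 648, 972],
    "convolutional encoding": [174, 261, 348, 522, 783, 1044, 1566],
    "puncturing, first row": [0.31] * 7,
    "puncturing, second row": [0, 33, 0, 66, 105, 132, 198],
    "interleaving": [48, 48, 96, 96, 192, 192, 288],
    "constellation mapping": [30, 45, 36, 54, 54, 72, 90],
    "pilot insertion": [56] * 7,
    "IFFT": [922] * 7,
    "cyclic prefix insertion": [72] * 7,
}


def column_sums(table):
    """Total complexity per PHY mode (one sum per table column)."""
    return [sum(column) for column in zip(*table.values())]


tx_sums = column_sums(TX_MOPS)
for mode, total, ifft in zip(PHY_MODES, tx_sums, TX_MOPS["IFFT"]):
    print(f"{mode:>2} Mb/s: total {total:7.1f} MOPS, IFFT share {ifft / total:.0%}")
```

The rounded totals reproduce the "Sum" row, and the IFFT share falls as the bit rate rises because the channel coding rows scale with the bit rate while the IFFT row stays constant.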
The flow graph of a reference HIPERLAN/2 receiver is presented in Figure 2-2. The receiver chain of HIPERLAN/2 is left open by the standard, so there is more freedom for algorithm selection for certain blocks such as the timing and frequency synchronization and the channel estimation (different chains of tasks can be adopted for these two generic blocks). The computational complexity and the type of processing of the receiver tasks are analytically presented in Table 2-2.

Table 2-2. Computational complexity of receiver tasks in different physical layer modes
(computational complexity in MOPS per PHY mode; PHY modes in Mb/s)

Task | Type of processing | 6 | 9 | 12 | 18 | 27 | 36 | 54
Cyclic prefix extraction | word level: memory accesses | 96 | 96 | 96 | 96 | 96 | 96 | 96
Frequency error correction | word level: multiplications, additions, memory accesses | 208 | 208 | 208 | 208 | 208 | 208 | 208
FFT | word level: multiplications, additions, memory accesses | 922 | 922 | 922 | 922 | 922 | 922 | 922
Frequency domain equalization | word level: multiplications, additions, memory accesses | 132 | 132 | 132 | 132 | 132 | 132 | 132
Constellation demapping | group of bits: LUT accesses | 48 | 48 | 240 | 240 | 288 | 288 | 336
Deinterleaving | group of bits: LUT accesses | 48 | 48 | 96 | 96 | 192 | 192 | 288
Depuncturing (rate dependent) | bit level: logic operations | 0 | 50 | 0 | 99 | 118 | 198 | 297
Depuncturing (rate independent) | bit level: logic operations | 0.16 | 0.20 | 0.16 | 0.20 | 0.28 | 0.20 | 0.20
Viterbi decoding | bit level I/O, word level additions, comparisons | 1170 | 1755 | 2340 | 3510 | 5265 | 7020 | 10530
Descrambling | bit level: shift register, XOR | 108 | 162 | 216 | 324 | 486 | 648 | 972
Sum | | 2732 | 3421 | 4250 | 5627 | 7707 | 9704 | 13781
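The relative weight of Viterbi decoding and the receiver-to-transmitter complexity ratio can be read off the two tables with a short calculation. The sketch below uses only the "Sum" rows of Tables 2-1 and 2-2 and the Viterbi decoding row of Table 2-2; the variable and function names are illustrative.

```python
PHY_MODES = [6, 9, 12, 18, 27, 36, 54]  # nominal bit rates in Mb/s

TX_SUM = [1410, 1599, 1746, 2112, 2670, 3138, 4164]    # Table 2-1, "Sum" row
RX_SUM = [2732, 3421, 4250, 5627, 7707, 9704, 13781]   # Table 2-2, "Sum" row
VITERBI = [1170, 1755, 2340, 3510, 5265, 7020, 10530]  # Table 2-2, Viterbi decoding


def ratios(numerators, denominators):
    """Element-wise ratio of two equally long rows."""
    return [n / d for n, d in zip(numerators, denominators)]


viterbi_share = ratios(VITERBI, RX_SUM)  # fraction of receiver MOPS spent in Viterbi
rx_over_tx = ratios(RX_SUM, TX_SUM)      # receiver complexity relative to transmitter

for mode, share, ratio in zip(PHY_MODES, viterbi_share, rx_over_tx):
    print(f"{mode:>2} Mb/s: Viterbi share {share:.0%}, receiver/transmitter {ratio:.2f}")
```

Both quantities grow with the bit rate, since the Viterbi decoding row scales with the bit rate while the FFT and synchronization related rows stay constant.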
As can be deduced from the tables, Viterbi decoding dominates the overall complexity in all physical layer modes. It can also be seen that receiver side processing is up to three times more complex than transmit side processing.

Figure 2-3. IEEE 802.11a and HIPERLAN/2 preambles (panels: IEEE 802.11a preamble; HIPERLAN/2 broadcast burst preamble; HIPERLAN/2 downlink burst preamble; HIPERLAN/2 uplink burst short preamble; HIPERLAN/2 uplink burst long preamble and direct link burst preamble)

The baseband part of the IEEE 802.11a system [3] is very similar to that of the HIPERLAN/2 system; only some minor differences exist. IEEE 802.11a uses only one preamble sequence (shown in Figure 2-3) of 320 samples. HIPERLAN/2 uses four different types of preamble sequences for the different types of PDUs, with sizes ranging from 160 samples to 320 samples. The contents of the first half of the PREAMBLE sequences of HIPERLAN/2 are always different from those of IEEE 802.11a. From an implementation point of view this may affect the synchronization block of the receiver. Different sequences are used by the two systems for the initialization of the (de)scrambler. In IEEE 802.11a the initialization is performed using the first 7 bits of the service field, which are always set to zero. In HIPERLAN/2 the initial state of the scrambler is set to a pseudo-random non-zero 7-bit state determined by the frame counter field in the BCH (the first four bits of BCH) at the beginning of the corresponding MAC frame. The initial state is derived
by appending the first four bits of BCH to the fixed binary number (111)_2. This difference is small from an implementation point of view. On the encoder side, IEEE 802.11a supports 1/2, 3/4 and 2/3 code rates while HIPERLAN/2 supports 1/2, 3/4 and 9/16 code rates. Two code rates are common, while each system supports a different third one. HIPERLAN/2 applies two puncturing stages (a rate independent one followed by a rate dependent one) while IEEE 802.11a applies a single puncturing stage. The puncturing patterns applied by the two systems to achieve the different code rates are presented in Figure 2-4 (no puncturing pattern is required for the 1/2 code rate). The difference from an implementation point of view is small. The combinations of modulation, coding rate and achieved nominal bit rate (physical modes of operation) supported by IEEE 802.11a and HIPERLAN/2 are presented in Table 2-3. Six modes of operation are common; IEEE 802.11a supports two extra modes while HIPERLAN/2 supports one extra mode. From an implementation point of view the number of modes of operation supported affects the modem controller from which the modem control words are issued.

Puncturing pattern                               X               Y
HIPERLAN/2 rate independent puncturing patterns  1111110111111   1111111111110
HIPERLAN/2 9/16 puncturing pattern               111111110       111101111
Common 3/4 puncturing pattern                    110             101
IEEE 802.11a 2/3 puncturing pattern              11              10

Figure 2-4. Puncturing patterns used by IEEE 802.11a and HIPERLAN/2

The MAC frame duration of HIPERLAN/2 is fixed to 2 ms. The HIPERLAN/2 MAC frame structure described in Figure 2-5 comprises time
slots for broadcast control (BCH), frame control (FCH), access feedback control (ACH) and data transmission in downlink (DL), uplink (UL) and direct link (DiL) phases, which are allocated dynamically depending on the need for transmission resources. A mobile terminal (MT) first has to request capacity from the access point (AP) in order to send data. This can be done in the random access channel (RCH), where contention for the same time slot is allowed. Downlink, uplink and direct link phases consist of two types of PDUs. The long PDUs have a size of 54 bytes and contain control or user data. The payload is 49.5 bytes and the remaining 4.5 bytes are used for the PDU Type (2 bits), a sequence number (10 bits, SN) and a cyclic redundancy check (CRC-24). Long PDUs are referred to as the long transport channel (LCH). Short PDUs contain only control data and have a size of 9 bytes. They may contain resource requests, ARQ messages etc. and they are referred to as the short transport channel (SCH). A physical burst is composed of the PDU train payload and a preamble, and is the unit to be transmitted via the physical layer.

Table 2-3. Physical modes of operation of IEEE 802.11a and HIPERLAN/2

Modulation   Coding rate R   Nominal bit rate (Mbit/s)   Coded bits per OFDM symbol   Remarks
BPSK         1/2             6                           48
BPSK         3/4             9                           48
QPSK         1/2             12                          96
QPSK         3/4             18                          96
16QAM        9/16            27                          192                          HL/2 only
16QAM        1/2             24                          192                          IEEE 802.11a only
16QAM        3/4             36                          192
64QAM        3/4             54                          288
64QAM        2/3             48                          288                          IEEE 802.11a only

The structure of the IEEE 802.11a PPDU frame is described in Figure 2-6. The header contains information about the length of the exchanged data and the transmission rate. The RATE field conveys information about the type of the modulation and the coding rate used in the rest of the packet. The LENGTH field takes a value between 1 and 4095 and specifies the number of bytes to be exchanged (PSDU). The six tail bits are used to reset the convolutional encoder and to terminate the code trellis in the decoder. The first 7 bits of the service field are set to zero and are used to initialise the (de)scrambler. The remaining 9 bits are reserved for future use.
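The (de)scrambler initialisation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the generator polynomial x^7 + x^4 + 1 is taken from the IEEE 802.11a standard (it is not stated in this chapter, and is assumed here to be shared by HIPERLAN/2), and the function names and the bit ordering inside the 7-bit register are assumptions made for the example.

```python
def scramble(bits, state):
    """Scramble (or descramble) a bit list with the polynomial x^7 + x^4 + 1.

    state is a list of 7 bits [x1, ..., x7]; x1 holds the most recent
    feedback bit. Returns (output_bits, final_state).
    """
    s = list(state)
    out = []
    for b in bits:
        feedback = s[3] ^ s[6]      # x4 XOR x7
        out.append(b ^ feedback)    # same operation scrambles and descrambles
        s = [feedback] + s[:6]      # shift the register by one position
    return out, s


def hiperlan2_initial_state(frame_counter_bits):
    """HIPERLAN/2 style: (111) followed by the first four BCH bits.

    How these 7 bits map onto register positions is an assumption.
    """
    if len(frame_counter_bits) != 4:
        raise ValueError("expected the first four bits of BCH")
    return [1, 1, 1] + list(frame_counter_bits)


def state_from_service_field(first7_scrambled):
    """IEEE 802.11a style: the first 7 service bits are zero before scrambling,
    so the 7 received bits are the scrambler's own feedback sequence and
    reveal the register contents (most recent bit first)."""
    return list(reversed(first7_scrambled))


# Transmitter: 7 zero service bits followed by some payload bits.
payload = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
tx_state = [1, 0, 1, 1, 1, 0, 1]                 # arbitrary non-zero state
scrambled, _ = scramble([0] * 7 + payload, tx_state)

# Receiver: recover the state from the first 7 bits, then descramble the rest.
rx_state = state_from_service_field(scrambled[:7])
recovered, _ = scramble(scrambled[7:], rx_state)
print(recovered == payload)
```

The HIPERLAN/2 case needs no such recovery step: both sides compute the same initial state from the frame counter and run `scramble` from the first bit.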
The pad bits are used to ensure that the number of bits in the PPDU frame maps to an integer number of OFDM symbols. A cyclic redundancy check (CRC-32) is included in the IEEE 802.11a PSDU.

Figure 2-5. HIPERLAN/2 MAC frame, Long PDU and Physical Burst format (MAC frame of 2 ms: BCH, FCH, ACH, DL phase, DiL phase, UL phase, RCH; long PDU (LCH) of 54 bytes: PDU Type, SN, Payload (49.5 bytes), CRC (3 bytes); physical burst: preamble followed by PDU train)

An important issue is that the transmission duration (TXTIME) of a PPDU frame in IEEE 802.11a is not fixed but is a function of the LENGTH field, as shown in the following equation:

$$\mathrm{TXTIME} = T_{\mathrm{PREAMBLE}} + T_{\mathrm{SIGNAL}} + T_{\mathrm{SYM}} \times \left\lceil \frac{16 + 8 \times \mathrm{LENGTH} + 6}{N_{\mathrm{DBPS}}} \right\rceil \qquad (1)$$

where N_DBPS is the number of data bits per symbol and can be derived from the DATARATE parameter. From an implementation point of view this imposes a strict timing requirement on the MAC/PHY interface for the decoding of the SIGNAL symbol in order to determine the number of OFDM symbols to be exchanged.

Figure 2-6. IEEE 802.11a PPDU frame format (HEADER: RATE (4 bits), Reserved (1 bit), LENGTH (12 bits), Parity (1 bit), Tail (6 bits), SERVICE (16 bits); then PSDU, Tail (6 bits) and Pad bits. PREAMBLE: 12 symbols; SIGNAL: one OFDM symbol, BPSK, rate 1/2; DATA: variable number of OFDM symbols, rate and mode indicated by RATE)
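Equation (1) can be evaluated with a few lines of code. The sketch below is illustrative only: the timing constants (16 µs preamble, 4 µs SIGNAL, 4 µs per OFDM symbol) come from the IEEE 802.11a standard, not from this chapter, N_DBPS is derived as coded bits per OFDM symbol times coding rate using the IEEE 802.11a rows of Table 2-3, and all names are hypothetical.

```python
from fractions import Fraction

# Timing constants in microseconds (assumed from IEEE 802.11a, not from this chapter).
T_PREAMBLE_US = 16
T_SIGNAL_US = 4
T_SYM_US = 4

# Nominal bit rate (Mbit/s) -> (coded bits per OFDM symbol, coding rate),
# copied from the IEEE 802.11a rows of Table 2-3.
MODES_80211A = {
    6: (48, Fraction(1, 2)),
    9: (48, Fraction(3, 4)),
    12: (96, Fraction(1, 2)),
    18: (96, Fraction(3, 4)),
    24: (192, Fraction(1, 2)),
    36: (192, Fraction(3, 4)),
    48: (288, Fraction(2, 3)),
    54: (288, Fraction(3, 4)),
}


def n_dbps(rate_mbps):
    """Data bits per OFDM symbol = coded bits per symbol x coding rate."""
    coded_bits, coding_rate = MODES_80211A[rate_mbps]
    return int(coded_bits * coding_rate)


def txtime_us(length_bytes, rate_mbps):
    """Evaluate equation (1): PPDU duration in microseconds."""
    if not 1 <= length_bytes <= 4095:
        raise ValueError("LENGTH must be between 1 and 4095")
    data_bits = 16 + 8 * length_bytes + 6          # SERVICE + PSDU + tail
    n_sym = -(-data_bits // n_dbps(rate_mbps))     # integer ceiling
    return T_PREAMBLE_US + T_SIGNAL_US + T_SYM_US * n_sym


for rate in (6, 54):
    print(rate, "Mbit/s:", txtime_us(100, rate), "us for a 100-byte PSDU")
```

Because the number of DATA symbols is only known once RATE and LENGTH have been decoded from the SIGNAL symbol, a receiver would have to run this calculation within roughly one symbol period.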
The major differences between the IEEE 802.11a and HIPERLAN/2 systems occur in the MAC sublayer. In HIPERLAN/2 the medium access is based on a TDD/TDMA approach. The control is centralized in an AP, which informs the MTs at which point in time in the MAC frame they are allowed to transmit their data. IEEE 802.11a uses a distributed MAC protocol based on Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA).

2.2 WLAN Reconfiguration scenarios

Some reconfiguration scenarios for the MAC and baseband parts of the HIPERLAN/2 and IEEE 802.11a WLAN systems are described in this section. HIPERLAN/2 and IEEE 802.11a baseband processing algorithms are quite simple as far as control flow is concerned, and their functionality does not depend in principle on the physical layer mode that is used in transmission or reception. The computational complexity of the baseband processing, however, depends strongly on the physical layer mode used in transmission or reception.

Figure 2-7. Realization on a highly flexible platform (algorithm level: complex tasks 1 to N; architecture level: ISP, reconfigurable hardware, distributed shared memory, interconnect network, I/O)

The most computationally complex tasks are the Viterbi decoding and the FFT on the receiver side and the IFFT on the transmitter side. Assuming a highly flexible implementation using instruction set processors (ISP) and reconfigurable hardware (alongside interconnect, memory, I/Os etc.), these tasks should be assigned to reconfigurable hardware (for increased speed and reduced power). This scenario is illustrated in Figure 2-7. However, almost no flexibility is required for these tasks on a stand-alone basis (no different candidate implementation choices exist). If ASIC blocks were included in the target implementation platform, these tasks should preferably be moved to them.
Reconfigurable hardware resources can be shared among baseband processing tasks that are not active simultaneously. This may lead to silicon area optimization (taking into consideration reconfiguration related overheads). This scenario is described in Figure 2-8. For example, under a half duplexing scenario the transmitter and the receiver will not be active simultaneously. In this case, tasks of the transmitter and the receiver may share the same reconfigurable hardware resources.

Figure 2-8. Reconfigurable hardware sharing among tasks with non-overlapping lifetimes (algorithm level: group of tasks with non-overlapping lifetimes; architecture level: ISP, reconfigurable hardware, dedicated hardware, distributed shared memory, interconnect network, I/O)

Figure 2-9. Realization of different algorithmic instances of the same task on reconfigurable hardware (algorithm level: task instances 1 to N; architecture level: ISP, reconfigurable hardware, dedicated hardware, distributed shared memory, interconnect network, I/O)

Certain tasks in the receiver chain of the baseband processing allow different algorithmic implementations with different trade-offs between algorithmic performance and computational complexity (e.g. channel estimation). Lower algorithmic performance requirements (e.g. SNR, BER) may allow the use of less sophisticated and less computationally complex algorithmic instances, leading to improved implementation efficiency (speed,
power). Furthermore, realization of different algorithmic instances for the same task in a given system may be beneficial, e.g. allowing adaptation to different operating conditions. Such tasks are good candidates for implementation on reconfigurable hardware (with their different instances sharing the same reconfigurable hardware resources) if their complexity is high (preventing efficient realization on instruction set processors). This scenario is described in Figure 2-9.

Figure 2-10. Post shipment modification scenario (algorithm level: tasks 1 to N, each a candidate for post fabrication modification; architecture level: ISP, reconfigurable hardware, dedicated hardware, distributed shared memory, interconnect network, I/O)

Figure 2-11. Multi-standard realization scenario (algorithm level: tasks of Standard 1 and Standard 2; architecture level: ISP, reconfigurable hardware, dedicated hardware, distributed shared memory, interconnect network, I/O)

Another opportunity for reconfigurable hardware exploitation is post-shipment modification or enhancement of the system's functionality (e.g. with more sophisticated realizations of certain tasks). Baseband processing tasks that are candidates for being upgraded are those that are left open by the standard. This scenario is described in Figure 2-10. More opportunities for reconfiguration and reconfigurable hardware sharing exist when multiple standards are realized on the same reconfigurable implementation platform. This scenario is described in Figure 2-11. Let us assume a HIPERLAN/2 – IEEE 802.11a dual standard
realization with the two systems not being active simultaneously. Given that the major differences between the two standards are in the MAC layers, reconfigurable hardware can be used for the realization of the most complex and performance demanding parts of the MAC layers (and the MAC to baseband interfaces) of the two systems.

3. RECONFIGURATION SCENARIOS AT THE APPLICATION LAYER

As portable devices become more powerful, it also becomes possible to run more computationally intensive services on these appliances. Due to the increasing flexibility requirements imposed by these applications, the devices need to be highly adaptable to the running applications. On the other hand, efficient realizations of these applications are required, especially in the resources they use during deployment, where power consumption must be traded against the perceived quality of the application. To be able to realize a variety of applications or services, the implementation platform needs to be highly adaptable.

Assume a wireless communication terminal as shown in Figure 2-12, which consists of instruction set processors (ISP) and reconfigurable hardware that are connected to a common interconnect network and to memory. This device is powerful enough to support various applications, including video. Because of the high computational demand of such a video application, it will be run on the reconfigurable hardware (see Figure 2-12), as that part can be configured for optimal performance for a given application.

When the user decides to view the video in a small window and to start up a 3D game, the situation changes. The video application can then be run with far fewer resources, while the game becomes the most computationally intensive application. This means that the 3D game will need to be run on the reconfigurable hardware. To enable that, the video application is moved to continue running in software on an instruction set processor (ISP). The hardware is then reconfigured for the 3D game and that application is started (see Figure 2-13). Moving the video application to software and running it in a smaller window also implies that a lower data rate can be used on the wireless terminal interconnect. This means that the wireless appliance should signal to the server that a lower resolution (and thus a lower bit-rate) is acceptable for the video application. The application quality as perceived by the user is still satisfactory.

Figure 2-12. A video application is running on the reconfigurable hardware

Figure 2-13. A 3D application is running on the reconfigurable hardware, while the video application continues in a reduced window and on a software processor

From the application scenario above, it is clear that it must be possible to run many different applications on the reconfigurable hardware. This means that general reconfigurable hardware is needed, in contrast to incorporating dedicated hardware blocks such as an FFT processor or FIR filter. We also notice that applications are very different in nature, as already described in the case of video streaming and interactive 3D applications. The selection of the reconfiguration characteristics is also based on general characteristics of the multimedia applications and on the usage scenario above.

Requirements on reconfiguration time are modest: because reconfiguration is user-initiated, fast reconfiguration times (< 1 msec) are not needed. When, for example, a video application is switched from hardware to software, it does not matter that a number of frames are not decoded. As soon as the application is running in software, it decodes the next incoming frame.

Requirements on the reconfiguration granularity are complicated by the unknown nature of the application: the granularity should be fine enough that an optimal implementation in reconfigurable hardware is possible for each application. However, due to power requirements, word level coarse grain reconfiguration is more appropriate than bit-level reconfiguration. This is especially the case when the word-lengths are matched to the application at hand.

Table 2-4. Operational power requirements for MPEG2 video decoding (MPEG-2 MP@ML decoder)

Function                         MOPS   Input   Output
Bitstream parsing and VLD        12     4       40
Dequantization and IDCT          105    40      70
Motion compensation              273    70      70
YUV to RGB color conversion      299    70      35
Total                            689    184     215

Table 2-5. Operational power requirements for a 3D application

Quality   CPU time   #triangles   #pixels   Architecture
31 dB     40 ms      5000         5%        SW
31 dB     2 ms       5000         5%        HW
25 dB     70 ms      5000         19%       SW
30 dB     80 ms      8000         19%       SW
43 dB     118 ms     17500        19%       SW
43 dB     21 ms      17500        19%       HW

To summarize the requirements on applications: different applications must be able to run on the wireless LAN platform, and they can also have huge computational demands for which dedicated or reconfigurable hardware is needed. For an indication of the required operational power we refer to the literature [4, 5], the results of which are summarized in Table 2-4 for MPEG2 and in Table 2-5 for a 3D application. In the latter application the CPU time, and thus the frame rate, is closely related to the required quality (application QoS) but also depends on the architecture, be it a hardware or a software realization.

REFERENCES

1. ETSI (2000), Broadband Radio Access Networks (BRAN); HIPERLAN type 2; Physical (PHY) layer, v1.2.1
2. Van Nee R, Prasad R (1999) OFDM for Mobile Multimedia Communications. Boston: Artech House
3. IEEE Std 802.11a/D7.0 (1999) Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) specifications: High Speed Physical Layer in the 5 GHz Band
4. Zhou CG, Kabir I, Kohn L, Jabbi A, Rice D, Hu XP (1995) MPEG video decoding with the UltraSPARC visual instruction set. In: Proceedings of the 40th IEEE Computer Society International Conference, pp. 470-477
5. Lafruit G, Nachtergaele L, Denolf K, Bormans J (2000) 3D Computational Graceful Degradation. In: Proceedings of ISCAS Workshop and Exhibition on MPEG-4, vol. 3, pp. 547-550
Chapter 3

RECONFIGURABLE HARDWARE TECHNOLOGIES

Konstantinos Masselos (1, 2) and Nikolaos S. Voros (1)
(1) INTRACOM S.A., Hellenic Telecommunications and Electronics Industry, Greece
(2) Currently with Imperial College of Science Technology and Medicine, United Kingdom

N.S. Voros and K. Masselos (eds.), System Level Design of Reconfigurable Systems-on-Chips, 43-83. © 2005 Springer. Printed in the Netherlands.

Abstract: A large number of reconfigurable hardware technologies have been proposed both in academia and commercially (some of them in their first market steps). They can be roughly classified in three major categories: a) Field Programmable Gate Arrays (FPGAs), b) integrated circuit devices with embedded reconfigurable resources and c) embedded reconfigurable cores for Systems-on-Chip (SoCs). In this chapter representative commercial technologies are discussed and their main features are presented. (Note: the information included in this chapter is up-to-date until November 2004.)

Keywords: Field Programmable Gate Arrays (FPGAs), embedded reconfigurable cores, fine grain reconfigurable architecture, coarse grain reconfigurable architecture

1. FIELD PROGRAMMABLE GATE ARRAYS (FPGAS)

Field programmable gate arrays currently represent the most popular and mature segment of reconfigurable hardware technologies. Technology advances keep increasing the gate counts and memory densities of FPGAs, while they also allow the integration of functions ranging from hardwired multipliers through high speed transceivers and all the way up to (hard or soft) CPU cores with associated peripherals. These advances make possible the realization of complete systems on a single FPGA chip, improving end-system size, power consumption, performance, reliability and cost. Equally
important, FPGAs can be reconfigured in seconds, either statically or dynamically/partially. Reconfiguration can take place in the workstation, on the assembly line or even at the end user premises. These capabilities provide flexibility:
• to react to last minute design changes
• to prototype ideas before implementation
• to meet time-to-market deadlines
• to correct errors and upgrade functions once the end system is in users' hands
• or even to implement reconfigurable computing, i.e. using a fixed number of logic gates to time-division-multiplex multiple functions.

Because of all these advantages, FPGAs have been making significant inroads into ASIC territory. Whether FPGAs can replace ASICs depends on how far per-gate cost decreases and gates per device increase. Mapping of applications on FPGAs has been based on the VHDL and Verilog languages for input descriptions. C based approaches are also currently under consideration. The integration of CPUs on FPGAs introduced design flows and tools supporting hardware/software codesign and software development.

A number of companies build FPGAs, including Actel, Altera, Atmel, Lattice Semiconductor, Quicklogic and Xilinx, with Xilinx and Altera currently being the market leaders. In order to differentiate, FPGA vendors have introduced devices to address different intersections of performance, power, integration and cost targets. Some representative FPGA devices are briefly discussed in the following subsections.

1.1 ALTERA Stratix II

Altera claims that Stratix II devices [11] are the industry's fastest and highest density FPGAs. Stratix II devices extend the possibilities of FPGA design, allowing designers to meet the high-performance requirements of today's advanced systems and avoid developing with costly ASICs.

1.1.1 Architecture

The Stratix II architecture has been designed primarily to optimize performance, but also logic density in a given silicon area. Its logic structure is constructed with Altera's new adaptive logic modules (ALMs). The Stratix II architecture significantly reduces the logic resources required to implement any given function and the number of logic levels in a given critical path. The architecture accomplishes this by permitting inputs to be
shared by adjacent look-up tables in the same ALM. Multiple, independent functions can also be packed into a single ALM, further reducing interconnect delays and logic resource requirements. The structure of a Stratix II ALM is shown in Figure 3-1.

Stratix II FPGAs utilize the TriMatrix memory structure. TriMatrix memory includes the 512-bit M512 blocks, the 4-Kbit M4K blocks, and the 512-Kbit M-RAM blocks, each of which can be configured to support a wide range of features. Each embedded RAM block in the TriMatrix memory structure targets a different class of applications: the M512 blocks can be used for small functions such as first-in first-out (FIFO) applications, the M4K blocks can be used to store incoming data from multi-channel I/O protocols, and the M-RAM blocks can be used for storage-intensive applications such as Internet protocol packet buffering or program/data memory for an on-chip Nios embedded processor. All memory blocks include extra parity bits for error control, embedded shift register functionality, mixed-width mode, and mixed-clock mode support. Additionally, the M4K and M-RAM blocks support true dual-port mode and byte masking for advanced write operations.

Figure 3-1. Stratix II adaptive logic module structure

Stratix II DSP blocks are optimized to implement processing intensive functions such as filtering, transforms, and modulation. Capable of running at 370 MHz, Stratix II DSP blocks provide maximum DSP throughput (up to 284 GMACs) that is orders of magnitude higher than leading-edge digital signal processors available today. Each DSP block can support a variety of multiplier bit sizes (9x9, 18x18, 36x36) and operation modes (multiplication, complex multiplication, multiply-accumulate and multiply-add) and can generate DSP throughput of 3.0 GMACS per DSP block. In addition, rounding and saturation support has been added to the DSP block.
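The throughput figures quoted above can be related to each other with a simple peak-rate estimate, sketched below. Only the 370 MHz clock, the 3.0 GMACS per block and the 284 GMACS device figure come from the text; the count of eight 9x9 multipliers per DSP block and of 96 DSP blocks in the largest device are assumptions chosen because they reproduce those figures, and the function name is hypothetical.

```python
def peak_gmacs(mac_units, clock_mhz):
    """Peak multiply-accumulate rate in GMACS, assuming one MAC per unit per cycle."""
    return mac_units * clock_mhz / 1000.0


CLOCK_MHZ = 370        # DSP block clock quoted in the text
MACS_PER_BLOCK = 8     # assumption: eight 9x9 multipliers per DSP block
DSP_BLOCKS = 96        # assumption: DSP block count of the largest device

per_block = peak_gmacs(MACS_PER_BLOCK, CLOCK_MHZ)
whole_device = peak_gmacs(MACS_PER_BLOCK * DSP_BLOCKS, CLOCK_MHZ)

print(f"per DSP block: {per_block:.2f} GMACS")      # close to the quoted 3.0
print(f"whole device:  {whole_device:.2f} GMACS")   # close to the quoted 284
```

These are peak numbers: they assume every multiplier is busy on every cycle in its narrowest (9x9) mode, so wider operand modes or partially used blocks would give proportionally less.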
Stratix II FPGAs support many high-speed I/O standards and high-speed interfaces such as 10 Gigabit Ethernet (XSBI), SFI-4, SPI 4.2, HyperTransport™, RapidIO™, and UTOPIA Level 4 interfaces at up to 1 Gbps. These allow interfacing with anything from backplanes, host processors, buses and memory devices to 3D graphics controllers.

Stratix II devices support internal clock frequency rates of up to 500 MHz and typical design performance at over 250 MHz. Logic densities of Stratix II devices range from 15,600 to 179,400 equivalent logic elements. Total memory densities can be up to 9 Mbits of RAM, which can be clocked at a maximum clock speed of 370 MHz. Stratix II FPGAs may include up to 12 PLLs and up to 48 system clocks per device.

1.1.2 Granularity

The Stratix II architecture is a fine grain architecture with embedded hardwired word level modules.

1.1.3 Technology

Stratix II FPGAs are manufactured on 300-mm wafers using TSMC's 90-nm, 1.2-V, all-layer copper SRAM, low-k dielectric process technology.

1.1.4 Reconfiguration

Stratix II devices are configured at system power-up with data stored in an Altera configuration device or provided by an external controller. The Stratix II device's optimized interface allows microprocessors to configure it serially or in parallel, and synchronously or asynchronously. The interface also enables microprocessors to treat Stratix II devices as memory and configure them by writing to a virtual memory location, making reconfiguration easy. Remote system upgrades can be transmitted through any communications network to Stratix II devices.

1.1.5 Other issues

• Nios embedded processors allow designers to integrate embedded processors on Stratix II devices for complete system-on-a-programmable-chip (SOPC) designs. The Nios soft embedded processor has been optimized for the advanced architectural features of the Stratix II device family.
• The Stratix II family enables design security through non-volatile, 128-bit AES design encryption technology for preventing intellectual property theft.
• A seamless, cost-reduction migration path to low-cost HardCopy structured ASICs exists for Stratix II devices.

1.1.6 Design flow

The design flow for Stratix II FPGAs is based on the Quartus II software for high-density FPGAs, which provides a comprehensive suite of synthesis, optimization, and verification tools in a single, unified design environment. Quartus II includes an integrated development environment for Nios II embedded processors. Using the SOPC Builder design tool in the Quartus II software, designers select from a wide array of IP components, including memory, interface, control, and user-created functions, customize them for the particular application, and connect them, automatically generating hardware, software, and simulation models for the custom implementation.

1.1.7 Application area

Stratix II FPGAs are very flexible, allowing the realization of different applications. Due to their high memory density, Stratix II devices are an ideal choice for memory intensive applications. Using DSP blocks, Stratix II FPGAs can easily meet the DSP throughput requirements of emerging standards and protocols such as JPEG2000, MPEG-4, 802.11x, code-division multiple access 2000 (CDMA2000), HSDPA and W-CDMA.

1.2 ALTERA Cyclone II

Cyclone II FPGAs [3] have been designed from the ground up for the lowest cost. The Cyclone II FPGA family offers a customer-defined feature set, high performance and low power consumption combined with high density. Altera claims that Cyclone II FPGAs offer the lowest cost per logic element among all commercially available devices and thus can support complex digital systems on a single chip at a cost that rivals that of ASICs.

1.2.1 Architecture

Cyclone II devices contain a two-dimensional row- and column-based architecture to implement custom logic. Column and row interconnects of varying speeds provide signal interconnects between logic array blocks (LABs), embedded memory blocks and embedded multipliers. The logic
array consists of LABs, with 16 logic elements (LEs) in each LAB. A logic element (LE) is a small unit of logic providing efficient implementation of user logic functions. LABs are grouped into rows and columns across the device. The smallest unit of logic in the Cyclone II architecture, the LE, is compact and provides advanced features with efficient logic utilization. Each LE features:
• a four-input look-up table (LUT), which is a function generator that can implement any function of four variables,
• a programmable register,
• a carry chain connection,
• a register chain connection,
• and the ability to drive all types of interconnects.

Each LE operates either in normal or in arithmetic mode (each one using LE resources differently). The architecture of the LE is shown in Figure 3-2.

Figure 3-2. Cyclone II logic element structure

The Cyclone II embedded memory consists of columns of M4K memory blocks. The M4K memory blocks include input registers that synchronize writes, and output registers to pipeline designs and improve system performance. Each M4K block can implement various types of memory with or without parity, including true dual-port, simple dual-port, and single-port
RAM, ROM, and first-in first-out (FIFO) buffers. Each M4K block has a size of 4,608 RAM bits.

Cyclone II devices have up to 150 embedded multiplier blocks optimized for multiplier-intensive digital signal processing (DSP) functions. Designers can use the embedded multiplier either as one 18-bit multiplier or as two independent 9-bit multipliers. Embedded multipliers can operate at up to 250 MHz (for the fastest speed grade) for 18×18 and 9×9 multiplications when using both input and output registers. Each Cyclone II device has one to three columns of embedded multipliers that efficiently implement multiplication functions.

Cyclone II devices support differential and single-ended I/O standards, including LVDS at data rates up to 805 megabits per second (Mbps) for the receiver and 622 Mbps for the transmitter, and 64-bit, 66-MHz PCI and PCI-X for interfacing with processors and ASSP and ASIC devices.

Cyclone II devices range in density from 4,608 to 68,416 LEs. Cyclone II devices offer from 119 to 1,152 Kbits of embedded memory with a maximum clock speed of 250 MHz. Cyclone II devices provide a global clock network and up to four phase locked loops (PLLs). The global clock network consists of up to 16 global clock lines that drive throughout the entire device.

1.2.2 Granularity

The Cyclone II architecture is a fine grain architecture with embedded hardwired word level modules.

1.2.3 Technology

Cyclone II devices are manufactured on 300-mm wafers using TSMC's 90-nm, 1.2-V, all-layer copper SRAM, low-k dielectric process technology, the same proven process used with Altera's Stratix II devices.

1.2.4 Reconfiguration

Cyclone II FPGAs are statically reconfigurable. Cyclone II devices are configured at system power-up with data stored in an Altera configuration device or provided by a system controller. Serial configuration allows configuration times of 100 ms. After a Cyclone II device has been configured, it can be reconfigured in-circuit by resetting the device and loading new configuration data.
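The embedded memory figures in Section 1.2.1 can be related by simple arithmetic, sketched below. The 4,608-bit block size and the 119 to 1,152 Kbit range come from the text; the split into 4,096 data bits plus 512 parity bits and the reading of 1 Kbit as 1,000 bits are assumptions, so the resulting block counts are estimates rather than datasheet values.

```python
M4K_BITS = 4608                       # block size quoted in the text
DATA_BITS = 4096                      # assumption: 4 Kbit of data per block
PARITY_BITS = M4K_BITS - DATA_BITS    # remaining bits, assumed to hold parity


def blocks_for(total_kbits):
    """Smallest number of M4K blocks providing total_kbits (1 Kbit = 1000 bits assumed)."""
    total_bits = total_kbits * 1000
    return -((-total_bits) // M4K_BITS)   # integer ceiling


for kbits in (119, 1152):
    n = blocks_for(kbits)
    print(f"{kbits} Kbits -> about {n} M4K blocks ({n * M4K_BITS} bits)")
```

Under these assumptions the smallest and largest family members would carry about 26 and 250 M4K blocks respectively.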
1.2.5 Other issues

The Cyclone II FPGA family is fully supported by Altera's recently introduced Nios II family of soft processors. A Nios II design in a Cyclone II FPGA offers more than 100 DMIPs performance. With a Nios II processor, a designer can build a complete system on a programmable chip (SOPC) on any Cyclone II device, providing new alternatives to low- and mid-density ASICs.

1.2.6 Design flow

All Cyclone II devices are supported by the no-cost Quartus II Web Edition software. Quartus II software provides a comprehensive suite of synthesis, optimization and verification tools in a single, unified design environment. Designers can select from a large portfolio of intellectual property (IP) cores and download Altera's unique OpenCore Plus version of the chosen core(s). The Quartus II software is used to integrate and evaluate the cores in Cyclone II devices. Quartus II includes an integrated development environment for Nios II embedded processors.

1.2.7 Application area

Cyclone II FPGAs are ideal for cost sensitive applications.

1.3 Xilinx Virtex 4

The Virtex-4 family [12] is the newest generation FPGA from Xilinx. Virtex-4 FPGAs include three families (platforms): LX, FX and SX. Choice and feature combinations are offered for all complex applications. The basic Virtex-4 building blocks are an enhancement of those found in the popular Virtex devices, allowing upward compatibility of existing designs. Combining a wide variety of flexible features, the Virtex-4 family enhances programmable logic design capabilities and is a powerful alternative to ASIC technology.

1.3.1 Architecture

The configurable logic block (CLB) resource of Xilinx Virtex 4 is made up of four slices. Each slice is equivalent and contains: two function generators, two storage elements, arithmetic logic gates, large multiplexers, a fast carry look-ahead chain and a horizontal cascade chain. The function generators are configurable as 4-input look-up tables (LUTs). Two slices in a
CLB can have their LUTs configured as 16-bit shift registers, or as 16-bit distributed RAM. In addition, the two storage elements are either edge-triggered D-type flip-flops or level sensitive latches. Each CLB has internal fast interconnect and connects to a switch matrix to access general routing resources.

The general routing matrix (GRM) provides an array of routing switches between each component. Each programmable element is tied to a switch matrix, allowing multiple connections to the general routing matrix. The overall programmable interconnection is hierarchical and designed to support high-speed designs. All programmable elements, including the routing resources, are controlled by values stored in static memory cells. These values are loaded in the memory cells during configuration and can be reloaded to change the functions of the programmable elements.

The block RAM resources are 18 Kbit true dual-port RAM blocks, programmable from 16K x 1 to 512 x 36, in various depth and width configurations. Each port is totally synchronous and independent, offering three "read-during-write" modes. Block RAM is cascadable to implement large embedded storage blocks. Additionally, back-end pipeline registers, clock control circuitry, built-in FIFO support and byte write enable are new features supported in the Virtex-4 FPGA.

The XtremeDSP slices contain a dedicated 18 x 18-bit 2's complement signed multiplier, adder logic and a 48-bit accumulator. Each multiplier or accumulator can be used independently. These blocks are designed to implement extremely efficient and high-speed DSP applications.

Most popular and leading-edge I/O standards (both single ended and differential) are supported by programmable I/O blocks (IOBs). Larger devices include a 10-bit, 200 kSPS analog-to-digital converter in the built-in system monitor block. Additionally, FX devices support integrated hardwired high-speed serial transceivers that enable data rates up to 11.1 Gb/s per channel, and 10/100/1000 Ethernet media-access control (EMAC) cores.

Virtex 4 FX devices support one or two hardwired IBM PowerPC 405 RISC CPUs (up to 450 MHz) with the auxiliary processor unit interface, which allows optimized FPGA based coprocessor connection. The PowerPC 405 CPU is based on a 32-bit Harvard architecture with a five-stage execution pipeline supporting a CoreConnect bus architecture. Instruction and data L1 caches of 16 KB each are integrated.

Virtex 4 devices achieve clock rates of 500 MHz. Virtex 4 devices have logic densities of up to 200000 logic cells. Memory densities of up to 9935 kbits for block RAM and up to 1392 kbits of distributed RAM are supported. Up to 512 DSP slices may be included, leading to a DSP performance of 256 GMACs.

1.3.2 Granularity

The Virtex 4 architecture is a fine grain architecture with embedded hardwired word level modules and complete PowerPC CPUs.

1.3.3 Technology

Virtex-4 devices are produced on a state-of-the-art 90 nm triple oxide (for low power consumption) copper process, using 300 mm (12 inch) wafer technology. The core voltage of the devices is 1.2 V.

1.3.4 Reconfiguration

Virtex 4 FPGAs are dynamically (partially) reconfigurable devices.

1.3.5 Other issues

Optional 256-bit AES decryption is supported on-chip (with software bitstream encryption), providing Intellectual Property security.

1.3.6 Design flow

The Xilinx ISE development system is used to map applications on the logic part of Virtex 4 devices. Advanced verification and real-time debugging is offered by the ChipScope Pro tools. More than 200 pre-verified IP cores are available for Virtex 4 devices. The EDK PowerPC development kit is used for the realization of functionality on PowerPC CPUs.

1.3.7 Application area

Virtex-4 LX FPGAs are suitable for high-performance logic applications. Virtex-4 FX devices are well suited as a high-performance, full-featured solution for embedded platform applications. Virtex-4 SX devices are a good solution for high-performance Digital Signal Processing (DSP) applications.

1.4 Xilinx Spartan-3

The Spartan-3 family of Field-Programmable Gate Arrays [10] is specifically designed to meet the needs of high volume, cost-sensitive consumer electronic applications. The Spartan-3 family builds on the success
of the earlier Spartan-IIE family by increasing the amount of resources, the use of the state-of-the-art Virtex-II technology and the advanced process technology.

1.4.1 Architecture
Each Configurable Logic Block (CLB) comprises four interconnected slices, as shown in Figure 3-3. These slices are grouped in pairs. Each pair is organized as a column with an independent carry chain. All four slices have the following elements in common: two logic function generators, two storage elements, wide-function multiplexers, carry logic, and arithmetic gates. Both the left-hand and right-hand slice pairs use these elements to provide logic, arithmetic, and ROM functions. Besides these, the left-hand pair supports two additional functions: storing data using Distributed RAM and shifting data with 16-bit registers. The RAM-based function generator (Look-Up Table) is the main resource for implementing logic functions.

Figure 3-3. Spartan-3 CLB structure

Spartan-3 devices support block RAM, which is organized as configurable, synchronous 18 Kbit blocks. Block RAM efficiently stores relatively large amounts of data. The aspect ratio, i.e., width vs. depth, of each block RAM is configurable. Furthermore, multiple blocks can be cascaded to create still wider and/or deeper memories. The blocks of RAM are equally distributed in 1 to 4 columns.
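The effect of the configurable aspect ratio and of cascading can be illustrated with a small sketch. It assumes the aspect ratios usually listed for an 18 Kbit block (16K x 1 through 512 x 36, where the x9, x18 and x36 widths include parity bits); the function name and the simple tiling strategy are illustrative only and do not describe how the vendor tools allocate memory.

```python
import math

# (depth, width) options assumed for a single 18 Kbit block RAM.
ASPECT_RATIOS = [(16384, 1), (8192, 2), (4096, 4),
                 (2048, 9), (1024, 18), (512, 36)]

def blocks_needed(depth, width):
    """Return (block_count, (block_depth, block_width)) for the cheapest tiling.

    Blocks are cascaded in depth and placed side by side in width; every
    block in the tiling uses the same aspect ratio (a simplifying assumption).
    """
    best = None
    for d, w in ASPECT_RATIOS:
        count = math.ceil(depth / d) * math.ceil(width / w)
        if best is None or count < best[0]:
            best = (count, (d, w))
    return best

# A 1K x 72 memory can be built from four blocks configured as 1K x 18.
print(blocks_needed(1024, 72))
```

Under these assumptions a memory that is wider than the widest single-block option simply places blocks side by side, while a deeper one cascades them, which is the "wider and/or deeper" behaviour described above.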
There are four kinds of interconnect: Long lines, Hex lines, Double lines, and Direct lines. Long lines connect to one out of every six CLBs; hex lines connect one out of every three CLBs; double lines connect to every other CLB. Direct lines afford any CLB direct access to neighboring CLBs.

Spartan-3 devices provide embedded multipliers that accept two 18-bit words as inputs to produce a 36-bit product. The input buses to the multiplier accept data in two’s-complement form (either 18-bit signed or 17-bit unsigned). One such multiplier is matched to each block RAM on the die. The close physical proximity of the two ensures efficient data handling. Cascading multipliers permits multiplicands more than three in number as well as wider than 18 bits. Two multiplier versions are possible: one asynchronous and one with registered output.

Spartan-3 devices have logic densities of up to 74880 logic cells (corresponding to 5 million system gates). A system clock rate of up to 326 MHz is supported. Memory densities range from 72 to 1872 kbits of block RAM and 12 to 520 kbits of distributed RAM. The number of hardwired multipliers can be up to 104. Spartan-3 devices include up to 784 I/O pins with a 622 Mb/s data transfer rate per I/O. Seventeen single-ended signal standards and seven differential signal standards including LVDS are supported.

1.4.2 Granularity
Spartan-3 architecture is a fine grain architecture with embedded hardwired word level modules.

1.4.3 Technology
Spartan-3 FPGAs are manufactured on a 90 nm process technology. Three power rails are included in the devices: for core (1.2 V), I/Os (1.2 V to 3.3 V) and auxiliary purposes (2.5 V).

1.4.4 Reconfiguration
Spartan-3 FPGAs are dynamically (partially) reconfigurable devices.

1.4.5 Other issues
Spartan-3 devices allow integration of the MicroBlaze soft processor, PCI, and other cores.
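Returning to the embedded multipliers of Section 1.4.1, the remark that cascading supports operands wider than 18 bits can be made concrete with a short sketch. It models one multiplier as an 18 x 18 signed multiply and builds a 35 x 35 signed product from four of them, splitting each operand into an 18-bit signed high part and a 17-bit unsigned low part. The decomposition is a standard arithmetic identity; the function names are hypothetical and the sketch says nothing about how the partial products are actually wired or pipelined on the device.

```python
def mul18_signed(a, b):
    """Model of one embedded multiplier: 18-bit signed inputs, 36-bit product."""
    assert -(1 << 17) <= a < (1 << 17)
    assert -(1 << 17) <= b < (1 << 17)
    return a * b

def split35(x):
    """Split a 35-bit signed value into (18-bit signed high, 17-bit unsigned low)."""
    assert -(1 << 34) <= x < (1 << 34)
    lo = x & ((1 << 17) - 1)   # 17-bit unsigned, fits a signed 18-bit input
    hi = x >> 17               # arithmetic shift keeps the sign
    return hi, lo

def mul35_signed(a, b):
    """35 x 35 signed multiply assembled from four 18 x 18 partial products."""
    a_hi, a_lo = split35(a)
    b_hi, b_lo = split35(b)
    return ((mul18_signed(a_hi, b_hi) << 34)
            + ((mul18_signed(a_hi, b_lo) + mul18_signed(a_lo, b_hi)) << 17)
            + mul18_signed(a_lo, b_lo))

print(mul35_signed(123456789, -987654321) == 123456789 * -987654321)
```

The 17-bit unsigned input form mentioned in the text is what makes the low halves usable: an unsigned 17-bit value is always a valid non-negative 18-bit signed input.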
1.4.6 Design flow
Implementation of applications on Spartan-3 devices is fully supported by the Xilinx ISE development system, which includes tools for synthesis, mapping, placement and routing. The EDK MicroBlaze development kit is used for the realization of functionality on MicroBlaze cores.

1.4.7 Application area
Because of their low cost, Spartan-3 FPGAs are ideally suited to a wide range of consumer electronics applications, including broadband access, home networking, display/projection and digital television equipment.

2. INTEGRATED CIRCUIT DEVICES WITH EMBEDDED RECONFIGURABLE RESOURCES
Integrated circuits with embedded reconfigurable resources represent an alternative to FPGA ICs. These architectures are in principle based on a combination of a programmable CPU and a reconfigurable array of word level (coarse grain) data path units. Such devices mainly target DSP applications and are competitors of conventional DSP instruction set processors as well. The technology is less mature than FPGAs; however, it promises important advantages over FPGAs such as power and silicon area efficiency. The major issue is the efficient compilation on the coarse grain reconfigurable resources.

2.1 ATMEL Field Programmable System Level Integrated Circuits (FPSLICs)
Atmel’s AT94 Series of Field Programmable System-Level Integrated Circuits (FPSLICs) [2] are combinations of the Atmel AT40K SRAM FPGAs and the Atmel AVR 8-bit RISC microcontroller with standard peripherals.

2.1.1 Architecture
The architecture of the AT94K family is shown in Figure 3-4. The embedded AVR core is based on an enhanced, C code optimized, RISC architecture that combines a rich instruction set (more than 120 instructions) with 32 general-purpose working registers. All 32 registers are directly connected to
the ALU, allowing two independent registers to be accessed in one single instruction executed in one cycle. AVR includes the full complement of peripherals such as SPI, UART, timer/counters and a hardware multiplier. SRAM delivers one-cycle operation at up to 40 MHz, which translates into about 30 MIPS for the AVR’s pipelined RISC design. For flexibility, the 36 KB of dynamically allocated AVR SRAM can be partitioned between x16 program store and x8 data RAM. For example, one setup might dedicate 20 and 16 KB for program and data respectively, another 32 and 4 KB.

Figure 3-4. Atmel FPSLIC AT94K Architecture

The AVR core and FPGA connection is based on a simple approach that treats the FPGA much like another onboard 8-bit peripheral. There is an address decoder for generating up to 16 pseudo chip selects into the FPGA and, going the other way, 16 interrupt lines that are fed from the FPGA into the AVR. The MCU has access to the FPGA’s eight global clocks and can drive two of them relying on its own combination of internal and external oscillators, clock dividers, timer/counters and so on. The FPGA core is based on a high-performance DSP optimized cell. FPSLIC devices include 5,000 to 40,000 gates of SRAM-based AT40K FPGA and 2-18.4 Kbits of distributed single/dual port FPGA user SRAM.
2.1.2 Granularity
The architecture of AT94 devices represents a fine-grained architecture as far as programmable logic is concerned.

2.1.3 Technology
FPSLIC devices are fabricated on a high-performance, low-power, 3.0 V to 3.6 V, 0.35 µm CMOS five-layer metal process.

2.1.4 Reconfiguration
The AT40K SRAM FPGA family is capable of implementing Cache Logic (dynamic full/partial logic reconfiguration, without loss of data, on-the-fly) for building adaptive logic and systems. As new logic functions are required, they can be loaded into the logic cache without losing the data already there or disrupting the operation of the rest of the chip, replacing or complementing the active logic.

Figure 3-5. System Designer design flow

2.1.5 Design flow
Atmel provides the System Designer tool suite (see Figure 3-5) that coordinates microcontroller and FPGA development with source-level debug and full hardware visibility. For implementation, the package includes place-and-route, floor planning, macro generators and bitstream utilities.

2.1.6 Application area
Atmel's AT94K series FPSLIC device provides the logic, processing, control, memory and I/O functions required for low-power, high-performance applications including, among others, PDA and cell phone after-market products, GPS, portable test equipment, point-of-sale and security or wireless Internet appliances.

2.2 QuickSilver ADAPT2000 Adaptive Computing Machine System IC Platform
QuickSilver Technology's Adapt2000 system platform [1], based on adaptive computing technology, attempts to integrate the silicon capability of ASIC, DSP, FPGA and microprocessor technologies within a single IC, an Adaptive Computing Machine (ACM). The Adapt2000 platform aims at achieving custom-silicon capability designed in software (in weeks or months instead of years) with faster time to market, reduced development costs and the ability for designers to focus on innovating and developing IP. The Adapt2000 ACM system platform comprises the Adapt2400 ACM architecture, the InSpire Node Control Kernel and the InSpire SDK tool set.

2.2.1 Architecture
The Adapt2400 architecture consists of two major types of components: Nodes and the Matrix Interconnect Network (MIN). A generic view of the Adapt2400 architecture is shown in Figure 3-6. Nodes are the computing resources in the ACM architecture that perform the processing tasks. Nodes are heterogeneous by design, each being optimized for a given class of problems. Each node is self-contained with its own controller, memory, and computational resources. As such, a node is capable of independently executing algorithms that are downloaded in the form of binary files. Nodes are constructed of three basic components: the Node Wrapper, Nodal Memory and the Algorithmic Engine. The Node Wrapper has two major functions: a) to provide a common interface to the MIN for the heterogeneous Algorithmic Engines and b) to make available a common set of services associated with inter-node communication and task management. Each node is nominally equipped with 16 kilobytes of nodal memory organized as four 1k x 32 bit blocks. When building an ACM, memories can be adjusted in size, larger or smaller, to optimize cost or increase the flexibility of a specific node. Each heterogeneous node type is distinguished by its Algorithmic Engine. The computational resources of each node type are closely matched and optimized to satisfy a finite range of algorithms.

Figure 3-6. Generic view of Adapt2400 architecture

There are three classes of nodes in adaptive computing:
• Adaptive nodes support the heavy algorithmic elements that require complex control. They have a high degree of programmability and computational unit adaptability.
• Domain nodes are designed for the really complex pieces of the algorithms. Domain nodes perform at speeds comparable to pure ASIC designs. Their control mechanisms are finite state machines.
• Programmable nodes are designed to support large code bases that do not demand much processing power.

Designers are also able to build their own fully customized Algorithmic Engines and memory structures, and place them inside the Node Wrapper.

The Matrix Interconnect Network (MIN) ties the heterogeneous nodes together, and carries data, configuration binary files, and control information between ACM nodes, as well as between nodes and the outside world. This network is hierarchical in structure, providing high bandwidth between adjacent nodes for close coupling of related algorithms, while facilitating
easy scaling of the ACM at low silicon overhead. Each connection between blocks within the MIN structure simultaneously supports 32 bits of data payload in each direction. Data within the MIN is transported in single 32-bit word packets, with addressing carried separately. Each 32-bit transfer within the MIN can be routed to any other node or external interface, with the MIN bandwidth fully shared between all the nodes in the system.

An Adapt2400 ACM has a built-in System Controller connected to the MIN Root. The System Controller is responsible for the management of tasks within an ACM. In this role, the System Controller sets up the individual Node Hardware Task Managers (HTMs), and once set up, the HTMs are given control of the tasks on the node without the need for intervention by the System Controller to perform a task swap.

2.2.2 Granularity
Adapt2400 architecture is a (task level) coarse grain architecture.

2.2.3 Technology
ADAPT2000 platform instances have been realized on 0.13 µm technologies.

2.2.4 Reconfiguration
The Adapt2400 ACM architecture dynamically reconfigures during operation. ACM nodes are configured/programmed using a binary file (SilverWare), which is much smaller than that of a typical FPGA configuration file, and is comparable to the program size of a DSP or RISC processor. The smaller binary file size, combined with hardware specifically designed to adapt on the fly, allows the function of a node to change in as little as a few clock cycles.

2.2.5 Design flow
The InSpire SDK Tool Set by QuickSilver is a complete development system for the Adapt2400 ACM Architecture that provides a unified design environment that enables realization of an ACM within a single IC. The InSpire SDK comprises the SilverC development language (an ANSI-C derivative), module linker, assembler for each node type and the InSpire Simulation Platform, including the ACM Verification SwitchBoard. The latter provides multi-mode verification of ACM designs using any combination of the C Virtual Node (CVN), InSpire Simulation Platform,
InSpire Emulator, and an actual ACM device. The InSpire SDK is completely software-based and supports all phases of development, from high-level system simulation to compiled binaries running on an emulator or target IC. Its Adapt2400 SilverStream Design Flow enables developers to freely express system functionality without the need to consider hardware partitioning, task threading, or memory allocation. The InSpire SDK also enables engineers to create custom Adapt2400 architecture cores in simulation and assemble new nodal combinations for exploring a wide variety of ACM hardware configurations.

Figure 3-7. ACM design flow

The development flow for the Adapt2400 ACM Architecture is based on the use of a dataflow model of the system under development. In this methodology the system is represented in a series of top-down dataflow models that use successive refinement techniques to build up to a final hardware implementation. The ACM SilverStream Design Flow supports the task-based “execute when ready” asynchronous nature of the Adapt2400 ACM Architecture without requiring expert hardware knowledge on the part of the developer. The design flow consists of up to six steps as shown in Figure 3-7:
• The first step consists of: (a) modeling the dataflow of the system under development by using SilverC to define tasks, and pipes between the tasks, (b) assigning a cycle budget to each task and (c) simulating the data throughput of the system.
• The second step is to define the function of each task using ANSI-C, and then to verify the behavioral integrity of the system using C Virtual Nodes (CVN).
• The third step is node type and node instance assignment.
• The fourth step is hardware optimization with node verification using the node-type compilers or assemblers, and the appropriate node simulators. Step four provides an I/O accurate model of the system operation. Each node can be simulated using the ACM Verification SwitchBoard. This module in the InSpire Simulation Platform allows developers to model the hardware system as CVNs on the InSpire Adapt2400 Platform Emulator, InSpire Development Board, or a target device. Any of these models can be used in combination or individually at any time.
• The fifth step is run-time optimization, which consists of assignment of multiple tasks to nodes. The InSpire Simulation Platform and Performance Analyzer are used to determine which tasks can be assigned to the same node without affecting system operation. In this step, performance and hardware-size trade-offs can easily be made and analyzed to provide the best fit for system requirements.
• The sixth step is final system simulation and verification using the InSpire Simulation Platform to ensure overall system compliance with design specifications. The final system models contain SystemC APIs for inclusion into ESL modeling environments.

2.2.6 Application area
QuickSilver claims that ACM-enabled devices provide high performance, small silicon area, low power consumption, low cost and architecture flexibility and scalability, the ideal attributes for handheld, mobile and wireless products that span multiple generations. They particularly target signal and image processing applications.

2.3 IPflex DAPDNA-2 processor
The DAPDNA Dynamically Reconfigurable Processor [4] developed by IPFlex Inc. aims at providing “hardware performance” while maintaining “software flexibility”.
2.3.1 Architecture
The DAPDNA-2 dynamically reconfigurable processor is a dual-core processor, comprised of IPFlex's own DAP high-performance RISC core, paired with the DNA two-dimensional processing matrix. The DAPDNA-2 processor can operate at 166 MHz. The DAP RISC core (32 bit with 8 kbytes data cache and 8 kbytes instruction cache) controls the processor's dynamic reconfiguration, while portions of an application that require high-speed processing are handled by the DNA matrix, which provides both parallel and pipelined processing. The DNA matrix is an array of 376 Processing Elements (PEs) comprised of computation units, memory, synchronizers, and counters. The total RAM of the DNA array is 576 kbytes. The DNA matrix circuitry can be reconfigured freely into the structure that best meets the needs of the application in demand. One foreground and three background banks are available on-chip to store different configurations. Additional banks can be loaded from external memory on demand. The architecture of the DAPDNA-2 processor is shown in Figure 3-8.

Figure 3-8. DAPDNA-2 processor architecture

Large on-chip memory reduces the need to access off-chip memory, a process that often becomes a performance bottleneck. This feature allows the DNA to provide the maximum possible parallel processing performance. Since the memory is distributed throughout the processing array, there is plenty of available memory bandwidth.
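The arrangement of one foreground and three background configuration banks, with further configurations fetched from external memory, can be pictured with the toy model below. It is a minimal sketch under stated assumptions: the class and method names are hypothetical, the eviction policy (oldest background bank first) is invented for illustration, and the cost of an external load is an arbitrary placeholder rather than a device figure. Only the bank counts and the idea of a fast foreground/background switch come from the text.

```python
class ConfigBanks:
    """Toy model: one foreground bank plus a fixed number of background banks."""

    def __init__(self, external, background_slots=3, load_cycles=1000):
        self.external = set(external)     # configurations held off-chip
        self.foreground = None            # configuration currently active
        self.background = []              # on-chip but inactive, oldest first
        self.background_slots = background_slots
        self.load_cycles = load_cycles    # assumed cost, not a device figure
        self.cycles = 0                   # running cycle count of the model

    def load(self, name):
        """Copy a configuration from external memory into a background bank."""
        assert name in self.external, "unknown configuration"
        if name == self.foreground or name in self.background:
            return                        # already on-chip, nothing to do
        if len(self.background) == self.background_slots:
            self.background.pop(0)        # evict the oldest background bank
        self.background.append(name)
        self.cycles += self.load_cycles

    def activate(self, name):
        """Swap a configuration into the foreground bank."""
        if name == self.foreground:
            return
        self.load(name)                   # no-op when already in background
        self.background.remove(name)
        if self.foreground is not None:
            self.background.append(self.foreground)
        self.foreground = name
        self.cycles += 1                  # the switch itself modeled as one cycle

banks = ConfigBanks(["fir", "fft", "dct"], load_cycles=100)
banks.activate("fir")      # first use pays the external load
banks.load("fft")          # prefetch while "fir" is running
banks.activate("fft")      # cheap switch, "fir" moves to background
banks.activate("fir")      # cheap switch back
print(banks.foreground, banks.background, banks.cycles)
```

The point of the sketch is the asymmetry: switching between configurations that are already on-chip costs almost nothing, whereas bringing in a new one is comparatively expensive and is best overlapped with useful work.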
The DAPDNA-2 has six channels of DNA Direct I/O, which provides the interface for transferring data directly onto or out of the DNA matrix. Each channel of DNA Direct I/O is 32 bits wide and operates at the maximum DAPDNA-2 system clock frequency of 166 MHz. The DNA Direct I/O can also be used to communicate directly with external devices, bringing data in for processing on the DNA matrix, bypassing the Bus Switch and memory interface.

2.3.2 Granularity
The DNA matrix architecture is a coarse grain reconfigurable architecture.

2.3.3 Technology
The DAPDNA-2 processor comes in a 156-pin FCBGA package. The power supply for the core is 1.2 V while that for the I/Os is 2.5 V.

2.3.4 Reconfiguration
The DAPDNA processor is dynamically reconfigurable and can change its hardware configuration in one clock cycle according to the application on demand.

2.3.5 Design flow
The integrated development environment for the DAPDNA dynamically reconfigurable processor is designed around the concept of “Software to Silicon”. The Software to Silicon concept means that even someone who doesn't know how to design hardware can develop a product by designing an application using a high-level language, and having that application seamlessly implemented as hardware. The DAPDNA processor series is provided with the DAPDNA-FW II Integrated Development Environment, a full-featured tool set covering everything from algorithm design to validation of an application running on the actual hardware. DAPDNA-FW II provides compilers for algorithms written in MATLAB/Simulink and C with data flow extension.

The DAPDNA-FW II environment supports three different design methodologies, giving the designer the flexibility to choose the most appropriate design method. The first option is to use the Data Flow C (DFC) Compiler. In this case it is possible to use the C programming language to directly create code for the dynamically reconfigurable processor. In a
development process built around the DFC compiler, the designer can create code directly using the C programming language, which reduces the development time. The second option is to use the DNA Blockset, which allows algorithm design and verification using MATLAB and Simulink (from The MathWorks Inc). DNA Blockset enables a seamless design flow from algorithm design to implementation in the DAPDNA-2 processor, all within the MATLAB/Simulink environment. The third option is the DNA designer, which is a GUI-based development environment allowing the designer to drag-and-drop representations of the DAPDNA Processing Elements (PEs), supporting graphical construction of processing algorithms.

2.3.6 Application area
IPflex claims that the DAPDNA-2 is the world's first general-purpose dynamically reconfigurable processor. It is suitable for applications that demand high performance and support for a wide range of processing tasks. It also provides a solution that is optimal for today's marketplace, with its demand for short-run, mixed-model production. Target applications include industrial performance image processing (for factory automation, inspection systems), broadcast and medical equipment, high precision high speed image processing (multi-function peripherals, laser printers etc.), base stations (cellular, PHS, etc.), accelerators for image processing, data processing and technical computation, security equipment, encryption accelerators and software defined radio.

2.4 Motorola MRC6011 Reconfigurable fabric device
The MRC6011 device is the first reconfigurable compute fabric (RCF) device from Freescale Semiconductor [7]. It is a highly integrated system on a chip (SoC) that combines six reconfigurable compute fabric (RCF) cores into a homogeneous compute node. The programmable MRC6011 device aims at offering system-level flexibility and scalability similar to a programmable DSP while achieving the cost, power consumption and processing capability of a traditional ASIC-based approach.

2.4.1 Architecture
The MRC6011 RCF cores are accessible in two scalable modules, each containing three RCF cores, via two multiplexed data input (MDI) interfaces and two slave I/O interfaces. Each MDI interface can communicate with up to 12 channels (antennas for example), and each RC controller can manipulate the data from two channels. The data processed by the RCF cores goes either to one of the two slave I/O bus interfaces (compatible with industry-wide DSPs) or to another core within the same module or the adjacent module. External interfaces include the MDI interfaces and slave I/O bus interfaces (supporting DSP bootstrapping) operating at up to 100 MHz, and a JTAG port for real-time debugging. The architecture of the MRC6011 device is shown in Figure 3-9.

Figure 3-9. Architecture of MRC6011 device

Each RCF core includes an optimized 32-bit RISC processor (allowing efficient C code compilation) with instruction (4 kbytes) and data (4 kbytes) caches. The reconfigurable computing (RC) array includes 16 reconfigurable processing units with 16-bit data paths including a pipelined MAC unit. The RCF core also includes a two-channel input buffer (8 kbytes), a large frame buffer (40 kbytes) with eight address generation units (AGUs), a special-purpose complex correlation unit that supports spreading, complex scrambling and complex correlation on 8-bit and 4-bit samples, and a single and burst transfer DMA controller. At 250 MHz, the six-core MRC6011 device delivers a peak performance of 24.0 Giga complex correlations per second with a sample resolution of 8 bits for I and Q inputs each, or even 48.0 Giga complex correlations per second at 4-bit resolution.
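The peak figures quoted above can be related to the per-core resources with a few lines of arithmetic. The sketch below only divides the stated peak rates by the stated clock frequency and core count; the function name is hypothetical, and the reading that the result corresponds to one correlation per processing unit per cycle is an interpretation, not something the text states.

```python
CORE_CLOCK_HZ = 250_000_000   # core clock, from the text
NUM_CORES = 6                 # RCF cores per device, from the text

def correlations_per_core_cycle(peak_per_second):
    """Complex correlations each core must complete per clock cycle."""
    return peak_per_second / (CORE_CLOCK_HZ * NUM_CORES)

# 8-bit samples: 24.0 Giga complex correlations per second
print(correlations_per_core_cycle(24_000_000_000))   # 16.0
# 4-bit samples: 48.0 Giga complex correlations per second
print(correlations_per_core_cycle(48_000_000_000))   # 32.0
```

At 8-bit resolution the result, 16 per core per cycle, equals the number of reconfigurable processing units in the RC array, and the 4-bit figure is exactly twice that. This suggests, without confirming, that the peak numbers assume every unit is busy on every cycle.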
2.4.2 Granularity
The architecture of the MRC6011 is a coarse grain architecture based on the word level reconfigurable data paths of the RC arrays.

2.4.3 Technology
MRC6011 devices are manufactured on a 0.13 µm process technology. The internal logic voltage is 1.2 V while the input/output voltage is 3.3 V. The core maximum operating frequency is 250 MHz while the maximum operating frequency for all off-core buses is 100 MHz.

2.4.4 Reconfiguration
MRC6011 is a dynamically reconfigurable multi-context device.

2.4.5 Design flow
The design flow for the MRC6011 is based on C and assembly programming. The CodeWarrior Development Studio for Freescale RCF Baseband Signal Processors is a complete development environment for Freescale Reconfigurable Compute Fabric (RCF) based devices. The CodeWarrior Development Studio is a complete code development studio and includes: a) the Project Manager that provides everything required for configuring and managing complex projects, b) the Editor and Code Navigation System that allows creation and modification of source code and c) the graphical level debuggers.

CodeWarrior Development Studio, in concert with the PowerTAP Pro hardware target interface, provides a multi-core debugging environment that allows for quick single stepping as well as fast downloads of very large target files. In the case of multiple MRC6011 devices, it is possible to connect the JTAG connections in a way that allows talking to any of the MRC6011s through a single PowerTAP device. Since PowerTAP uses Ethernet as its connection method to CodeWarrior, debugging can be done remotely, and a single resource can be shared among several engineers. Functional testing effort can be minimized through utilization of CodeWarrior Development Studio's full scripting capability.

2.4.6 Application area
Highly flexible and programmable, the MRC6011 processor provides an efficient solution for computationally intensive applications, such as
wideband code division multiple access (WCDMA), CDMA2000 and TD-SCDMA baseband processing, including chip rate, symbol rate and advanced 3G functions such as adaptive antenna (AA) and multi-user detection (MUD).

2.5 picoChip PC102 picoArray processor
The PC102 is the 2nd generation of the picoArray highly parallel processing architecture developed by picoChip [9]. picoChip's PC102 picoArray processor is a signal processing device optimised for next generation wireless infrastructure. The solution can be described as a “Software System on Chip” (SSoC): fast enough to replace FPGAs or ASICs but with the flexibility and ease of programming of a processor. The PC102 picoArray processor offers scalability, allowing extremely large systems to be built by connecting dozens of processors.

2.5.1 Architecture
The architecture emphasises ease of design/verification and deterministic performance for embedded signal processing, especially wireless. The picoArray combines hundreds of array elements, each with a versatile 16-bit RISC processor (3-way LIW with Harvard architecture) with local data and program memory, connected by a high-speed interconnect fabric. The architecture is heterogeneous, with four types of element optimised for different tasks such as DSP or wireless specific functions. As well as the standard array elements, others handle control functions, memory intensive and DSP-oriented operations. Multiple array elements can be programmed together as a group to perform particular functions ranging from fast processing such as filters and correlators, through to the most complex control tasks.

Within the picoArray core, array elements are organised in a two dimensional grid, and communicate over a network of 32-bit buses (the picoBus) and programmable bus switches. Array elements are connected to the picoBus by ports. The ports act as nodes on the picoBus and provide a simple interface to the bus based on put and get instructions in the instruction set. The inter-processor communication protocol is based on a time division multiplexing (TDM) scheme, where data transfers between processor ports occur during time slots, scheduled in software, and controlled using the bus switches. The bus switch programming and the scheduling of data transfers are fixed at compile time.

Around the picoArray core are system interface peripherals including a host interface and an SRAM interface. Four high speed I/O interfaces connect to external systems or link picoArray devices together to build scalable systems. The basic concept of the picoArray architecture is shown in Figure 3-10.

Figure 3-10. Basic concept of picoArray architecture

The PC102 picoArray has huge processing resources for compute intensive data path. It also has enormous amounts of general-purpose MIPS to handle the ever more complex control operations. The PC102 uses 348 array elements running at 160 MHz, and at peak use can handle over 197,100 million instructions per second (MIPS), 147,800 million operations per second (MOPS) or 38,400 million multiply accumulate (MMAC) instructions per second, over 10 times the performance of other programmable solutions.

The microprocessor interface is used to configure the PC102 device and to transfer data to and from the PC102 device using either a register transfer method or a DMA mechanism. The interface has a number of ports mapped into the external microprocessor memory area. Two ports are connected to the configuration bus within the PC102 and the others are connected to the picoBus. These enable the external microprocessor to communicate with the array elements using signals. Alternatively, the PC102 can self-configure (or boot) in standalone mode from a supported memory.

2.5.2 Granularity
The PC102 processor’s picoArray architecture is a (CPU level) coarse grain reconfigurable architecture based on 16-bit CPUs.
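The compile-time scheduling of picoBus transfers described in Section 2.5.1 can be pictured as filling in a fixed, repeating table of time slots. The sketch below is an illustration of that general idea only: it assumes a single shared bus segment, a frame of a fixed number of slots, and transfers that each need one slot every `period` slots. The function name, the greedy allocation and the data model are all hypothetical and are not taken from the picoChip tools.

```python
def build_tdm_schedule(frame_len, transfers):
    """Statically assign repeating time slots to transfers.

    transfers: list of (name, period) pairs; a transfer with period p
    needs one slot every p slots, and p must divide frame_len.
    Returns a list of length frame_len holding a transfer name or None.
    Raises ValueError when the requested bandwidth cannot be scheduled.
    """
    table = [None] * frame_len
    # Place the most frequent transfers first; they are the hardest to fit.
    for name, period in sorted(transfers, key=lambda t: t[1]):
        if frame_len % period != 0:
            raise ValueError(f"period of {name} must divide the frame length")
        for offset in range(period):
            slots = range(offset, frame_len, period)
            if all(table[s] is None for s in slots):
                for s in slots:
                    table[s] = name
                break
        else:
            raise ValueError(f"no free slots left for {name}")
    return table

# Three transfers sharing an 8-slot frame.
schedule = build_tdm_schedule(8, [("filter_out", 2), ("corr_out", 4), ("ctrl", 8)])
print(schedule)
```

Because the table is computed once and then simply replayed, there is no run-time arbitration, and the latency of every transfer is known in advance. This is the property the text refers to as deterministic performance.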
2.5.3 Reconfiguration
The picoArray architecture is totally programmable and can be configured at run time (single context device).

2.5.4 Technology
PC102 devices have been manufactured on a 0.13 µm process technology. High performance flip chip BGA packages have been used for packaging. The core voltage is 1.2 V while the input/output voltage is 2.5 V.

2.5.5 Design flow
picoChip's picoTools is a fully-integrated homogeneous (over the whole system) development environment for the picoArray, which includes a C compiler, assembler, debugger and cycle-accurate simulator, in which system performance is guaranteed by design (with complete predictability). picoChip also supplies a Library of Example Designs and a range of Development platforms.

The developer defines the structure and relationships between processes, completely specifying signal flows and timings. The individual processors are then programmed in standard C or assembler as blocks to be embedded within the structure. The entire design (structure, data-path and control) is debugged at the source level. This allows engineers to work on the whole system in an integrated way, rather than having to debug different technologies separately. The programming of the array is completely automatic, and the designer is abstracted from these implementation details. The output is a hardware configuration file containing the design and the timing information to run in the simulation. This creates a seamless “closed loop” flow from the simulator to the development kit through to the system.

The picoChip architecture is extremely scalable, and applications can be run across multiple linked devices. The tools allow large designs to be simulated, placed and verified as easily as small ones. The architecture gives high levels of confidence in using multiple pre-verified blocks in a series of static software architectures that can be implemented at different times on the same hardware to give a truly reconfigurable system.

2.5.6 Application area
The PC102 is a communications processor, optimized for high capacity wireless digital signal processing applications. The device enables all layer 1 (physical layer) signal processing and layer 1 control to be implemented in
software. The device is able to run any wireless protocols including WCDMA (FDD and TDD), cdma2000 and TD-SCDMA, or emerging standards such as 802.16 (WiMAX).

2.6 Leopard Logic Gladiator Configurable Logic Device
The Gladiator configurable logic device (CLD) [6] family represents the only digital logic device that combines Field Programmable Gate Array (FPGA) technology with hardwired Application Specific Integrated Circuit (ASIC) logic. Gladiator CLD aims at achieving much lower NRE charges than ASICs in combination with dramatically lower unit cost than complex FPGAs. In its first steps Leopard Logic provided embedded FPGA IP cores for ASIC/SoC and foundry suppliers, but the industry’s interest with respect to this approach was limited. Leopard Logic then reinvented itself as a silicon supplier.

2.6.1 Architecture
The architecture of Gladiator CLD is shown in Figure 3-11. The basic building blocks of Gladiator CLD are the HyperBlox FP (Field Programmable) and the MP (Mask Programmable) fabrics, which are combined with optimized memories, Multiply-Accumulate units (MACs) and flexible high-speed I/Os. Gladiator CLD is available in densities ranging from 1.6M up to 25M system gates with up to 10 Mbits of embedded memory. It supports system speeds up to 500 MHz. Gladiator CLD includes high speed MAC units for fast arithmetic and DSP, up to 16 PLL controlled clock domains with frequency synthesis and division, and up to 16 DLLs for phase shifting to support interface timing adjustment. Gladiator CLD offers flexible I/O options and supports several general purpose I/O standards. Gladiator CLD also supports DDR/QDR.
722Chapter3Figure3-11.ArchitectureofGladiatorCLD2.6.2GranularityThearchitectureofGladiatorCLDrepresentsafinegrainarchitecture.2.6.3TechnologyTheHyperBloxFPfabricisbasedonLeopardLogic’sproprietaryHyperRouteFPGAtechnologythatutilizestheindustrysfirstfully’hierarchical,multiplexer-based,point-to-pointinterconnect.Thistechnologyenablessuperiorspeed,utilization,predictabilityandreliabilitycomparedtolegacyFPGAarchitectures.TheHyperBloxMPfabricusesthesamelogiccorecellarchitectureasHyperBloxFPbutreplacestheSRAMconfigurationwithasingle-layervia-maskconfiguration,calledHyperVia.Thistechnologyprovidessignificantlyhigherdensity,aswellasincreasedperformanceandlowerpower.2.6.4ReconfigurationTheGladiatorCLDisstaticallyfield-upgradeablethroughembeddedSRAM-basedFPGA. 3.ReconfigurableHardwareTechnologies732.6.5DesignflowTheGladiatorCLDdesignflowisbasedonleadingindustrystandarddesigntoolsandflowscombinedwithLeopardLogicshighlyoptimized’ToolBloxbackendtools.PartitioningbetweentheHyperBloxMPandFPsectionsofthedeviceisdoneintuitively.FixedandstableblocksofthedesignaremappedintotheHyperBloxMPfabric,whilehigh-riskblocksthatarestillinfluxaremappedintotheFPfabric.DesignsarequicklyandeasilysynthesizedfromRTLintoaCLDdevice.Fulltimingclosureisachievedbasedonaccuratetimingextractionperformedbytheuser.BitstreamsfortheFPGAsectionsofthedevicearegeneratedautomaticallyandcanbedownloadedintothedeviceinstantly.Partitioningbetweenhard(MP)andsoft(FP)functionsisasnapwiththeToolBloxdesignflowandtheunifiedhardwarearchitectureallowstheallocationofdesignblocksevenpost-synthesis.Startingfrompre-processedwafers,userscanimplementsubstantialamountsofhighspeedlogicinthemask-programmable(MP)sectionofthedevice.AftersendingthegeneratedconfigurationdatatoLeopardLogic,firstsamplesaredeliveredwithinweeks.Thisprocessisreferredtoas“marketizationbecauseittransformsthegenericdeviceintoauseror”marketsegmentspecificdevice.Duetominimummaskandprocessingrequirements,theNon-RecurringEngineering(NRE)costsforthisprocessareanorderofm
agnitude lower than for a traditional cell-based ASIC. The marketized devices can be further customized and differentiated by programming the HyperBlox FP fabric. Like any other SRAM-based FPGA, this fabric allows for an unlimited number of reconfigurations by simply downloading a new bitstream into the device, thus offering optimal in-field programmability.

2.6.6 Application area

Gladiator Configurable Logic Device is suitable for areas that today use a combination of Application Specific Standard Product (ASSP)/ASIC with standalone FPGAs, such as networking (edge, access, aggregation, framers, communications controllers, backplane interfaces), storage (bridges, controllers, interfaces, glue logic) and wireless (DSP acceleration, chip rate processing, smart antenna, bridges, backplanes, glue logic). Across all markets, Gladiator is an ideal fit for the fast and cost-effective implementation of flexible format converters, protocol bridges, bus interfaces and glue logic functions.

3. EMBEDDED RECONFIGURABLE CORES

As the System-on-Chip (SoC) world began to develop at the end of the 1990s, it was recognised that, to make the devices more useful, some form of programmable fabric would be needed. ASIC developers also considered embedded reconfigurable logic as one way to bring some form of field programmability to an otherwise dedicated product. The industry responded in an enthusiastic fashion and a number of reconfigurable hardware cores that can be embedded in SoCs/ASICs have been proposed since the late 1990s. Two major architectures have been mainly considered: embedded FPGAs (fine grain) and reconfigurable arrays of word level data paths (coarse grain). Despite the initial enthusiasm, several of these attempts failed commercially (Adaptive Silicon disappeared while Actel stopped their embedded FPGA technology activities). Major reasons were the high silicon area (it could require half the chip area to put a decent amount of programmable logic on it) and the power overheads of embedded FPGAs, and the immature compilation techniques for the coarse grain reconfigurable arrays. In October 2004, during the EDA Tech Forum in San Jose, it was projected that by the first quarter of 2005 two embedded FPGA cores for ASICs/SoCs would be put on the market, one by a combination of IBM and Xilinx and the other by STMicroelectronics. The major reason that could lead these attempts to commercial success is the use of 90 nm technologies.

3.1 Morpho Technologies MS1 Reconfigurable DSP cores

Morpho Technologies reconfigurable DSP (rDSP) cores MS1-16 and MS1-64 [8] aim at providing hardware flexibility in implementing multiple applications, minimized levels of obsolescence, and low power consumption while lowering hardware costs. The cores are available as is, or may be custom designed and/or quickly integrated into any SoC, to fit the needs of the customer and application(s).

3.1.1 Architecture

The MS1 family of rDSPs are fully autonomous IP (soft, firm or hard) cores that function as co-processors to a host processor in a system. The MS1 rDSP architecture consists of a 32-bit RISC with 5 pipeline stages and built-in direct-mapped data and instruction cache, an RC Array with 8 to 64 Reconfigurable Cells (each having an ALU, a MAC and an optional complex correlator unit), a Context memory with 32 to 512 context planes, a Frame Buffer of up to 2048 Kbytes, and three optional blocks specific to 3G-WCDMA base station applications (namely a Sequence Generator, an Interleaver and an IQ Buffer of 16 bytes to 4 Kbytes per antenna). A multi-master 128-bit DMA bus controller supporting burst transfers with both synchronous and asynchronous memory interfaces is also included in the MS1 architecture. The architecture of the RC array is shown in Figure 3-12.

Figure 3-12. Architecture of Reconfigurable Cells array

3.1.2 Granularity

The reconfigurable cells (RC) array of Morpho Technologies rDSP cores is a reconfigurable array of coarse grain data paths.

3.1.3 Technology

Evaluation devices are available in 0.18 µm and 0.13 µm process technologies with core voltages at 1.8 V/1.2 V and 3.3 V digital I/O voltage.

3.1.4 Reconfiguration

Morpho Technologies reconfigurable DSP (rDSP) cores are dynamically reconfigurable and can adapt on the fly to realize different applications. Switching from one application specific set of instructions to another is done in a single clock cycle.
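The effect of keeping several context planes resident in the Context memory can be illustrated with a small behavioural model. The sketch below is only an assumption-laden illustration, not Morpho Technologies code: the class names, the kernel names and the cycle counts of the kernels are hypothetical, and the only figures taken from the text are the minimum of 32 context planes and the single clock cycle needed to switch between preloaded contexts.

```python
class ContextMemory:
    """Toy model of a context memory holding preloaded configuration planes."""

    def __init__(self, num_planes):
        self.num_planes = num_planes
        self.planes = {}

    def load(self, index, configuration):
        # Loading a plane is done ahead of time; its cost is not modelled here.
        if not 0 <= index < self.num_planes:
            raise IndexError("context plane index out of range")
        self.planes[index] = configuration


class RCArrayModel:
    """Toy model of an RC array that executes one context plane at a time."""

    SWITCH_CYCLES = 1  # from the text: switching takes a single clock cycle

    def __init__(self, context_memory):
        self.context_memory = context_memory
        self.active_plane = None
        self.cycles = 0

    def switch_context(self, index):
        if index not in self.context_memory.planes:
            raise KeyError("context plane has not been loaded")
        self.active_plane = index
        self.cycles += self.SWITCH_CYCLES

    def run(self, kernel_cycles):
        if self.active_plane is None:
            raise RuntimeError("no context selected")
        self.cycles += kernel_cycles
        return self.context_memory.planes[self.active_plane]


def demo():
    memory = ContextMemory(num_planes=32)   # smallest size quoted in the text
    memory.load(0, "despreader")            # hypothetical kernel names
    memory.load(1, "fir_filter")
    array = RCArrayModel(memory)
    array.switch_context(0)
    array.run(kernel_cycles=100)            # hypothetical kernel run times
    array.switch_context(1)
    array.run(kernel_cycles=50)
    return array.cycles
```

In this model the two context switches add only two cycles to the 150 cycles of computation, which is the property that makes time-multiplexing of the array between kernels attractive.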
3.1.5 Design flow

The MS1 rDSP cores and associated evaluation devices are accompanied by a complete tool chain that includes software development tools such as a compiler and translator, a simulator and a debug tool. Morpho Technologies developed an extension to the C programming language called "MorphoC", allowing for fast and simple programming of the MS1 rDSP cores. MorphoC is designed to describe the Single Instruction Multiple Data (SIMD) execution model of the MS1 rDSP architecture. MorphoTrans reads the MorphoC program and kernel library mapping information and generates a standard C program that is recognizable by the compiler (gcc). The output of MorphoTrans is compiled and linked with the kernel library object files to generate an executable file. The outcome of this process may be executed in the MorphoSim software simulator and debugged by the debugger (gdb). In addition, the same executable code can also be run on the MS1 development board. MorphoSim provides an environment for behavioral simulation of the MS1 rDSP cores. To make the latest wired, wireless and imaging standards into production application reality, the debugger is used in conjunction with MorphoSim to debug application programs that utilize various kernels supplied by the Morpho Technologies extensive list or from customer specific kernel libraries.

3.1.6 Application area

Morpho Technologies reconfigurable DSP cores are capable of implementing the baseband processing of air interfaces such as WCDMA, in addition to source processing such as MPEG4 and vocoders. In general, Morpho Technologies reconfigurable DSP cores are suitable for signal processing based products including communications equipment for wireless and wireline terminals and infrastructure, home entertainment and computer graphics/image processing.

3.2 PACT XPP IP cores

A PACT XPP processor or coprocessor [13] can be integrated in a System-on-Chip (SoC) and can be designed from a small set of macro blocks, of which the largest is in the range of 90k gates. The homogeneous architecture of XPP allows synthesizing each of the blocks separately and, in a second step, arranging the synthesized blocks hierarchically into the final array.
3.2.1 Architecture

An array of configurable processing elements is the heart of the XPP. The array is built from a very small number of different processing elements (PEs). ALU-PEs perform the basic computations. RAM-PEs are used for storage of data. The I/O elements connect the internal elements to external RAMs or data ports. The configuration manager loads programs onto the array. The architecture of the array is shown in Figure 3-13.

The ALU is a two-input, two-output ALU providing typical DSP functions such as multiplication, addition, comparison, sort, shift and boolean operations. All operations are performed within one clock cycle. The ALU can be utilized for addition, barrel shift and normalization tasks. The Forward Register is a specialized ALU that provides data stream control such as multiplexing and swapping. It always introduces a one-cycle pipeline delay.

The Communication Network allows point-to-point and point-to-multipoint connections from outputs to inputs of other elements. Up to 8 data channels are available for each horizontal direction. Switches at the end of the lines can connect the channel to the channel of the neighboring element.

Figure 3-13. Architecture of XPP's array of configurable processing elements

The RAM Elements are arranged at the edges of the array and are nearly identical to the ALU-PEs; however, the ALU is replaced by a memory. The dual-ported RAM has two separate ports for independent read and write operations. The RAM can be configured to FIFO mode (no address inputs needed) or RAM mode with 9 or more address inputs. The IP model allows the storage capacity to be defined. Typical values range from 512 to 2k words.
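The relation between the number of address inputs and the storage capacity quoted above (9 address inputs give 2^9 = 512 words, 11 give 2k words) and the difference between the two modes of a RAM element can be sketched as follows. This is a minimal behavioural sketch under stated assumptions: the class and method names are hypothetical and do not correspond to any PACT tool or API, and a full FIFO simply raises an error here, whereas the real array would stall the data flow.

```python
from collections import deque


def addressable_words(address_inputs):
    """Number of words reachable with the given number of address inputs."""
    return 1 << address_inputs


class RamPE:
    """Toy model of a RAM element usable either as a RAM or as a FIFO."""

    def __init__(self, address_inputs=9, mode="ram"):
        if mode not in ("ram", "fifo"):
            raise ValueError("mode must be 'ram' or 'fifo'")
        self.mode = mode
        self.capacity = addressable_words(address_inputs)
        self.cells = [0] * self.capacity   # used in RAM mode
        self.queue = deque()               # used in FIFO mode

    # RAM mode: explicit addresses on separate read and write ports
    def write(self, address, word):
        if self.mode != "ram":
            raise RuntimeError("write() requires RAM mode")
        self.cells[address] = word

    def read(self, address):
        if self.mode != "ram":
            raise RuntimeError("read() requires RAM mode")
        return self.cells[address]

    # FIFO mode: no address inputs are needed
    def push(self, word):
        if self.mode != "fifo":
            raise RuntimeError("push() requires FIFO mode")
        if len(self.queue) == self.capacity:
            raise OverflowError("FIFO full")
        self.queue.append(word)

    def pop(self):
        if self.mode != "fifo":
            raise RuntimeError("pop() requires FIFO mode")
        return self.queue.popleft()
```

For example, `RamPE(address_inputs=11)` models a 2k-word element, while `RamPE(mode="fifo")` models a 512-word FIFO whose words come out in arrival order.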
Back Register and Forward Register can be configured to build a linear address generator. Thereby DMA to or from the RAM can be done within one RAM-PE. Several RAM-PEs can be combined into a larger RAM with a contiguous address space.

I/O Elements are connected to horizontal channels. The standard I/O Element provides two modes:
• Streaming: Two ports per I/O Element are configured to input or output mode. The XPP packet handling is performed by a Ready-Acknowledge handshake protocol. Thus external data streams (e.g. from an A/D converter) need not be synchronous to the XPP clock.
• RAM: One output provides the addresses to the external RAM, the other is the bi-directional data port. External Synchronous Static RAMs are directly connected to the address ports, data ports and control signals. The maximum size of external RAMs depends on the data bus width of the XPP (e.g. 16 Mwords for the 24-bit architecture).

The Configuration Manager (CM) microcontroller handles all configuration tasks of the array. Initially it reads configurations through an external interface directly from SRAMs into its internal cache. Then it loads the configuration (i.e. opcodes, routing channels and constants) to the array. As soon as a PE is configured, it starts its operation if data is available. Further on, the CM loads subsequent configurations to the array. The local operating system ensures that the sequential order of configuration is maintained without deadlocks.

The structure of the XPP array of configurable elements is very simple, making the array homogeneous and simplifying programming and placing of algorithms. The IP model of XPP allows defining the size and arrangement of the processing elements according to the needs of the applications. In addition, the width of the data paths and ALUs can be defined between 8 and 32 bit.

XPP is designed to simplify the programming task and to allow high level compilers to tap the full parallel potential of the XPP. The most important XPP feature to support this is the packet handling. Data packets contain one processor word (e.g. 24-bit) and are created at the outputs of objects as soon as data is available. From there, they propagate to the connected inputs. If more than one input is connected to the output, the packet is duplicated. On the other hand, an XPP object starts its calculation only when all required input packets are available. If a packet cannot be processed, the pipeline stalls until the packet is processed. This mechanism ensures correct operation of the algorithm under all circumstances, and the programmer does not need to care about pipeline delays in the array or about how to synchronize to asynchronous external data streams.

3.2.2 Granularity

The PACT XPP array architecture is a coarse grain reconfigurable architecture.

3.2.3 Technology

XPP cores are technology independent. PACT provides XPP cores as synthesizable Verilog RTL code.

3.2.4 Reconfiguration

XPP arrays allow fast dynamic reconfiguration. In contrast to FPGAs, XPP needs only Kbits for a full configuration; internal RAMs buffer data between the configurations. For optimal performance, the amount of data processed in one configuration should be as large as possible, to minimize the effect of the reconfiguration latency. Small parts of the array can be reconfigured without the need to stop calculations of other configurations on the same array.

3.2.5 Design flow

The XDS development suite supports co-development and co-simulation of systems with the XPP array. The XDS is a complete set of tools for application development. Since in most applications XPP is used as a coprocessor to micro-controllers, the XDS provides a seamless design flow for both the micro-controller and the XPP.

Derived from a data flow graph, algorithms are directly mapped onto the array. The graph's nodes directly define the functionality and operation of the ALU or other elements, whereas the edges define the connections between the elements. Such a configuration remains statically on the array and a set of data packets flows through this net of operators.

Applications are written in C or C++. In an environment with a micro-controller and the XPP as coprocessor, the software tasks are divided into two sections. The control-flow tasks are processed with the standard tools for the micro-controller, and the high bandwidth data-flow tasks that need support by the XPP are compiled by the XPP-VC. This vectorizing C compiler maps a subset of C to the XPP, and allows integrating optimized modules. These modules originate from a library, or are written for the application in the Native Mapping Language, NML. API functions for loading and starting of configurations, configuration sequencing, data exchange via DMA and task synchronization provide a comfortable environment for C programmers who are familiar with embedded designs. The linker combines the code of both sections, which can either be simulated by software or uploaded to the target hardware. The integrated debugging tool for the micro-controller and the XPP allows interactive test and verification of the simulation results or the hardware. The configuration and the data flow in the XPP are visualized in a graphical tool.

3.3 Elixent DFA1000

The Elixent DFA1000 accelerator [5] was designed from the ground up to deliver on the promise of Reconfigurable Signal Processing (RSP). Utilizing the advanced D-Fabrix processing array, it aims at delivering huge benefits in performance, power consumption and silicon area. These attributes make it ideal for integration with RISC processors in mobile/consumer/communications applications that need the ultimate in signal or media processing. These advantages are delivered through silicon reuse. The DFA1000 accelerator implements "virtual hardware": hardware accelerators for specific algorithms, implemented as simple configurations on the D-Fabrix processing array. When one algorithm completes, a new "virtual hardware" accelerator is loaded, performing the next task in the system's data flow.

3.3.1 Architecture

The basis for Elixent's DFA1000 is the D-Fabrix processing array, a platform that realises the potential of Reconfigurable Algorithm Processing. The structure of D-Fabrix is simple: the components are 4-bit ALUs, registers and the switchbox. Two of each are combined into a building block, the tile. Hundreds or thousands of tiles are combined to create the D-Fabrix array. Special functions can be distributed through the array; for example, memory is always distributed to give fast, local storage with massive bandwidth. Creating wider execution units is simply a matter of combining ALUs, typically into 8, 12 or 16-bit units, but occasionally into far larger units. Much of the task of linking the ALUs together in this way is performed by the array's routing switchboxes. The architecture of the D-Fabrix array is shown in Figure 3-14.

The DFA1000 accelerator integrates several banks of local high-speed RAM next to the array. These are for often-used data; for example, they may be used as image line stores, or as audio buffers. These RAMs eliminate many high bandwidth accesses off-chip, improving power consumption while at the same time enhancing performance.

Figure 3-14. Architecture of D-Fabrix array

The DFA1000 also includes a peripheral set to facilitate its integration into SoC designs. The architecture offers high-speed data interfaces to the D-Fabrix core array. This allows high-speed data to be driven into the array directly, with low latency and no overhead on the system bus. These high-speed data interfaces are supplemented by the AMBA bus interface, used for programming the array and transferring data to and from the host processor. This is typically a much lower bandwidth control and configuration path. The architecture also integrates local high-speed RAMs, directly accessible by the array or by the RISC, and of course the D-Fabrix array itself.

3.3.2 Granularity

The DFA1000 architecture is a medium granularity architecture based on 4-bit ALUs.

3.3.3 Technology

DFA1000 will be made available in different industry standard processes. The first realization was in a 0.18 µm technology.

3.3.4 Reconfiguration

DFA1000 can be dynamically reconfigured in microseconds.
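The idea from Section 3.3.1 of building wider execution units by combining 4-bit ALUs can be illustrated with a small arithmetic sketch. It assumes, purely for illustration, that neighbouring 4-bit ALUs are chained through a carry signal in ripple fashion; the function names are hypothetical and the actual carry routing inside D-Fabrix is not described in the text.

```python
def alu4_add(a, b, carry_in):
    """Add two 4-bit operands plus a carry; return (4-bit sum, carry out)."""
    total = (a & 0xF) + (b & 0xF) + carry_in
    return total & 0xF, total >> 4


def wide_add(a, b, width_bits):
    """Add two unsigned numbers using a chain of 4-bit ALU slices.

    width_bits must be a multiple of 4 (e.g. 8, 12 or 16, the typical
    widths mentioned in the text). Returns (sum modulo 2**width_bits,
    carry out of the most significant slice).
    """
    if width_bits <= 0 or width_bits % 4 != 0:
        raise ValueError("width_bits must be a positive multiple of 4")
    result = 0
    carry = 0
    for slice_index in range(width_bits // 4):
        shift = 4 * slice_index
        nibble_a = (a >> shift) & 0xF
        nibble_b = (b >> shift) & 0xF
        nibble_sum, carry = alu4_add(nibble_a, nibble_b, carry)
        result |= nibble_sum << shift
    return result, carry
```

A 16-bit addition therefore occupies four 4-bit ALUs, and a 12-bit one only three, which is why a medium granularity fabric can match the data path width to the algorithm instead of always paying for a full word.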
3.3.5 Design flow

The key to using the DFA1000 accelerator is creating the high-performance virtual hardware configurations. D-Sign, the D-Fabrix algorithm processor's toolset, offers three main design styles for this purpose:
• HDL entry, using either Verilog or VHDL
• C-style entry, using Celoxica's Handel-C
• Matlab entry, using Accelchip's Accel-FPGA

All the design entry tools feed a common back-end. This performs optimisations to the code before mapping resources to the D-Fabrix array. The entire process is automatic. Once the array description has been "compiled" for the architecture, it is placed and routed. This stage is analogous to the resource allocation phases that a compiler uses for a VLIW processor, allocating array resource to the functions within the algorithm. The output of the "place and route" tool is the final program.

3.3.6 Application area

D-Fabrix is suitable for several applications from networked multimedia (MPEG-4, JPEG, camera, graphics, rendering) to wireless (3G, CDMA, OFDM etc.) or even security (RSA, DES, AES...).

REFERENCES

1. Adapt2000, QuickSilver Technologies (2004). Available at: http://www.qstech.com/default.htm
2. AT40K, Atmel (2004). Available at: http://www.atmel.com/atmel/products/prod39.htm
3. Cyclone II, Altera (2004). Available at: http://www.altera.com/products/devices/cyclone2/cy2-index.jsp
4. DAPDNA, IPFlex Inc (2004). Available at: http://www.ipflex.com/en
5. DFA1000, Elixent (2004). Available at: http://www.elixent.com/products
6. Gladiator, Leopard Logic (2004). Available at: http://www.leopardlogic.com/products/index.php
7. MRC6011, Freescale (2004). Available at: http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=MRC6011&nodeId=01279LCWs
8. MS1, Morpho Technologies (2004). Available at: http://www.morphotech.com/
9. picoArray, picoChip (2004). Available at: http://www.picochip.com/technology/picoarray
10. Spartan-3, Xilinx (2004). Available at: http://www.xilinx.com/xlnx/xil_prodcat_landingpage.jsp?title=Spartan-3
11. Stratix II, Altera (2004). Available at: http://www.altera.com/products/devices/stratix2/st2-index.jsp
12. Virtex-4, Xilinx (2004). Available at: http://www.xilinx.com/xlnx/xil_prodcat_landingpage.jsp?title=Virtex-4
13. XPP IP cores, PACT (2004). Available at: http://www.pactcorp.com/

PART B
SYSTEM LEVEL DESIGN METHODOLOGY

Chapter 4
DESIGN FLOW FOR RECONFIGURABLE SYSTEMS-ON-CHIP

Konstantinos Masselos (1,2) and Nikolaos S. Voros (1)
(1) INTRACOM S.A., Hellenic Telecommunications and Electronics Industry, Greece
(2) Currently with Imperial College of Science Technology and Medicine, United Kingdom

Abstract: A top down design flow for heterogeneous reconfigurable Systems-on-Chip is presented in this chapter. The design flow covers issues related to system level design down to back end technology dependent design stages. Emphasis is placed on issues related to reconfiguration, especially at the system level, where existing flows do not cover such aspects.

Keywords: Design flow, system level, reconfiguration, reconfigurable Systems-on-Chip

N.S. Voros and K. Masselos (eds.), System Level Design of Reconfigurable Systems-on-Chips, 87-105. © 2005 Springer. Printed in the Netherlands.

1. INTRODUCTION

Heterogeneous Systems-on-Chip (SoCs) with embedded reconfigurable resources form an interesting option for the implementation of wireless communications and multimedia systems. This is because they offer the advantages of reconfigurable hardware combined with the advantages of other architectural styles such as general purpose instruction set processors and application specific integrated circuits (ASICs). Furthermore, such SoCs allow customization of the way reconfigurable resources can be used (type and density of resources) depending on the targeted application or set of applications.

A generic view of a heterogeneous reconfigurable System-on-Chip is shown in Figure 4-1. Such a SoC will normally include instruction set processors (general purpose, DSPs, ASIPs), custom hardware blocks (ASICs) and reconfigurable hardware blocks. The embedded reconfigurable blocks can be either coarse grained (word level granularity) or FPGA like (bit level granularity). The different processing elements may communicate
through a bus; however, current trends are more towards communication networks on chip (for scalability, flexibility and power consumption issues).

[Figure 4-1. Abstract view of targeted implementation platform: instruction set processors, direct mapped hardware (ASIC), fine grain reconfigurable hardware and coarse grain reconfigurable hardware, connected through a communication network with a distributed shared memory organization.]

The design of a SoC with reconfigurable hardware is not a trivial task. To obtain an efficient implementation, an extended design flow is needed in order to cope with the reconfiguration aspects on a wide range of commercially available platforms. In addition, a high abstraction level methodology needs to be developed to help in deciding the instances of the implementation technologies, both for fine grained and coarse grained reconfigurable hardware. The requirements and the principles of such a design methodology are further discussed in the rest of this chapter.

It must be noted that the design flow and high level design methods described in the rest of this chapter can equally be applied to off-the-shelf system level FPGAs that include embedded hardwired blocks (including software processors and ASIC blocks).

2. DESIGN FLOW REQUIREMENTS FOR RECONFIGURABLE SYSTEMS-ON-CHIP

The introduction of reconfigurable resources in Systems-on-Chip creates the need for modifications and extensions to conventional design flows, with emphasis on the higher abstraction levels, where the most important design decisions are made. In this section, conventional system level design flows 
are briefly presented and then system level design flow requirements for reconfigurable Systems-on-Chip are discussed.

2.1 Overview of conventional system level design flows

Driven by the SoC design growth, the demand for system level co-design methodologies is also increasing [6]. Academic and commercial sources have provided co-design methodologies/tools for a variety of application domains, with many hardware/software partitioning opportunities, synthesis, simulation and validation mechanisms, at different degrees of automation and levels of maturity.

As far as system specification is concerned, a variety of languages (HDL, object oriented, proprietary) are being used for system level specification. Some methodologies exploit a combination of languages in order to properly describe the hardware or software parts of the design. The trend is, however, to unify the system design specification in one description language capable of representing the system at a high level of abstraction [6].

The goal of hardware/software partitioning is the optimized distribution of system functions among software and hardware components. In this respect, the most beneficial methodologies are those that provide partitioning at different levels of modeling without the necessity of rewriting the hardware or software specifications. This not only reduces the design iteration steps, but also enables easy inclusion of predefined library elements or IP blocks.

An important feature that should be taken into account during co-synthesis is the possibility of interface synthesis. Different methodologies cover different inter-process communication primitives. These are either fixed by the particular methodology or can optionally be extended with new primitives based on the existing ones.

Co-simulation techniques range from commercial simulation based on methodology specific simulation engines to combinations of multiple simulation engines. Most of the methodology dependent co-simulators are based on event driven simulation, while some of them come with an option for co-simulation with other simulators [8].

Co-verification is mainly simulation based, meaning that the results of the HDL, ISS or proprietary simulations at different levels of the co-design flow are compared, for correct functionality and timing, with the initial specifications. Debugging is enabled in some methodologies by exploiting a graphical tool or a proprietary user interface environment.

The main features of representative system level hardware/software co-design methodologies are summarized in Table 4-1.

As a natural consequence of what has been mentioned in the previous paragraphs, it is concluded that a generic traditional system level design flow usually involves the following key phases:
• System specification
• Hardware/software partitioning and mapping
• Architecture design
• System level (usually bus cycle accurate) simulation and
• Fabrication of hardware and software using tools provided by technology vendors.

Table 4-1. Summary of the main features of system level hardware/software co-design methodologies
(For each methodology the entries are listed by column: System specification; HW/SW partitioning; Co-synthesis; Co-simulation/co-verification; Remark. Empty cells are omitted.)

OCAPI-XL. System specification: using C++ for functionality and architectural properties. HW/SW partitioning: concurrent processes; partitioning on these processes can be made anywhere in the design flow. Co-synthesis: interface synthesis and industrial tools for RTL synthesis. Co-simulation/co-verification: unified co-simulation environment, performance estimation, co-simulation with other simulation engines. Remark: future versions build on top of SystemC.

SystemC. System specification: using SystemC (based on C++) for functionality and architecture. HW/SW partitioning: system specifications can be refined to mixed SW and HW implementations. Co-synthesis: channels, interfaces and events make it possible to model communication and synchronization. Co-simulation/co-verification: simulation engine included, performance estimation. Remark: becoming industry standard.

VCC. System specification: using C/VHDL for architecture and functionality. HW/SW partitioning: template models of architecture where software and hardware are mapped. Co-simulation/co-verification: unified co-simulation environment with emphasis on performance estimation. Remark: perhaps the most complete toolset.

Chess/Checkers. System specification: using the proprietary nML language for processor architecture, C for application. Co-simulation/co-verification: retargetable instruction set simulator simulates the execution of code on the target processor. Remark: useful for the design of embedded processors.

CHINOOK. System specification: using Verilog for functionality and pre-defined components for architecture. HW/SW partitioning: allocate functionality to processors. Co-synthesis: interface synthesis and industrial tools for RTL synthesis. Co-simulation/co-verification: HW/SW co-simulation engine included. Remark: includes interface synthesis but requires tool specific models of processors and buses.

COOL. System specification: using a subset of VHDL for architecture and functionality. HW/SW partitioning: synthesis and compilation tools used to compute the value for the cost metrics; specific algorithms to solve the HW/SW partitioning. Co-synthesis: netlist and controllers for communication between HW and SW generated in VHDL. Co-simulation/co-verification: commercial VHDL simulator to simulate functionality of the system specification and its implementation after co-synthesis. Remark: precise modeling of system cost and performance metrics.

COSYMA. System specification: one processor and Verilog functionality. HW/SW partitioning: allocate all to SW, then move slowest parts to HW. Co-synthesis: interface synthesis, and included tools for RTL synthesis. Co-simulation/co-verification: HW/SW co-simulation engine included. Remark: applicable only to one processor architecture with hardware co-processor.

N2C. System specification: C/C++, SystemC for system level description, extended C for hardware. HW/SW partitioning: manual. Co-synthesis: automatic interface synthesis and industrial tools for RTL synthesis. Co-simulation/co-verification: simulation environment; co-simulation with commercially available instruction set simulators. Remark: support for IP cores.

Esterel. System specification: Esterel language. Co-simulation/co-verification: BDD and temporal logic based verification techniques. Remark: compilation of Esterel programs into FSM, HW or C programs.

LYCOS. System specification: subset of C for SW, subset of VHDL for HW. HW/SW partitioning: different partitioning models and algorithms available. Co-synthesis: HW/SW communication through memory mapped I/O. Remark: experimental co-synthesis environment.

MESH. System specification: textual. HW/SW partitioning: modeling of three independent layers for SW, scheduler/protocol and HW resource. Remark: project in early research phase.

Ptolemy. System specification: many models of computation that can be used in a single design. Co-synthesis: some code-generation tools. Co-simulation/co-verification: powerful co-simulation engine for different models of computation. Remark: features vary with models of computation.

2.2 System level design flow requirements for reconfigurable Systems-on-Chip

The way in which the presence of embedded reconfigurable resources affects the major stages of a system level design flow, and the additional requirements it creates, are discussed in this subsection.

2.2.1 System specification

In the system specification phase, the requirements, restrictions and specifications are gathered in the same way as when no reconfigurable resources are used, but extra effort must be spent on identifying the parts of the applications that are candidates for implementation with reconfigurable hardware.

The incorporation of reconfigurable hardware brings new aspects to the architecture design task and to the partitioning and mapping task. In the architecture design task, a new type of architectural element is introduced. In the architectural design space, the reconfigurable hardware can be viewed as a time-slice scheduled application specific hardware block. One way of incorporating reconfigurable parts into an architecture is to replace some 
hardware accelerators with a single reconfigurable block. The effects of reconfigurable blocks on the area, speed and power consumption should be completely understood before they can be efficiently used.

2.2.2 Hardware/software partitioning and mapping

During this phase, a new dimension is added to the problem. The parts of the targeted system that will be realized on reconfigurable hardware must be identified. There are some rules of thumb that can be followed to give a simple solution to this problem:
• If the application has several roughly same-sized hardware accelerators that are not used at the same time or at their full capacity, a dynamically reconfigurable block may be a more optimized solution than a hardwired logic block.
• If the application has some parts in which specification changes are foreseeable, the implementation choice may be reconfigurable hardware.
• If there are foreseeable plans for new generations of the application, the parts that will change should be implemented with reconfigurable hardware.

Furthermore, for the design of reconfigurable hardware, in addition to the area, speed and power consumption considered in traditional hardware design, the temporal allocation and scheduling problem must also be addressed. This is achieved in a way similar to the policies followed for software tasks running on a single processor. This leads to increased complexity in the design flow, since the cost functions of the functionality implemented with reconfigurable technology include the problems of both hardware and software design.

There are basically two partitioning/mapping approaches supported by the existing commercial design flows: (a) the tool oriented design flow, and (b) the language oriented design flow.

Examples of tool oriented design flows are N2C by CoWare [7] and VCC by Cadence [5]. The design flows supported by these tools work well on traditional hardware/software solutions. Nevertheless, the refinement process of a design from a unified and un-timed model towards RTL is tool specific, and the incorporation of new reconfigurable parts is not possible without unconventional trickery.

Examples of language oriented design flows are OCAPI-XL [12] and SystemC [13]. Especially for the latter, since it promotes the openness of the
language and the standard, the addition of a new domain can be made to the core language itself. However, the mostly preferred method is to model the basic constructs required for modeling and simulation of reconfigurable hardware using basic constructs of the language. In this way, the language compatibility with existing tools and designs is preserved. SystemC extensions for reconfigurable hardware design and OCAPI-XL are thoroughly covered in Chapters 5 and 6 respectively.

2.2.3 Architecture design

A design flow that supports system descriptions at a high abstraction level must also support reconfigurable technologies of different types and vendors. The main question that must be answered, even at the highest level of abstraction, is: What to implement with reconfigurable technology and which reconfigurable technology to use? (A brief introduction to existing reconfigurable hardware technologies is presented in Chapter 3.)

The design flow may answer these questions by using different techniques. First, analysis based tools compile the unified representation of the application functionality and produce information on which parts of the application are never run in parallel. This information can be used to determine what functionality can be implemented in different contexts of a reconfigurable block.

An alternative method is the use of cost functions for each implementation technology. Cost functions help in making quick design decisions using several parameters and optimization criteria at the same time.

Another category of tools uses profiling information gathered in simulations in order to partition the application and to produce a context scheduler to be used in the final implementation. An example of this approach is a toolset for the MorphoSys [14] reconfigurable architecture.

Finally, the most realistic alternative for industrial applications is the simulation based approach. In this approach, the partitioning, mapping and scheduling are accomplished manually by the designer, while the results and the efficiency are verified through simulations. This approach is also the easiest to incorporate into an existing flow, since the required tool support is limited compared to the previous approaches. It also leaves all the design decisions to the designer, which is preferred by many industrially used design flows.

When considering designing additions to a language or a tool that can support modeling and simulation of reconfigurable technologies, a set of parameters that differentiate the implementation technologies needs to be identified: (a) the reconfigurable block capacity in gates, (b) the amount of context memory required to hold configurations, (c) the reconfiguration time and support for partial reconfiguration, (d) typical clock or transaction speed, and (e) power consumption information. The aforementioned parameters are adequate for modeling any type of homogeneous reconfigurable technology. The simulation accuracy resulting from using these parameters is not optimal, but it is sufficient for giving the designer an idea of how each different reconfigurable technology affects the total system performance.

The results needed for steering the design space exploration, and for verifying that the design decisions fulfill the total system performance, are:
• Spatial utilization, which is needed to validate the correct size of the block and also the granularity of the contexts.
• Temporal utilization, which is measured to compare the time spent in configuring the block, waiting for activation and actively doing the computation.
• Context memory bus load, which is measured to analyze the effects of the reconfiguration memory bus traffic on the performance of system buses.
• Area and power consumption, which are compared against a hardware or software implementation.

The aforementioned results should be used as additional information in order to decide which reconfigurable technology to use and which parts of the application will be implemented with it.

When comparing the requirements pertaining to reconfigurability with existing design flows, it can be seen that the existing design flows and tools do not support any of the requirements directly. Either the tools and languages should be improved or company specific modifications are needed [1, 2, 3].

3. THE PROPOSED DESIGN FLOW FOR RECONFIGURABLE SoCs

This section provides the general framework of the proposed design flow for designing complex SoCs that contain reconfigurable parts. The flow aims to improve the design process of a SoC in order to use the available tools in an optimal way [11].

The main idea of the proposed design flow is to identify the parts of a co-design methodology where the inclusion of reconfigurable technologies has the greatest effect. This is very important, since there are no commercial tools or methodologies to support reconfigurable technologies yet.

The design flow is divided into three parts, as shown in Figure 4-2. The System-Level Design (SLD) refers to the high level part of the proposed flow, while the Detailed Design (DD) and Implementation Design (ID) refer to the back end part of the methodology.

[Figure 4-2. The proposed Design Flow. System-Level Design: System Requirements/Specification Capture, Architecture Definition, System Partitioning, Mapping, System-Level Simulation. Detailed Design: Specification Refinement, Hardware Design, Software Design, Reconfigurable Hardware Design, External IP, Integration, Co-Verification. Implementation Design: FPGA/ASIC Implementation Design, Software Implementation Design, Verification, FPGA Downloading/Silicon Manufacturing, Product Qualification.]

Details on the formalisms used are thoroughly covered in Chapters 5 and 6, while Chapters 7, 8 and 9 provide information on how the proposed framework can be applied to the design of real world case studies.

3.1 System Level Design (SLD)

At the SLD phase, the main targets are:
• to develop a specification of the application associated with the requirements captured (and analyzed),
• to design the architecture of the SoC,
• to select major implementation technologies,
• to partition the application for implementation in hardware, software or reconfigurable hardware and,
• to evaluate the performance of the partitioned system.

The requirements are captured and analyzed in the specification phase and the results are fed to the next phases of the design flow. Architecture templates can be used to derive an initial architecture. They can be based on previous versions of the same product, a different product in the same product family, a design/implementation platform provided by the design tool or semiconductor vendor, or even on information about a similar system by a competitor. At the architecture definition phase, bus cycle accurate models of the architectural units are created, so that the performance of the architecture can be evaluated using system level simulations.

In the partitioning phase, the functional model of the application is partitioned into software, hardware and reconfigurable hardware. These partitions are then mapped onto the architecture, annotated with estimations of timing and other characteristics needed in the mapping phase.

At the SLD, the reconfiguration issues emerge in the following forms:
• The goals for reconfiguration (e.g. flexibility for specification changes and performance scalability) with associated constraints are identified at the requirements and specification step.
• At the design space exploration step, the reconfigurable hardware manifests itself as a computing resource in a similar way to an instruction set processor or a block of fixed hardware, thus bringing a new dimension to the design space exploration.

3.2 Detailed Design (DD)

At the DD phase, the specifications are refined and verification is planned according to the targeted implementation technologies, processors etc. The design tools used are fixed according to the selected processors and the chosen reconfigurable and fixed hardware technologies. Additionally, the verification and testing strategy is planned. After this, the individual partitions of hardware, software and reconfigurable hardware are designed and verified. When all parts are finished, the designed modules of hardware, software and reconfigurable hardware are integrated into a single model.

In the co-verification step, the functionality of the integrated model is checked against the reference implementation or the executable specification. Moreover, implementation related issues like timing and power consumption are modeled. If the results are satisfactory, the design is moved to the Implementation Design phase; otherwise iterations to the Detailed Design or even to the System Level Design phase are required.

At the DD, the reconfiguration issues emerge in the following ways:
• At the specification refinement and technology specific design, the reconfigurable hardware requires communication mechanisms to software and/or fixed hardware to be added; in the case of dynamic reconfiguration, mechanisms to handle context multiplexing are also needed.
• The integration and co-verification combines the reconfigurable hardware components with other hardware and software components onto a single platform that also accommodates external IP (e.g. processor, memory and I/O sub-system models) and provides co-verification of the overall design. The reconfigurable hardware is simulated in an HDL simulator or emulated in an FPGA emulator.
• Specific HDL modeling rules need to be followed for multiple dynamically reconfigurable contexts [2, 3].
• The reconfigurable hardware modules must be implemented using the selected technology, including the required control and support functions for reconfiguration.
• In the integration and verification phases, the vendor specific design and simulation/emulation tools must be used.

3.3 Implementation Design (ID)

At the ID, the reconfiguration issues emerge in the following forms:
• Dynamic reconfiguration requires configuration bitstreams of multiple contexts to be managed.
• Specific design rules and constraints must be followed for multiple dynamically reconfigurable contexts [2, 3].

4. RECONFIGURATION ISSUES IN THE PROPOSED DESIGN FLOW

As indicated in the previous section, there are several issues regarding reconfiguration. The next sections emphasize how these aspects are addressed in the context of the proposed design framework. The focus is on system level design issues, although detailed and implementation design aspects are briefly discussed to complete the picture.

4.1 Reconfiguration issues at System Level Design

4.1.1 Needs and Requirements for Reconfiguration

The requirements and specification capture identifies the required functionality, performance, critical physical specifications (e.g. area, power) and the development time required for the system. All the aforementioned characteristics are described in the form of an executable model, where the goals for reconfiguration (e.g. flexibility for specification changes and performance scalability) are identified as well.

In general, simultaneous flexibility and performance requirements form the basic motivation for using reconfiguration in System-on-Chip designs. Otherwise either pure software or fixed hardware solutions could be more competitive. Reconfigurable technologies are a promising solution for adding flexibility, while not sacrificing performance and implementation efficiency. They combine the c
apabilityofpostfabricationfunctionalitymodificationwiththespatial/parallelcomputationstyle.Theinclusionofreconfigurablehardwaretoatelecommunicationsystemmayintroducesignificantadvantagesbothfrommarketandimplementationpointsofview:•Upgradability−Needtoconformtomultipleormigratinginternationalstandards−Emergingimprovementsandenhancementstostandards−Desiretoaddfeaturesandfunctionalitytoexistingequipment−Serviceprovidersarenotsurewhattypesofdataserviceswillgeneraterevenueinthewirelesscommunicationsworld−Introductionofbugfixingcapabilityforhardwaresystems.•Adaptivity−Changingchannel,trafficandapplications−Powersavingmodes.Althoughthereconfigurablehardwareisbeneficialinmanycases,significantoverheadsmayalsobeintroduced.Thesearemainlyrelatedto 1000Chapter4thetimerequiredforthereconfigurationandtothepowerconsumedforreconfiguringasystem.Areaimplicationsarealsointroduced(memoriesstoringconfigurations,circuitsrequiredtocontrolthereconfigurationprocedure).Therequirementscaptureshouldidentifyanddefinethefollowingreconfigurationaspects:•Typeofreconfigurationwantedinthesystem−Staticordynamic(singleormultiplecontexts)−Levelofgranularity(fromcoarsetofine)−Styleofcoupling(fromlooselytocloselycoupled).•Requirementsandconstraintsonsystemproperties(performance,power,cost,etc)•Requirementsandconstraintsondesignmethodology(pre-definedarchitecture,pre-selectedtechnologiesandIPs,tools,etc)Theinformationoutlinedaboveisneededinthelaterstagesofthedesignflow.However,thetechniquesforidentificationofneedsandcaptureofrequirementsarecompanyspecific.4.1.2ExecutableSpecificationThespecificationcaptureissimilartothecaseofsystemsthatemployonlytraditionalhardware.ThefunctionalityofthesystemisdescribedusingaC-likeformalisme.g.SystemC,OCAPI-XL.Theexecutablespecificationcanbeusedforseveralpurposes:•Thetestbenchusedinallphasesofthedesignflowcanbederivedfromtheexecutablespecification.•Thecompilertoolsandprofilinginformationmaybeusedtodeterminewhichpartsofanapplicationaremostsuitableforimplementingwithdynamica
llyreconfigurablehardware.Thisisachievedinthepartitioningphaseofthedesignflow.•Theabilitytoimplementexecutablespecificationvalidatesthatthedesignteamhassufficientexpertiseontheapplication.Executablespecificationisamustinordertobeabletotacklereconfigurabilityissuesatthesystemleveldesign.4.1.3DesignSpaceExplorationThedesignspaceexplorationphaseanalysesthefunctionalblocksoftheexecutablemodelwithrespecttoreconfigurablehardwareimplementations.Morespecifically:•Itdefinesarchitecturemodelscontainingreconfigurableresourcesbasedontemplates. 4.DesignFlowforReconfigurableSystems-on-Chip101•Itdecidesthesystempartitioningontoreconfigurableresources(inadditiontohardwareandsoftware)basedontheanalysisresults.•Itmapsthepartitionedmodelontoselectedarchitecturemodels.•Itperformssystemlevelsimulationtoestimatetheperformanceandresourceusageoftheresultingsystem.Thearchitectureofthedeviceisdefinedpartlyinparallelandpartlyusingthesystemspecificationasinput.Theinitialarchitecturedependsonmanyfactorsinadditiontotherequirementsoftheapplication.Forexamplesacompanymayhaveexperienceandtoolsforcertainprocessorcoreorsemiconductortechnology,whichrestrictsthedesignspace.Moreover,thedesignofmanytelecomproductsdoesnotstartfromscratch,sincetheyimplementadvancedversionsofexistingdevices.Thereforetheinitialarchitectureandthehardware/softwarepartitioningisoftengivenatthebeginningofthesystemleveldesign.Therearealsocaseswherethereusepolicyofeachcompanymandatesdesignerstoreusearchitecturesandcodemodulesdevelopedinpreviousproducts.Theoldmodelsofanarchitecturearecalledarchitecturetemplates.Asfarasdynamicreconfigurationisconcerned,itrequirespartitioningtoaddressbothtemporalandspatialdimensions.Automaticpartitioningisstillanunsolvedproblem,butinspecificcasessolutionsfortemporalpartitioning[4],taskschedulingandcontextmanagement[10]havebeenproposed.InthecontextofindustrialSoCdesign,however,thesystempartitioningismostlyamanualeffort.Basedontheneedsandrequirementsforreconfiguration,theexecutablespecificationisanalyz
edinordertoidentifypartsthatcouldgainbenefitsfromimplementationonreconfigurableresources.Thisanalysiscanbesupportedbyestimationsofperformanceandareadonewithrespecttopre-selectedtechnologies,architecturesandIPs,e.g.specificISPandreconfigurabletechnology.Duringthemappingphase,thefunctionalitydefinedinexecutablespecificationisrefinedaccordingtothepartitioningdecisionssothatitcanbemappedontothedefinedarchitecture.Inordertoincludeinthesystemlevelsimulationtheeffectsofthechosenimplementationtechnology,differentestimationtechniquescanbeused:•Softwarepartsmaybecompiledforgettingrunningtimeandmemoryusageestimates.•Hardwarepartsmaybesynthesizedathighleveltogetestimatesofgatecountsandrunningspeed.•Thefunctionalblocksimplementedwithreconfigurablehardwarearealsomodelledsothattheeffectsofreconfigurationcanbeestimated.Finallysimulationsarerunatthesystemlevel,togetinformationconcerningtheperformanceandresourceusageofallarchitecturalunitsofthedevice. 1022Chapter4Efficientdesignspaceexplorationisthecoreoftheproposeddesignframework.Withrespecttothedesignofreconfigurablesystemsparts,itsupports:•Earlyestimationoffunctionblocks/processesforperformance(hardware,softwareandreconfigurable),cost(area)etc.•Systempartitioning,especiallymulticontextpartitioningandscheduling•Architecturedefinition•Mapping•Performanceevaluation.4.2ReconfigurationissuesatDetailedDesignThespecificationrefinementandtechnologyspecificdesigntransformthefunctionalblocksoftheexecutablemodeltodesigncomponentstargetingreconfigurablehardware(inadditiontohardwareandsoftware)accordingtothepartitioningdecisions.Importantissuesatthisstageincludeiterativeimprovementsinhardware,softwareandreconfigurablehardwarespecification.Thedesignerstakeintoaccountnotonlydesign(modelinglanguage,targetedplatform,co-simulationandtestingstrategy),butalsoeconomicalandproductsupportaspectsofthedesign,exploitingthespecificreconfigurablehardwarefeatures.Theintegrationphasecombinesthehardware,softwareandreconfigurablehardwarecomponentsintoasingleplat
formthataccommodatesalsoexternalIPe.g.processor,memory,I/Osub-systemmodels.Theintegrationphaseconsiderstwodifferentapproaches:languagebasedapproach(SystemC,OCAPI-XL)andtoolsorientedapproach(CoWareN2C)tocombinetheheterogeneouscomponentsofthetargetsystemonasingleplatform.Thereconfigurablehardwarerequirescommunicationmechanismstosoftwareand/orfixedhardwaretobeadded.Differenttypesofmechanismscanbechosentohandlecommunicationbetweenthecomponents:memorybasedcommunication,busbased,coprocessorstyleandevendatapathintegratedreconfigurablefunctionalunits.Busbasedcommunicationbetweenthecomponentsrequiresspecificinterfacesforboththereconfigurablefabricandhardware/softwaresidesofthesystem.Onthesoftwareside,driversarerequiredtoturnsoftwareoperationsintosignalsonthehardware.OntheFPGAfabricandhardwareside,interfacestothesystembusmustbebuilt.TheFPGAfabricandCPUcanalsocommunicatedirectlybysharedmemory.Regardingthesoftwareandfixedhardwaredesignflows,theydonotdifferfromtraditionalones.Forstaticallyreconfigurablehardwarethe 
design flow is similar to that of fixed hardware. For dynamically reconfigurable hardware, the module interfaces, communication and synchronization are designed according to the principles of a context scheduler. Specific HDL modeling rules need to be followed for multiple dynamically reconfigurable contexts [3, 9]. In the case of dynamic reconfiguration, mechanisms to handle context multiplexing are also needed. A high level scheme for describing dynamic reconfiguration should address how dynamically reconfigurable circuits compose with other circuits over a bus structure.

4.3 Reconfiguration issues at Implementation Design

Reconfiguration partitions the application temporally and multiplexes the programmable logic in time to meet the hardware resource constraints. When reconfiguration takes place at run time, the reconfiguration time is part of the run time overhead and has to be minimized. Also, multiple reconfiguration bitstreams need to be stored for the different contexts being multiplexed onto the programmable logic. This problem is exacerbated for System-on-Chip implementations where the entire application needs to be stored in on-chip memory.

When multiple context reconfigurable techniques are considered [3, 9], dedicated partitioning and mapping techniques are applied during the System Level Design phase. Later, during the Implementation Design step, an inter-context communication scheme has to be provided. Inter-context communication refers to how data or control information is transferred among different contexts. Usually, transfer registers are used to connect the previous (last) context with the current (next) context. Backup registers are also used to store the status values when a context switches out and later switches in. When bulk buffers are more practical for inter-context communication, memory regions can be allocated anywhere in the chip by using the memory mode of the reconfigurable cells. These memory regions can be accessed from all the contexts as shared buffers. It is instructive to compare this high bandwidth for inter-context communication with a multiple FPGA situation, where bandwidth is inherently limited to external pins. The huge bandwidth makes multi-context partitioning much easier than multi-FPGA partitioning.

5. CONCLUSIONS

The design flow for reconfigurable SoCs presented in the previous sections is divided into three phases. In the system level design phase, the requirements and specifications are captured; functionality in the form of an executable specification is analyzed, partitioned and mapped onto the architecture, and the performance of the system is validated. In the detailed design phase, the communication and modules are refined and transformed, integrated and co-verified through co-simulation or co-emulation. The implementation design maps the design onto the selected implementation platform. The implementation technologies treated in this methodology are software executed in an instruction set processor, traditional fixed hardware and dynamically reconfigurable hardware.

Emphasis is given to the system level part of the design flow, where methods for the modeling and simulation of the reconfigurable hardware parts of a reconfigurable SoC are required. Methods and tools in this direction are presented in Chapters 5 and 6 respectively.

REFERENCES

1. ADRIATIC Project IST-2000-30049 (2002) Deliverable D2.2: Definition of ADRIATIC High-Level Hardware/Software Co-Design Methodology for Reconfigurable SoCs. Available at: http://www.imec.be/adriatic
2. ADRIATIC Project IST-2000-30049 (2003) Deliverable D3.2: ADRIATIC back-end design tools for the reconfigurable logic blocks. Available at: http://www.imec.be/adriatic
3. ADRIATIC Project IST-2000-30049 (2004) Addendum to Deliverable D3.2: ADRIATIC back-end design tools for the reconfigurable logic blocks. Available at: http://www.imec.be/adriatic
4. Bobda C (2003) Synthesis of Dataflow Graphs for Reconfigurable Systems using Temporal Partitioning and Temporal Placement. PhD Dissertation, University of Paderborn
5. Cadence (2004) http://www.cadence.com/datasheets/vcc_environment.html
6. Cavalloro P, Gendarme C, Kronlof K, Mermett J, Van Sas J, Tiensyrja K, Voros NS (2003) System Level Design Model with Reuse of System IP. Kluwer Academic Publishers
7. CoWare Inc (2004) Available at: http://www.coware.com
8. Gioulekas F, Birbas M, Voros NS, Kouklaras G, Birbas A (2005) Heterogeneous System Level Co-Simulation for the Design of Telecommunication Systems. Journal of Systems Architecture (to appear), Elsevier
9. Keating M, Bricaud P (1999) Reuse Methodology Manual. Second Edition, Kluwer Academic Publishers
10. Maestre R, Kurdahi FJ, Fernandez M, Hermida R, Bagherzadeh N, Singh H (2001) A framework for reconfigurable computing: task scheduling and context management. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 9, issue 6, pp. 858–873
11. Masselos K, Pelkonen A, Cupak M, Blionas S (2003) Realization of wireless multimedia communication systems on reconfigurable platforms. Journal of Systems Architecture, vol. 49 (2003), no. 4-6, pp. 155-175
12. OCAPI-XL (2004) Available at: http://www.imec.be/ocapi/welcome.html
13. SystemC (2004) Available at: http://www.systemc.org
14. Tiwari V, Malik S, Wolfe A, Lee MTC (1996) Instruction level power analysis and optimization of software. Journal of VLSI Signal Processing, Kluwer Academic Publishers, pp. 223–238

Chapter 5

SYSTEMC BASED APPROACH

Yang Qu and Kari Tiensyrjä
VTT Electronics, P.O. Box 1100, FIN-90571 Oulu, Finland

Abstract: This chapter describes the SystemC based modelling techniques and tools that support the design of reconfigurable systems-on-chip (SoC). For designing reconfigurable parts at system level, we developed: 1) an estimation method and tool for estimating the execution time and the resource consumption of function blocks on dynamically reconfigurable logic to support system partitioning, 2) a SystemC based modeling method and tool for reconfigurable parts to allow fast design space exploration through 3) system-level simulation using transaction-level models of the system.

Keywords: Configuration overhead; context switching; design space exploration; dynamic reconfiguration; estimation; mapping; partitioning; reconfigurable; reconfigurability; SystemC; system-on-chip; workload model.

1. INTRODUCTION

Reconfigurability does not appear as an isolated phenomenon, but as a tightly connected part of the overall SoC design flow. The SystemC-based approach is therefore not intended to be a universal solution to support the design of any type of reconfigurability. Instead, we focus on a case where the reconfigurable components are mainly used as co-processors in SoCs.

SystemC 2.0 is selected as the backbone of the approach since it is a
standard language that provides designers with basic mechanisms like channels, interfaces and events to model the wide range of communication and synchronization found in system designs. More sophisticated mechanisms for the system-level design can be built on top of the basic constructs. Due to the standard language and open source reference implementation, SystemC 2.0 has become a language of choice for a growing number of system architects and system designers.

N.S. Voros and K. Masselos (eds.), System Level Design of Reconfigurable Systems-on-Chips, 107-131. © 2005 Springer. Printed in the Netherlands.

The SystemC based approach covers the reconfiguration extension and the related methods and tools that can be easily embedded into a SoC design flow. The system-level design part of the design flow presented in Chapter 4 is shown in Figure 5-1.

[Figure 5-1. System-level design part of proposed design flow. Blocks: System Requirements/Specification Capture, Architecture Template, IP, Architecture Definition, System Partitioning, Mapping, System-Level Simulation, System-Level Design.]

The following new features are identified in each phase of system-level design when reconfigurability is taken into account:
• System Requirements and Specification Capture needs to identify requirements and goals of reconfigurability.
• Architecture Definition needs to treat the reconfigurable resources as abstract models and include them in the architecture models.
• System Partitioning needs to analyze and estimate the functions of the application for software, fixed hardware and reconfigurable hardware.
• Mapping needs to map functions allocated to reconfigurable hardware onto the respective architecture model.
• System-Level Simulation needs to observe the performance impacts of architecture and reconfigurable resources.

In the SystemC based approach, we assume that the design does not start from scratch, but that it is a more advanced version of an existing device. The new architecture is defined partly based on the existing architecture and partly using the system specification as input. The initial architecture often depends on many things that do not directly result from the requirements of the application. The company may have experience and tools for a certain processor core or semiconductor technology, which restricts the design space and may produce an initial hardware/software (HW/SW) partition. Therefore, the initial architecture and the HW/SW partition are often given at the beginning of the system-level design.

The SystemC extension is designed to work with a SystemC model of the existing device, adapting the design to take dynamically reconfigurable hardware into account. Figure 5-2 (a) gives a graphical view of the initial architecture, and Figure 5-2 (b) shows the architecture modified using the SystemC based extensions. The SystemC based approach incorporates dynamically reconfigurable parts into the architecture by replacing the SystemC models of some hardware accelerators with a single SystemC model of a reconfigurable block. The objective of the SystemC based extensions is to provide a mechanism that allows designers to easily test the effects of implementing some components in the dynamically reconfigurable hardware.

The SystemC based approach provides the following support:
• Analysis support for design space exploration and system partitioning.
• Reconfigurability modelling by using standard mechanisms of SystemC.
• System-level simulation using transaction-level models of the application workload and the architecture.

[Figure 5-2. (a) Typical SoC architecture and (b) modified architecture using dynamically reconfigurable hardware. Blocks in (a): SW functions, CPU, DMA, MEM, HW Accelerators. Blocks in (b): SW functions, CPU, DMA, MEM, Reconfigurable fabric holding the HW accelerator functionality.]
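The effect of replacing several accelerator models with one reconfigurable block can be illustrated with a minimal sketch in plain C++. It does not use the SystemC library and is not the implementation of the SystemC based extensions; the names Context, ReconfigurableBlock, fixed_total and reconfigurable_total, as well as the timing fields and their units, are assumptions introduced only for this illustration. The sketch assumes that a fixed accelerator costs only its execution time per call, while the reconfigurable block additionally pays a reconfiguration time whenever the called function is not the one currently loaded.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical description of one accelerator function (one "context").
struct Context {
    std::string name;         // label, used only for readability
    double exec_time_us;      // assumed execution time per call
    double reconfig_time_us;  // assumed time needed to load this context
};

// Hypothetical stand-in for a single reconfigurable block that hosts
// several accelerator functions, with only one of them loaded at a time.
class ReconfigurableBlock {
public:
    explicit ReconfigurableBlock(std::vector<Context> contexts)
        : contexts_(std::move(contexts)) {}

    // Simulate one call to function `id` and return the time it costs.
    // A reconfiguration is charged when `id` is not the loaded context.
    double call(std::size_t id) {
        assert(id < contexts_.size());
        double cost = contexts_[id].exec_time_us;
        if (!loaded_ || active_ != id) {
            cost += contexts_[id].reconfig_time_us;
            active_ = id;
            loaded_ = true;
            ++reconfig_count_;
        }
        return cost;
    }

    std::size_t reconfig_count() const { return reconfig_count_; }

private:
    std::vector<Context> contexts_;
    std::size_t active_ = 0;
    bool loaded_ = false;
    std::size_t reconfig_count_ = 0;
};

// Total time of a call trace when every function has its own fixed accelerator.
double fixed_total(const std::vector<Context>& contexts,
                   const std::vector<std::size_t>& trace) {
    double total = 0.0;
    for (std::size_t id : trace) {
        assert(id < contexts.size());
        total += contexts[id].exec_time_us;
    }
    return total;
}

// Total time of the same trace when all functions share one reconfigurable block.
double reconfigurable_total(const std::vector<Context>& contexts,
                            const std::vector<std::size_t>& trace) {
    ReconfigurableBlock block(contexts);
    double total = 0.0;
    for (std::size_t id : trace) {
        total += block.call(id);
    }
    return total;
}
```

With two contexts of 10 and 20 time units execution time, 100 and 150 time units reconfiguration time, and the call trace 0, 0, 1, 0, the fixed variant costs 50 time units and the shared block costs 400, because three of the four calls trigger a reconfiguration. Even this crude model shows why the call order and the reconfiguration time are the parameters a designer would want to vary during exploration.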
2. SYSTEMC 2.0 OVERVIEW

SystemC is a standard modelling language based on C++. Its version 1 provides a class library that implements objects like processes, modules, ports, signals and data types for hardware modelling. The model is compiled by a standard C++ compiler for execution on an event based simulation kernel. Version 2 introduces a language architecture shown in Figure 5-3 [1]. It provides core language constructs like channels, interfaces and events for system-level modelling. Elementary and more sophisticated channels can be built using the core language to support various communication, synchronization and model of computation paradigms. The basic system-level constructs of the language are introduced in the following sections, but for more complete information it is advisable to read the Functional Specification for SystemC 2.0 [2].

Figure 5-3. SystemC language architecture (layers from top to bottom):
• Standard Channels for Various MOC's (Kahn Process Networks, Static Dataflow, etc.) and Methodology-Specific Channels (Master/Slave Library, etc.)
• Elementary Channels (Signal, Timer, Mutex, Semaphore, Fifo, etc.)
• Core Language (Modules, Ports, Processes, Interfaces, Channels, Events) and Data Types (Logic Type (01XZ), Logic Vectors, Bits and Bit Vectors, Arbitrary Precision Integers, Fixed Point Integers)
• C++ Language Standard

2.1 Channels

SystemC 2.0 channels implement one or many interfaces and they contain the functionality of the communication. Channels are used especially in designing and simulating the functionality of buses. Functionality such as addresses, addressing schemes, priorities, buffer sizes etc. can be configured
at run time, and therefore the effect of these design decisions can be simulated easily without large modifications to the code. Also, since it is possible to attach multiple ports to an interface, the number of bus masters or slaves can be chosen at compile time without modifying the bus code. When system level modules are implemented correctly for use of parameters and a variable number of connected ports, design space exploration becomes an easy task.

2.2 Ports and Interfaces

The model of communication in SystemC 2.0 can be more abstract than in a register-transfer level (RTL) description. The user can define a set of interface methods that modules use for communication. For example, a system level model of a memory controller can contain three interface methods: a read method, a write method and a burst read method. The actual behavioural implementation of a method is left to the module that provides the interface. The module that uses an interface does this via a port. This way the detailed implementation of an interface can be separated from the object that is using the interface. Using interfaces also makes it simpler to simulate and measure the effect of, for example, burst reading on the performance of a system. This is called transaction level modelling (TLM).

2.3 Events and Dynamic Sensitivity

Events are low-level synchronization mechanisms. They can be used to transfer control from one process to another. The effect can occur immediately, after the next delta cycle or after some defined time. Dynamic sensitivity in SystemC 2.0 means that a process can alter its sensitivity list during run time. A process can wait for any set of events or for a time. This makes, for example, the design and simulation of state machines easy, and errors are reduced since the sensitivity list can be reduced to a minimum in each state.

3. OVERVIEW OF SYSTEMC BASED EXTENSIONS

Since SystemC promotes the openness of the language and the standard, a new domain can be added to the core language itself. However, a preferred method is to model the basic constructs required for modelling and simulation of reconfigurable hardware (RHW) using basic constructs of the language, thereby preserving the compatibility with
existing tools and designs. For this reason, the extension does not intend to extend the SystemC 2.0 language itself.

The terms and concepts specific to the SystemC based approach used in the following sections are defined as follows:
• Candidate Component: Candidate components denote those application functions that are considered to gain benefits from their implementation on a reconfigurable hardware resource. The decision whether a task should be a candidate component is clearly application dependent. The criterion is that the task should have two features in combination: flexibility (which would exclude an ASIC implementation) and high computational complexity (which would exclude a software implementation). Flexibility may come either from the fact that the task will be upgraded in the future, or from hardware resource sharing with other tasks with non-overlapping lifetimes for global area optimization.
• Dynamically reconfigurable fabric (DRCF): The dynamically reconfigurable fabric is a system-level concept that represents a set of candidate components and the required reconfiguration support functions, which later on in the design process can be implemented on a reconfigurable hardware resource.
• DRCF component: The DRCF component is a transaction-level SystemC module of the DRCF. It consists of functions that mimic the reconfiguration process, and of the instances of the SystemC modules of the candidate components that present their functionality during system-level simulation. It can automatically detect a reconfiguration request and trigger the reconfiguration process when necessary.
• DRCF template: The DRCF template is an incomplete SystemC module from which the DRCF component is created.

The SystemC based extensions [3] are highlighted in the modified version of the System-Level Design diagram shown in Figure 5-4. The three focuses are estimation support, the DRCF modelling method and system simulation.
• The estimation approach [4] is based on a prototype tool that can produce estimates of software execution time on an instruction-set processor (ISP) and estimates of hardware execution time and resource consumption on an FPGA. The estimates provide information for system partitioning and selection of candidate components. When a full SW/HW/RHW system partitioning is considered, traditional analysis methods and tools are still required.
• The DRCF modelling method [5, 6] focuses on the modelling of the reconfiguration overhead. Modelling the functionality of the candidate components that are mapped onto the reconfigurable resources is not affected by the extension. Different features associated with the reconfiguration technology are not directly modelled. Instead, the model describes the behaviour of the reconfiguration process and relates the performance impact of the reconfiguration process to a set of parameters that are extracted and annotated from the reconfiguration technology. Thus, by tuning the parameters, designers can easily evaluate trade-offs among different technology alternatives and perform fast design space exploration at the system level.
• The system-level simulation is based on the transaction-level SystemC model and uses abstract workload and capacity models of application and architecture for performance evaluation and for studying alternative architectures and mappings.

[Figure 5-4. SystemC reconfigurability extensions for system-level design. Blocks: ISA and FPGA Technology Models, Estimation, C/C++ Algorithm Specification, System Partitioning (analysis and decomposition), Architecture Template, DRCF Modelling, DRCF Template, System-Level Simulation, Transaction-level SystemC model, System-Level Design.]

4. ESTIMATION APPROACH TO SUPPORT SYSTEM ANALYSIS

System analysis is applied in two phases in the SystemC based approach. In the first phase, it focuses on HW/SW partitioning and helps designers to create the initial architecture based on an agreed partitioning decision. The initial architecture sets the starting point from which the SystemC based approach produces the system-level model for the architecture including the DRCF component, which is a corresponding SystemC model of the dynamically reconfigurable hardware with the modules to be implemented in
it. In the second phase, system analysis focuses on studying the trade-off between performance and flexibility and helps designers to identify candidate components to be implemented in the dynamically reconfigurable hardware.

System analysis is performed by designers mainly based on their experience, which may not produce reliable results in all cases, especially if designers have to carry out system analysis from scratch. In this section, an estimation approach to support the work of system analysis is presented.

The estimation approach focuses on a reconfigurable architecture in which there is a RISC processor, an embedded FPGA, and a system bus as a communication channel. It starts from function blocks represented in the C language and produces the following estimates for each function block: software execution time in terms of running the function on the RISC core, mappability of the function and the RISC core, hardware execution time in terms of running the function on the embedded FPGA, and resource utilization of the embedded FPGA. The framework of the estimation approach is shown in Figure 5-5.

[Figure 5-5. Estimation framework. Blocks: C-code Function block, SUIF, CDFG, High-level synthesis-based HW estimator, System analysis, HW resource utilization, Mappability, Speedup, Supporting Attributes.]

Blocks inside the shaded area are the functions performed by the estimation approach and the data representations used by the estimation approach. Detailed explanations are given in the following sections. Outside the shaded area, the blocks named “Function block” serve as input to the estimation approach. These function blocks can either be the results of system decomposition, with the granularity decided by designers, or they can be the corresponding SystemC modules from the initial architecture. In the former case, the estimation approach is meant for the first phase of system analysis, which is to help designers to make the trade-off between hardware implementation and software implementation. In the latter
case, the estimation approach is meant for the second phase of system analysis, which is to help designers to evaluate the trade-off between performance and flexibility when comparing a fixed hardware implementation and a dynamically reconfigurable hardware implementation. Estimates of the hardware resource utilization of the modules are fed into the SystemC extension as separate parameters.

4.1 Creation of Control/Data Flow Graph from C Code

The control/data flow graph (CDFG) is a combined representation of the data flow graph (DFG), which exposes the data dependence of algorithms, and the control flow graph (CFG), which captures the control relation of DFGs. A C-based function block is used as the starting point and the CDFG is used as the intermediate representation of the estimation approach.

The SUIF compiler [7] is used as a front-end tool to analyze the C code, and a purpose-specific code converter is used to transform the SUIF intermediate representation into a CDFG. The main process in the conversion is to find basic blocks, which contain only sequential execution without any jump in between, and to map each of them onto a single DFG and the jump statements between the basic blocks onto the control relation of DFGs. The characteristics of the C functions are studied through profiling, and the profiling data are attributes in the target CDFG.

4.2 High-Level Synthesis-Based Hardware Estimation

A graphical view of the hardware estimation is shown in Figure 5-6. Taking the CDFG with corresponding profiling information and a model of the embedded FPGA as inputs, the hardware estimator carries out a high-level synthesis-based approach to produce the estimates. The main tasks performed in the hardware estimator, as well as in a real high-level synthesis tool, are scheduling and allocation. Scheduling is the process in which each operator is scheduled in a certain control step, which is usually a single clock cycle, or across several control steps if it is a multi-cycle operator. Allocation is the process in which each representative in the CDFG is mapped to a physical unit, e.g. variables to registers, and the interconnection of physical units is established.

The embedded FPGA is viewed as a co-processing unit, which can independently perform a large amount of computation without constant supervision of the RISC processor. The basic construction units of the embedded FPGA are static random access memory (SRAM)-based look-up tables (LUT) and certain types of specialized function units, e.g. a custom-designed multiplier. Routing resources and their capacity are not taken into account. The model of the embedded FPGA is in the form of a mapping table. The index of the table is the type of the function unit, e.g. adder. The value mapped to each index is the hardware resources, in terms of the number of LUTs and the number of specialized units, required for this type of function unit.

[Figure 5-6. High-level synthesis-based hardware estimation. Blocks: CDFG, Embedded FPGA model, ASAP, ALAP, Modified FDS, Allocation, HW execution time, Resource utilization.]

As-soon-as-possible (ASAP) scheduling and as-late-as-possible (ALAP) scheduling [8] determine the critical paths of the DFGs, which together with the control relation of the CFGs are used to produce the estimate of hardware execution time. For each operator, the ASAP and ALAP scheduling processes also set the range of clock cycles within which it could be legally scheduled without delaying the critical path. These results are required in the next scheduling process, a modified version of force-directed scheduling (FDS) [9], which intends to reduce the number of function units, registers and buses required by balancing the concurrency of the operations assigned to them without lengthening the total execution time. The modified FDS is used to estimate the hardware resources required for function units.

Finally, allocation is used to estimate the hardware resources required for the interconnection of function units. The work of allocation is divided into three parts: register allocation, operation assignment and interconnection binding. In register allocation, each variable is assigned to a certain register. In operation assignment, each operator is assigned to a certain function unit. Both are solved using the weighted-bipartite algorithm, and the common objective is that each assignment should introduce the least number of interconnection units that
will be determined in the last phase, the interconnection binding. In this approach, the multiplexer is the only type of interconnection unit, which eases the work of interconnection binding. The number and type of multiplexers can be determined simply by counting the number of different inputs to each register and each function unit.

4.3 Mappability Based Software Estimation

The software estimator produces two estimates: software execution time, and mappability of an architecture-algorithm pair. A profile-directed, operation-counting based static technique is used to estimate software execution time. The architecture of the target processor core is not taken into account in the timing analysis.

The main idea of estimating the software execution time is as follows. First, the number of operations of each type is counted from the CDFG. Then, each type of operation node in the CDFG is mapped to one or a set of instructions of the target processor in a pre-defined manner. Then the total number of instructions is calculated from the results of the first two steps using only multiplication and addition. Finally, with the assumption that these instructions are performed with an ideal pipeline, the software execution time is the product of the total number of instructions and the period of the clock cycle.

Mappability of an architecture-algorithm pair means the degree of matching between the resources provided by the processor architecture and the requirements described by the algorithm [10]. The mappability estimate is calculated via a set of correlation functions, which take into account the instruction set, register structure, bus efficiency, branch effect, pipeline efficiency and parallelism. CAMALA is a prototype tool to study the mappability of an architecture-algorithm pair. It takes a CDFG as input and produces an estimate of mappability within the range from 0 to 1. An optimal mapping is an exact mapping with a value of one, and both over-required resources and under-utilized resources are reflected as poor mapping results with values near zero.

4.4 Candidate Component Selection

Candidate component selection is an application-dependent procedure. When global resource saving is an issue, the resource estimates are important inputs. However, to make justified decisions, other information, such as power consumption, should be included as inputs. More importantly, the control/data dependence between candidate components should be analyzed. Obviously, there should be control dependence between candidate components that are mapped to different contexts. The current approach does not include automated tools to support the analysis. Other tools and manual analysis are the solutions for now.

5. MODELLING RECONFIGURATION OVERHEAD

The modelling method of the DRCF focuses on how to represent the reconfiguration overhead and how to reveal its performance impact during system simulation. The candidate components that are mapped onto the reconfigurable resources are hardware accelerator tasks. Reconfiguration is required when a called task is not loaded in the reconfigurable resources. The difference in handling incoming messages between tasks mapped to a fixed accelerator and tasks mapped to reconfigurable resources is shown in Figure 5-7.

[Figure 5-7. (a) Handling incoming messages as a fixed hardware accelerator. (b) Handling incoming messages as a reconfigurable task. In (a), an incoming message leads directly to executing the functionality. In (b), the question "is the access targeted to an active context?" is asked first; if yes, the functionality is executed; if no, a reconfiguration request is issued and the functionality is executed once the reconfiguration is done.]

The idea of the DRCF is to automatically capture the reconfiguration request and trigger the reconfiguration. In addition, a tool has been developed to automate the process that replaces candidate components by a DRCF component, so system designers can easily carry out test-and-try iterations and the design space exploration process becomes easier. In order for the DRCF component to be able to capture and understand incoming messages, the SystemC modules of the candidate components must implement the read(), write(), get_low_addr() and get_high_addr() interface methods shown in the code below.

class bus_slv_if : public virtual sc_interface
{
public:
  virtual sc_uint