Springer.System.Level.Design.of.Reconfigurable.SoC


ID: 34905918

Size: 6.53 MB

Pages: 220

Date: 2019-03-13

Uploader: U-14522

SYSTEM LEVEL DESIGN OF RECONFIGURABLE SYSTEMS-ON-CHIP

Edited by

NIKOLAOS S. VOROS
INTRACOM S.A., Patra, Greece

and

KONSTANTINOS MASSELOS
Imperial College of Science Technology and Medicine, London, U.K.

A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN-10 0-387-26103-6 (HB)
ISBN-13 978-0-387-26103-4 (HB)
ISBN-10 0-387-26104-4 (e-book)
ISBN-13 978-0-387-26104-1 (e-book)

Published by Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands.
www.springeronline.com

Printed on acid-free paper

All Rights Reserved
© 2005 Springer
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

Printed in the Netherlands.

Contents

Contributing Authors
Preface
Acknowledgments

Part A Reconfigurable Systems

Introduction to Reconfigurable Hardware
KONSTANTINOS MASSELOS AND NIKOLAOS S. VOROS

Reconfigurable Hardware Exploitation in Wireless Multimedia Communications
KONSTANTINOS MASSELOS AND NIKOLAOS S. VOROS

Reconfigurable Hardware Technologies
KONSTANTINOS MASSELOS AND NIKOLAOS S. VOROS

Part B System Level Design Methodology

Design Flow for Reconfigurable Systems-on-Chip
KONSTANTINOS MASSELOS AND NIKOLAOS S. VOROS

SystemC Based Approach
YANG QU AND KARI TIENSYRJÄ

OCAPI-XL Based Approach
MIROSLAV ČUPÁK AND LUC RIJNDERS

Part C Design Cases

MPEG-4 Video Decoder
MIROSLAV ČUPÁK AND LUC RIJNDERS

Prototyping of a HIPERLAN/2 Reconfigurable System-on-Chip
KONSTANTINOS MASSELOS AND NIKOLAOS S. VOROS

WCDMA Detector
YANG QU, MARKO PETTISSALO AND KARI TIENSYRJÄ
Contributing Authors

Miroslav Cupak
IMEC, Kapeldreef 75, B-3001 Leuven, Belgium

Konstantinos Masselos
Imperial College of Science Technology and Medicine, Exhibition Road, London, SW7 2BT, United Kingdom

Marko Pettissalo
Nokia Technology Platforms, P.O. Box 50, FIN-90571 Oulu, Finland

Yang Qu
VTT Electronics, P.O. Box 1100, FIN-90571 Oulu, Finland

Luc Rijnders
IMEC, Kapeldreef 75, B-3001 Leuven, Belgium

Kari Tiensyrjä
VTT Electronics, P.O. Box 1100, FIN-90571 Oulu, Finland

Nikolaos S. Voros
INTRACOM S.A., 254 Panepistimiou str., 26443, Patra, Greece

Preface

This book presents the perspective of the ADRIATIC project for the design of reconfigurable systems-on-chip, as perceived in the course of the research during 2001-2004. The project provided: (a) a high-level hardware/software co-design and co-verification methodology and tools for reconfigurable systems-on-chip, supplemented with back-end design tools for the implementation of the reconfigurable logic blocks of the chip, (b) the definition of the technological requirements for reconfigurable processors for wireless terminals and (c) the implementation of MPEG-4, WCDMA and WLAN design cases to validate the methodology and tools.

Reconfigurability is becoming an important part of System-on-Chip (SoC) design to cope with the increasing demands for simultaneous flexibility and computational power. Current hardware/software co-design methodologies provide little support for dealing with the additional design dimension introduced. Further support at the system level is needed for the identification and modelling of dynamically re-configurable function blocks, for efficient design space exploration, partitioning and mapping, and for performance evaluation. The overhead effects, e.g. context switching and configuration data, should be included in the modelling already at the system level in order to produce credible information for decision-making.

This book focuses on hardware/software co-design applied to reconfigurable SoCs. We discuss exploration of additional requirements due to reconfigurability, report extensions to two C++ based languages/methodologies, SystemC and OCAPI-XL, to support those requirements, and present results of three case studies in the wireless and multimedia communication domain that were used for the validation of the approaches.

The book includes nine chapters, divided in three parts: Part A contains Chapters 1-3 and provides an introduction to reconfigurable systems-on-chip; Part B contains Chapters 4-6 and describes in detail the proposed system level design methodology and the associated tools; Part C, which contains Chapters 7-9, provides the details of applying the proposed methodology in practice.

Acknowledgments

The research work that provided the material for this book was carried out during 2001-2004 mainly in the ADRIATIC Project (Advanced Methodology for Designing ReconfIgurable SoC and Application-Targeted IP-entities in wireless Communications) supported partially by the European Commission under the contract IST-2000-30049. Guidance and comments of Mr Ronan Burgess, Dr Lech Jozwiak and Dr Mark Hellyar on research direction are highly appreciated.

In addition to the authors, the contributions of the following project members and partners' personnel are gratefully acknowledged: Antti Anttonen, Spyros Blionas, Kristof Denolf, Klaus Kronlöf, Tarja Leinonen, Dimitris Metafas, Robert Pasko, Antti Pelkonen, Konstantinos Potamianos, Tapio Rautio, Geert Vanmeerbeeck, Serge Vernalde, Peter Vos, Erik Watzeels, Matti Weissenfelt and Yan Zhang. Of them, the editors express their special thanks to Antti Pelkonen and Yan Zhang for their valuable contributions to Chapter 5 and Chapter 9, Robert Pasko and Geert Vanmeerbeeck for their valuable contributions to Chapter 6, Kristof Denolf and Peter Vos for their substantial contributions to Chapter 7 and Serge Vernalde and Erik Watzeels for management related issues.
PART A
RECONFIGURABLE SYSTEMS

Chapter 1

INTRODUCTION TO RECONFIGURABLE HARDWARE

Konstantinos Masselos (1, 2) and Nikolaos S. Voros (1)
(1) INTRACOM S.A., Hellenic Telecommunications and Electronics Industry, Greece
(2) Currently with Imperial College of Science Technology and Medicine, United Kingdom

Abstract: This chapter introduces the reader to the main concepts of reconfigurable computing and reconfigurable hardware. Different types of reconfiguration are discussed. A detailed classification of reconfigurable architectures with respect to the granularity of their building blocks, the reconfiguration scheme and the system level coupling is also presented.

Keywords: Reconfigurable hardware, reconfigurable architectures, reconfiguration, reconfigurable computing

N.S. Voros and K. Masselos (eds.), System Level Design of Reconfigurable Systems-on-Chips, 15-26. © 2005 Springer. Printed in the Netherlands.

1. RECONFIGURABLE COMPUTING AND RECONFIGURABLE HARDWARE

Reconfigurable computing refers to systems incorporating some form of hardware programmability, that is, customizing how the hardware is used by means of a number of physical control points [2]. These control points can then be changed periodically in order to execute different applications using the same hardware.

Reconfigurable hardware offers a good balance between implementation efficiency and flexibility, as shown in Figure 1-1. This is because reconfigurable hardware combines post-fabrication programmability with the spatial (parallel) computation style [2] of application specific integrated circuits (ASICs), which is more efficient than the temporal (sequential) computation style of instruction set processors.

Due to the increasing flexibility requirements (e.g. for adaptation to different evolving standards and operating conditions) that are imposed by computationally intensive applications such as wireless communications, devices need to be highly adaptable to the running applications. On the other hand, efficient realizations of such applications are required, especially in the resources they use during deployment, where power consumption must be traded against the perceived quality of the application. The contradictory requirements for flexibility and implementation efficiency cannot be satisfied by conventional instruction set processors and ASICs. Reconfigurable hardware forms an interesting implementation option in such cases.

[Figure 1-1: implementation options plotted as flexibility against area/power, ranging from general purpose and embedded instruction set processors, DSPs and ASIPs (temporal computation style, limited parallelism), through reconfigurable processors and embedded reconfigurable logic/FPGA (post-fabrication programmability), to dedicated/direct mapped hardware (ASIC; spatial computation style, unlimited parallelism); the span between the extremes is marked as a factor of 100-1000.]

Figure 1-1. Positioning of reconfigurable hardware

There are also other reasons to use reconfigurable resources in system-on-chip (SoC) design. The increasing non-recurring engineering (NRE) costs push designers to use the same SoC in several applications and products in order to achieve a low cost per chip. The presence of reconfigurable resources allows the fine tuning of the chip for different products or product variations. Also, the increasing complexity of future designs raises the possibility of including design flaws, which can require a costly and slow redesign of the chip. Reconfigurable elements are often homogeneous arrays, which can be pre-verified to minimize the possibility of design errors. In addition, post-manufacturing programmability allows problems to be corrected or circumvented at a later stage than is possible with fixed hardware.
2. TYPES OF RECONFIGURATION

The next paragraphs describe different types of reconfiguration.

2.1 Logic reconfiguration

A typical logic block of a reconfigurable architecture contains a look-up table (LUT), an optional D flip-flop and additional combinational logic. The LUT allows any function of its inputs to be implemented, providing generic logic. The flip-flop can be used for pipelining, registers, state holding functions for finite state machines, or any other situation where clocking is required. The combinational logic is usually fast carry logic, used to speed up carry-based computations such as addition, parity, wide AND operations and other functions. The logic blocks located at the periphery of the device can have a different architecture dedicated to I/O operations.

The logic blocks are grouped into matrices overlaid with a reconfigurable interconnection network of wires. Interconnection network reconfiguration is controlled by changing the connections between the logic blocks and the wires and by configuring the switch boxes, which connect different wires.

The reconfiguration of both the logic blocks and the interconnection network is achieved by using SRAM memory bits to control the configuration of transistors. The functionality of the logic blocks, I/O blocks and the interconnection network is modified by downloading a bitstream of reconfiguration data onto the hardware.

2.2 Instruction-set reconfiguration

The concept of instruction-set reconfiguration refers to hybrid architectures consisting of a microprocessor and reconfigurable logic. The key benefit is a combination of full software flexibility with high hardware efficiency. One promising approach is the reconfigurable instruction set processor (RISP), which can adapt its instruction set to the application being executed through a reconfiguration of its hardware. The result is a reconfigurable and extensible processor architecture, which can be tailored closely to the designers' specific needs. Through the adaptation, specialized hardware accelerates the execution of the applications. If shared resources are used in the adaptation, the functional density is also improved. By moving the application-specific data-paths into the processor, a remarkable improvement in performance compared to fixed instruction-set processors can be achieved. At the same time, designing at the level of the instruction-set architecture significantly shortens the design cycle and reduces verification effort and risk. On the other hand, new methodologies, tools and processor foundations are required. Automated extension of processor function units and of the associated software environment (compilers, debuggers, instruction simulators, etc.) is also key to success.

Different systems with different characteristics have been designed. Usually two main design questions are involved:

1. What is the type of interface between the microprocessor and the reconfigurable logic?
2. How should the reconfigurable logic itself be designed?

Each of them involves many trade-offs. Reconfigurable processors are commonly classified according to the coupling level of the reconfigurable logic. The concept of coupling levels also applies without reference to reconfigurable processors. As shown in Figure 1-2, there are three types of coupling levels:

[Figure 1-2: a processor containing an RFU, a co-processor next to it, and an attached processor on the I/O bus, connected through the main bus to memory.]

Figure 1-2. Basic coupling levels of reconfigurable logic

1. Reconfigurable functional unit (RFU) - the logic is placed inside the processor, and the instruction decoder issues instructions to the reconfigurable unit as if it were one of the standard functional units of the processor. In this way, the communication cost is very small, so the speed can easily be increased. This is also the most promising approach because it can be used to accelerate almost any application [1].
2. Coprocessor - the logic is next to the processor. Communication is done using a protocol.
3. Attached processor - the logic is placed on some kind of I/O bus.

With the coprocessor and attached processor approaches, the speed improvement obtained from the reconfigurable logic has to compensate for the overhead of transferring the data. This usually happens in applications where a huge amount of data has to be processed using a simple algorithm that fits in the reconfigurable logic.

2.3 Static and dynamic reconfiguration

There are two basic reconfiguration approaches: static and dynamic.

2.3.1 Static reconfiguration

Static reconfiguration (often referred to as compile-time reconfiguration) is the simplest and most common approach for implementing applications with reconfigurable logic. Static reconfiguration involves hardware changes at a relatively slow rate. It is a static implementation strategy where each application consists of one configuration. The main objective is to improve the performance.

[Figure 1-3: a single Configure step followed by Execute.]

Figure 1-3. Principle of static reconfiguration

The distinctive feature of this approach is that it consists of a single system-wide configuration. Prior to commencing an operation, the reconfigurable resources are loaded with their respective configurations. Once operation commences, the reconfigurable resources remain in this configuration throughout the operation of the application. Thus hardware resources remain static for the life of the design (or application). This is depicted in Figure 1-3.

Much higher performance than a pure software implementation (e.g. microprocessor approaches), a cost advantage over ASICs in certain cases and conventional CAD tool support are the main advantages of this technology.

2.3.2 Dynamic reconfiguration

Whereas static reconfiguration allocates logic for the duration of an application, dynamic reconfiguration (often referred to as run-time reconfiguration) uses a dynamic allocation scheme that re-allocates hardware at run-time. This is an advanced technique that some people regard as a flexible realization of the time/space trade-off. It can increase system performance by using highly optimized circuits that are loaded and unloaded dynamically during the operation of the system, as depicted in Figure 1-4. In this way system flexibility is maintained and functional density is increased [9].

[Figure 1-4: Configure and Execute steps alternating during operation.]

Figure 1-4. Principle of dynamic reconfiguration

Dynamic reconfiguration is based upon the concept of virtual hardware, which is similar to the idea of virtual memory. Here, the physical hardware is much smaller than the sum of the resources required by all of the configurations. Therefore, instead of reducing the number of configurations that are mapped, we swap them in and out of the actual hardware as they are needed.

There are two main design problems for this approach. The first is to divide the algorithm into time-exclusive segments that do not need to (or cannot) run concurrently. This is referred to as temporal partitioning. Because no CAD tools support this step, it requires tedious and error-prone user involvement. The second problem is to co-ordinate the behaviour between different configurations, i.e. to manage the transmission of intermediate results from one configuration to the next [8].
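
The virtual hardware idea can be illustrated with a small simulation. The sketch below is not taken from the ADRIATIC tools; the class name, the capacity measured in abstract logic blocks and the least-recently-used eviction policy are all assumptions made for illustration. It only shows how configurations whose total size exceeds the physical device are swapped in and out, and how each swap counts as reconfiguration overhead.

```python
from collections import OrderedDict


class VirtualHardware:
    """Toy model of a dynamically reconfigurable device (hypothetical names)."""

    def __init__(self, capacity):
        self.capacity = capacity      # logic blocks physically available (assumed unit)
        self.loaded = OrderedDict()   # configuration name -> size, in LRU order
        self.reconfigurations = 0     # number of configuration loads so far

    def execute(self, name, size):
        """Run one time-exclusive segment, loading its configuration if needed."""
        if size > self.capacity:
            raise ValueError(f"{name} does not fit on the device")
        if name in self.loaded:
            self.loaded.move_to_end(name)   # already resident: no reload needed
            return
        # Evict least recently used configurations until the new one fits
        # (LRU is an assumed policy, chosen only to keep the sketch short).
        while sum(self.loaded.values()) + size > self.capacity:
            self.loaded.popitem(last=False)
        self.loaded[name] = size
        self.reconfigurations += 1


# A temporal partitioning into segments that never run concurrently.
hw = VirtualHardware(capacity=100)
for segment, blocks in [("fft", 60), ("viterbi", 70), ("fft", 60), ("fft", 60)]:
    hw.execute(segment, blocks)
print(hw.reconfigurations, list(hw.loaded))
```

In this schedule the two configurations need 130 blocks in total but share a 100-block device, at the price of three configuration loads. A system-level model would attach a time and energy cost to each load, which is the overhead the text says must be weighed against the area saving.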
3. CLASSIFICATION OF RECONFIGURABLE ARCHITECTURES

In this section reconfigurable hardware architectures are classified with respect to several parameters. These parameters are described below:

• Granularity of building blocks
This refers to the levels of manipulation of data. In this chapter we distinguish three types of granularity: fine-grain, which corresponds to bit-level manipulation of data; medium-grain, which manipulates data with a varying number of bits; and coarse-grain, which implies word level operations.

• Reconfiguration scheme
Systems can be reconfigured statically or dynamically. Dynamically reconfigurable systems permit the partial reconfiguration of certain logic blocks while others are performing computations. Statically reconfigurable devices require execution to be interrupted.

• Coupling
This refers to the degree of coupling with a host microprocessor. In a closely coupled system, reconfigurable units are placed on the data path of the processor, acting as execution units. Loosely coupled systems act as a coprocessor. They are connected to a host computer system through channels or some special-purpose hardware.

3.1 Classification with respect to building blocks granularity

The granularity criterion reflects the smallest block of which a reconfigurable device is made.

In fine-grained architectures, the basic programmed building block usually consists of a combinatorial network and a few flip-flops. The logic block can be programmed into a simple logic function, such as a 2-bit adder. These blocks are connected through a reconfigurable interconnection network. More complex operations can be constructed by reconfiguring this network. Commercially available Field Programmable Gate Arrays (FPGAs) are based on fine-grain architectures. Although highly flexible, these systems exhibit a low efficiency when it comes to more specific tasks. For example, although an 8-bit adder can be implemented in a fine-grained circuit, it will be inefficient, compared to a reconfigurable array of 8-bit adders, when performing an addition-intensive task. An 8-bit adder will also occupy more space in the fine-grained implementation.
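
The 8-bit adder example can be made concrete with a minimal sketch of a fine-grained fabric. The code below is an illustration only: the helper names are hypothetical, and it assumes 3-input look-up tables whose truth tables play the role of the configuration bits. Two LUTs are configured as the sum and carry outputs of a full adder, and eight such stages are chained into a ripple-carry adder.

```python
def make_lut(truth_table):
    """Return a k-input logic function backed by 2**k configuration bits."""
    def lut(*inputs):
        index = 0
        for bit in inputs:   # the first input is the most significant address bit
            index = (index << 1) | (bit & 1)
        return truth_table[index]
    return lut


# Configuration bits for a full adder, addressed by (a, b, carry_in).
SUM_LUT = make_lut([0, 1, 1, 0, 1, 0, 0, 1])     # a XOR b XOR carry_in
CARRY_LUT = make_lut([0, 0, 0, 1, 0, 1, 1, 1])   # majority(a, b, carry_in)


def ripple_add(x, y, width=8):
    """Add two unsigned integers using only LUT evaluations, one bit at a time."""
    carry, result = 0, 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        result |= SUM_LUT(a, b, carry) << i
        carry = CARRY_LUT(a, b, carry)
    return result, carry


print(ripple_add(23, 42))    # sum fits in 8 bits
print(ripple_add(200, 100))  # sum overflows 8 bits, so the carry out is set
```

Under these assumptions a single 8-bit addition takes sixteen LUT evaluations plus the routing between them, whereas a coarse-grained block would provide the adder as one unit. Loading a different truth table turns the same LUT into any other 3-input function, which is the flexibility the fine-grained approach pays for.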
Reconfigurable systems which use logic blocks of larger granularity are categorized as medium-grained [6, 7, 10, 11, 17]. For example, Garp [6] is designed to perform a number of different operations on up to four 2-bit inputs. Another medium-grained structure was designed specifically to implement multipliers of a configurable bit-width [7]. The logic block used in the multiplier FPGA is capable of implementing a 4x4 multiplication. The CHESS architecture [11] also operates on 4-bit values, with each of its cells acting as a 4-bit ALU. The major advantage of medium-grained systems over fine-grained architectures is that they utilize the chip area better, since they are optimized for the specific operations. However, a drawback of this approach is the high overhead incurred when synthesizing operations which are incompatible with the simplest logic block architecture.

Coarse-grained architectures are primarily intended for the implementation of tasks dominated by word-width operations. Because the logic blocks used are optimized for large computations, they will perform these operations much more quickly (and consume less chip area) than a set of smaller cells connected to form the same type of structure. However, because their composition is static, they are unable to leverage optimizations in the size of operands. On the other hand, these coarse-grained architectures can be much more efficient than finer-grained architectures for implementing functions closer to their basic word size. An example of a coarse-grained system is the RaPiD architecture [4].

A very coarse granularity is the case when the simplest logic block is based on an entire microprocessor with or without special accelerators. Examples of such architectures are the REMARC [12] and RAW [13] architectures.

3.2 Classification with respect to reconfiguration scheme

3.2.1 Statically reconfigurable architectures

Traditional reconfigurable architectures are statically reconfigurable, which means that the reconfigurable resources are configured at the start of execution and remain unchanged for the duration of the application. In order to reconfigure a statically reconfigurable architecture, the system has to be halted while the reconfiguration is in progress and then restarted with the new configuration.

Traditional FPGA architectures have primarily been serially programmed single-context devices, allowing only one configuration to be loaded at a time. This type of FPGA is programmed using a serial stream of configuration information, requiring a full reconfiguration if any change is required.

3.2.2 Dynamically reconfigurable architectures

Dynamically reconfigurable (run-time reconfigurable) architectures allow reconfiguration and execution to proceed at the same time. The different styles of dynamic reconfiguration are depicted in Figure 1-5 and discussed in the following paragraphs.

[Figure 1-5: diagrams of the single context, multi-context and partially reconfigurable models.]

Figure 1-5. The different basic models of dynamically reconfigurable computing

Single context dynamically reconfigurable architectures

Although single context architectures can typically be reconfigured only statically, run-time reconfiguration on a single context FPGA can also be implemented. Typically, the configurations are grouped into contexts, and each context is swapped as needed. Attention has to be paid to the proper partitioning of the configurations between the contexts in order to minimize the reconfiguration delay.

Multi-context dynamically reconfigurable architectures

A multi-context architecture includes multiple memory bits for each programming bit location. These memory bits can be thought of as multiple planes of configuration information [3, 15]. Only one plane of configuration information can be active at a given moment, but the architecture can quickly switch between different planes, or contexts, of already-programmed configurations. In this manner, the multi-context architecture can be considered a multiplexed set of single-context architectures, which requires that a context be fully reprogrammed to perform any modification to the configuration data. However, this requires a great deal more area than the other structures, given that there must be as many storage units per programming location as there are contexts. This also means that multi-context schemes are mainly used in coarse-grain architectures.

Partially reconfigurable architectures

In some cases, configurations do not occupy the full reconfigurable hardware, or only a part of a configuration requires modification. In both of these situations a partial reconfiguration of the reconfigurable resources is desired, rather than the full reconfiguration supported by the serial architectures mentioned above.

In partially reconfigurable architectures, the underlying programming layer operates like a RAM device. Using addresses to specify the target location of the configuration data allows for selective reconfiguration of the reconfigurable resources. Frequently, the undisturbed portions of the reconfigurable resources may continue execution, allowing the overlap of computation with reconfiguration.

When configurations do not require the entire area available within the array, a number of different configurations may be loaded into otherwise unused areas of the hardware. Partially run-time reconfigurable architectures can allow for complete reconfiguration flexibility, such as the Xilinx 6200 [18], or may require a full column of configuration information to be reconfigured at once, as in the Xilinx Virtex FPGA [19].

4. COUPLING

The type of coupling of the Reconfigurable Processing Unit (RPU) to the computing system has a big impact on the communication cost. It can be classified into one of the four groups listed below, which are presented in order of decreasing communication costs and illustrated in Figure 1-6:

• RPUs coupled to the I/O bus of the host (Figure 1-6.a). This group includes many commercial circuit boards. Some of them are connected to the PCI bus of a PC or workstation.
• RPUs coupled to the local bus of the host (Figure 1-6.b).
• RPUs coupled like co-processors (Figure 1-6.c), such as the REMARC - Reconfigurable Multimedia Array Coprocessor [12].
• RPUs acting like an extended data-path of the processor (Figure 1-6.d), such as the OneChip [16], the PRISC - Programmable Reduced Instruction Set Computer [14], and the Chimaera [5].

Figure 1-6. Coupling of the RPU to the host computer

REFERENCES

1. Barat F, Lauwereins R (2000) Reconfigurable Instruction Set Processors: A Survey. In: Proceedings of IEEE International Workshop on Rapid System Prototyping, pp 168-173
2. Brodersen B (2002) Wireless Systems-on-a-Chip Design. In: Proceedings of 3rd International Symposium on Quality of Electronic Design, pp 221-222
3. DeHon A (1996) DPGA Utilization and Application. In: Proceedings of ACM/SIGDA International Symposium on FPGAs, pp 115-121
4. Ebeling C, Cronquist DC, Franklin P (1996) RaPiD Reconfigurable Pipelined Datapath. In: Lecture Notes in Computer Science 1142 - Field Programmable Logic: Smart Applications, New Paradigms and Compilers, Springer Verlag, pp 126-135
5. Hauck S, Fry TW, Hosler MM, Kao JP (1997) The Chimaera Reconfigurable Functional Unit. In: Proceedings of the 5th IEEE Symposium on Field Programmable Custom Computing Machines, pp 87-96
6. Hauser JR, Wawrzynek J (1997) Garp: A MIPS Processor with a Reconfigurable Coprocessor. In: Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pp 12-21
7. Haynes SD, Cheung PYK (1998) A reconfigurable multiplier array for video image processing tasks, suitable for embedding in an FPGA structure. In: Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pp 226-235
8. Hutchings BL, Wirthlin MJ (1995) Implementation approaches for reconfigurable logic applications. Brigham Young University, Dept. of Electrical and Computer Engineering
9. Khatib J (2001) Configurable Computing. Available at: http://www.geocities.com/siliconvalley/pines/6639/fpga
10. Lucent Technologies Inc (1998) FPGA Data Book, Allentown, Pennsylvania
11. Marshall A, Stansfield T, Kostarnov I, Vuillemin J, Hutchings B (1999) A Reconfigurable Arithmetic Array for Multimedia Applications. In: Proceedings of ACM/SIGDA International Symposium on FPGAs, pp 135-143
12. Miyamori T, Olukotun K (1998) A quantitative analysis of reconfigurable coprocessors for multimedia applications. In: Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pp 2-11
13. Moritz CA, Yeung D, Agarwal A (1998) Exploring optimal cost performance designs for raw microprocessors. In: Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pp 12-27
14. Razdan R, Brace K, Smith MD (1994) PRISC Software Acceleration Techniques. In: Proceedings of the IEEE International Conference on Computer Design, pp 145-149
15. Trimberger S, Carberry D, Johnson A, Wong J (1997) A Time-Multiplexed FPGA. In: Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pp 22-29
16. Witting RD, Chow P (1996) OneChip: An FPGA Processor with Reconfigurable Logic. In: Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pp 126-135
17. Xilinx Inc. (1994) The Programmable Logic Data Book
18. Xilinx Inc. (1996) XC6200: Advanced product specification v1.0. In: The Programmable Logic Data Book
19. Xilinx Inc. (1999) Virtex: Configuration Architecture Advanced Users Guide

Chapter 2

RECONFIGURABLE HARDWARE EXPLOITATION IN WIRELESS MULTIMEDIA COMMUNICATIONS

Konstantinos Masselos (1, 2) and Nikolaos S. Voros (1)
(1) INTRACOM S.A., Hellenic Telecommunications and Electronics Industry, Greece
(2) Currently with Imperial College of Science Technology and Medicine, United Kingdom

Abstract: This chapter presents cases where reconfigurable hardware can be exploited for the efficient realization of wireless multimedia communication systems. The various scenarios described refer to (a) the DLC/MAC layer and the baseband part of the physical layer of the HIPERLAN/2 and IEEE 802.11a WLAN protocols, and (b) the application layer of a sophisticated personal device. The goal of this chapter is to provide an insight into the advantages reconfigurable hardware may bring in real life applications.

Keywords: Reconfiguration, WLAN, application layer, wireless multimedia communications

N.S. Voros and K. Masselos (eds.), System Level Design of Reconfigurable Systems-on-Chips, 27-42. © 2005 Springer. Printed in the Netherlands.

1. RECONFIGURABLE HARDWARE BENEFITS FROM A SYSTEM'S PERSPECTIVE

The presence of reconfigurable hardware resources in a system can be exploited in two major directions:

• To create space for post-fabrication functional modifications, e.g. to upgrade system functionality or for software-like bug fixing. Software realizations allow post-fabrication functional modifications; however, for complex tasks software realizations might be inefficient. This feature may allow an important time-to-market improvement.
• To allow sharing of hardware resources among tasks that are not active simultaneously, thus reducing the total area cost of the system. Such tasks may belong to different modes of operation of a given system, to different applications or standards realized on the same platform or even to time non-overlapping tasks of a single system.

Given an application, tasks that are suitable for realization on reconfigurable hardware are those that may share hardware resources with other tasks over time or are likely to be modified/upgraded in the future, and that also have high computational complexity (which prevents efficient realization on instruction set processors).

In the rest of this chapter, reconfiguration scenarios from the wireless communications and multimedia domains are discussed. Real life complex systems are used for this analysis, namely the HIPERLAN/2 and IEEE 802.11a WLAN systems (covering MAC and physical layer functionality) and the MPEG system (covering the application layer).

2. RECONFIGURATION SCENARIOS FOR HIPERLAN/2 AND IEEE 802.11a WLAN SYSTEMS

In this section reconfiguration scenarios for the HIPERLAN/2 and IEEE 802.11a WLAN systems are discussed. The targeted functionalities of the two systems cover the DLC/MAC layer and the baseband part of the physical layer.

2.1 HIPERLAN/2 and IEEE 802.11a systems

HIPERLAN/2 [1] is a connection-oriented time-division multiple access (TDMA) system. The physical layer is based on a coded OFDM modulation scheme [2]. The physical layer is of multi-rate type, allowing control of the link capability between access point and mobile terminal according to interference situations and distance.

The flow graph of the HIPERLAN/2 transmitter is shown in Figure 2-1. The blocks at the inputs and outputs of the different tasks give the input and output rates of the tasks respectively. The input rate of a given task corresponds to the minimum amount of data required for the task to produce a given output (output rate). The computational complexity and the type of processing of the transmitter tasks are presented in Table 2-1.

The analysis of computational complexity is done by estimating the number of required basic operations per output data item in each function. The basic operations include arithmetic, logic and memory read/write operations. It is assumed that processing of transmitted or received data should be possible at the sustained nominal data rate of each physical layer mode. The input and output operations included in this complexity analysis correspond to data coming from previous tasks and being passed to following tasks (in a real implementation these operations are likely to correspond to accesses to data storage locations).

[Figure 2-1: flow graph with the blocks MAC/PHY interface, Tx memory, tail bits appending, data scrambler, convolutional encoder, rate independent puncturing P1, rate dependent puncturing P2, interleaver, constellation mapper, pilot insertion, IFFT, cyclic prefix insertion, preambles memory and PHY burst formation, annotated with the input and output rates of each task.]

Figure 2-1. HIPERLAN/2 transmitter

From the computational complexity analysis it can be seen that some algorithms generate a constant computational complexity in all physical layer modes. The most important is the IFFT, which dominates the overall transmit side complexity in the low bit rate modes. The complexities of the channel coding functions are naturally related to the bit rate used.
Table 2-1. Computational complexity of transmitter tasks in different physical layer modes

Computational complexity (MOPS) per PHY mode (Mb/s)

Task                             Type of processing                                             6      9     12     18     27     36     54
Scrambling                       Bit level: shift register, XOR                               108    162    216    324    486    648    972
Convolutional encoding           Bit level: shift register, XOR                               174    261    348    522    783   1044   1566
Puncturing (rate independent)    Bit level: logic operations                                 0.31   0.31   0.31   0.31   0.31   0.31   0.31
Puncturing (rate dependent)      Bit level: logic operations                                    0     33      0     66    105    132    198
Interleaving                     Group of bits: LUT accesses                                   48     48     96     96    192    192    288
Constellation mapping            Group of bits: LUT accesses                                   30     45     36     54     54     72     90
Pilot insertion                  Word level: memory accesses                                   56     56     56     56     56     56     56
IFFT                             Word level: multiplications, additions, memory accesses      922    922    922    922    922    922    922
Cyclic prefix insertion          Word level: memory accesses                                   72     72     72     72     72     72     72
Sum                                                                                          1410   1599   1746   2112   2670   3138   4164

[Figure 2-2: flow graph with the blocks timing and frequency synchronization and correction, cyclic prefix extraction, FFT, channel estimation and frequency domain equalization, constellation decoder, de-interleaver, rate dependent depuncturing, rate independent depuncturing, Viterbi decoder, descrambler and MAC/PHY interface, annotated with the input and output rates of each task.]

Figure 2-2. HIPERLAN/2 receiver
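
The relation between complexity and bit rate in Table 2-1 can be checked with a few lines of arithmetic. The sketch below copies three rows and the sum row of the table and divides by the nominal bit rate; the variable and function names are hypothetical, and treating the ratio MOPS / (Mb/s) as operations per transmitted bit is an interpretation of the table rather than something the chapter states.

```python
PHY_MODES_MBPS = [6, 9, 12, 18, 27, 36, 54]

# Rows copied from Table 2-1 (MOPS per physical layer mode).
TX_MOPS = {
    "scrambling": [108, 162, 216, 324, 486, 648, 972],
    "convolutional encoding": [174, 261, 348, 522, 783, 1044, 1566],
    "IFFT": [922] * 7,
}
TX_SUM = [1410, 1599, 1746, 2112, 2670, 3138, 4164]


def ops_per_bit(task):
    """Operations per transmitted bit: MOPS divided by the bit rate in Mb/s."""
    return [mops / rate for mops, rate in zip(TX_MOPS[task], PHY_MODES_MBPS)]


def share_of_total(task):
    """Fraction of the total transmit side complexity spent in one task."""
    return [mops / total for mops, total in zip(TX_MOPS[task], TX_SUM)]


for task in TX_MOPS:
    print(task, [round(v, 1) for v in ops_per_bit(task)])
print("IFFT share", [f"{s:.0%}" for s in share_of_total("IFFT")])
```

With these figures the bit-level tasks have a constant cost per bit, so their MOPS grow linearly with the rate, while the IFFT cost is fixed per OFDM symbol and its share of the total falls from roughly two thirds at 6 Mb/s to under a quarter at 54 Mb/s.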
The flow graph of a reference HIPERLAN/2 receiver is presented in Figure 2-2. The receiver chain of HIPERLAN/2 is left open by the standard, so there is more freedom in algorithm selection for certain blocks such as the timing and frequency synchronization and the channel estimation (different chains of tasks can be adopted for these two generic blocks). The computational complexity and the type of processing of the receiver tasks are presented in Table 2-2.

Table 2-2. Computational complexity of receiver tasks in different physical layer modes

Computational complexity (MOPS) per PHY mode (Mb/s)

Task                             Type of processing                                             6      9     12     18     27     36     54
Cyclic prefix extraction         Word level: memory accesses                                   96     96     96     96     96     96     96
Frequency error correction       Word level: multiplications, additions, memory accesses      208    208    208    208    208    208    208
FFT                              Word level: multiplications, additions, memory accesses      922    922    922    922    922    922    922
Frequency domain equalization    Word level: multiplications, additions, memory accesses      132    132    132    132    132    132    132
Constellation demapping          Group of bits: LUT accesses                                   48     48    240    240    288    288    336
Deinterleaving                   Group of bits: LUT accesses                                   48     48     96     96    192    192    288
Depuncturing (rate dependent)    Bit level: logic operations                                    0     50      0     99    118    198    297
Depuncturing (rate independent)  Bit level: logic operations                                 0.16   0.20   0.16   0.20   0.28   0.20   0.20
Viterbi decoding                 Bit level I/O, word level additions, comparisons            1170   1755   2340   3510   5265   7020  10530
Descrambling                     Bit level: shift register, XOR                               108    162    216    324    486    648    972
Sum                                                                                          2732   3421   4250   5627   7707   9704  13781
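
The totals of Tables 2-1 and 2-2 can be compared directly. The following sketch copies the sum rows and the Viterbi decoding row from the tables and computes, for each physical layer mode, the receive-to-transmit complexity ratio and the fraction of the receiver complexity spent in Viterbi decoding; the names are hypothetical and the script only restates the tabulated numbers.

```python
PHY_MODES_MBPS = [6, 9, 12, 18, 27, 36, 54]

# Sum rows of Table 2-1 (transmitter) and Table 2-2 (receiver), in MOPS.
TX_SUM = [1410, 1599, 1746, 2112, 2670, 3138, 4164]
RX_SUM = [2732, 3421, 4250, 5627, 7707, 9704, 13781]
# Viterbi decoding row of Table 2-2, in MOPS.
VITERBI = [1170, 1755, 2340, 3510, 5265, 7020, 10530]

rx_tx_ratio = [rx / tx for rx, tx in zip(RX_SUM, TX_SUM)]
viterbi_share = [v / rx for v, rx in zip(VITERBI, RX_SUM)]

for mode, ratio, share in zip(PHY_MODES_MBPS, rx_tx_ratio, viterbi_share):
    print(f"{mode:>2} Mb/s  rx/tx = {ratio:.2f}  Viterbi share = {share:.0%}")
```

On the tabulated figures the ratio grows from about 1.9 at 6 Mb/s to about 3.3 at 54 Mb/s, and Viterbi decoding accounts for between roughly 43% and 76% of the receiver total.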
As can be deduced from Table 2-2, Viterbi decoding dominates the overall complexity figures in all physical layer modes. It can also be seen that receiver-side processing is between about two and a little over three times more complex than transmitter-side processing, depending on the physical layer mode.

[Figure 2-3. IEEE 802.11a and HIPERLAN/2 preambles. Preambles shown: IEEE 802.11a preamble; HIPERLAN/2 broadcast burst preamble; HIPERLAN/2 downlink burst preamble; HIPERLAN/2 uplink burst short preamble; HIPERLAN/2 uplink burst long preamble and direct link burst preamble. Every preamble ends with a 32-sample cyclic prefix (CP) followed by two 64-sample C symbols. The downlink preamble consists of this part only; the uplink short preamble is preceded by five 16-sample symbols and the other preambles by ten 16-sample symbols.]

The baseband part of the IEEE 802.11a system [3] is very similar to that of the HIPERLAN/2 system; only some minor differences exist. IEEE 802.11a uses only one preamble sequence (shown in Figure 2-3) of 320 samples. HIPERLAN/2 uses four different types of preamble sequences for the different types of PDUs, with sizes ranging from 160 to 320 samples. The contents of the first half of the HIPERLAN/2 preamble sequences always differ from those of IEEE 802.11a. From an implementation point of view this may affect the synchronization block of the receiver.

The two systems use different sequences for the initialization of the (de)scrambler. In IEEE 802.11a the initialization is performed using the first 7 bits of the service field, which are always set to zero. In HIPERLAN/2 the initial state of the scrambler is set to a pseudo-random non-zero 7-bit state determined by the frame counter field in the BCH (the first four bits of the BCH) at the beginning of the corresponding MAC frame. The initial state is derived
by appending the first four bits of the BCH to the fixed binary number 111. This difference is small from an implementation point of view.

On the encoder side, IEEE 802.11a supports code rates 1/2, 3/4 and 2/3, while HIPERLAN/2 supports code rates 1/2, 3/4 and 9/16. Two code rates are common, while each system supports one extra code rate of its own. HIPERLAN/2 applies two puncturing stages (a rate independent one followed by a rate dependent one), while IEEE 802.11a applies a single puncturing stage. The puncturing patterns applied by the two systems to achieve the different code rates are presented in Figure 2-4 (no puncturing pattern is required for the 1/2 code rate). The difference from an implementation point of view is small.

The combinations of modulation, coding rate and achieved nominal bit rate (physical modes of operation) supported by IEEE 802.11a and HIPERLAN/2 are presented in Table 2-3. Six modes of operation are common; IEEE 802.11a supports two extra modes, while HIPERLAN/2 supports one extra mode. From an implementation point of view, the number of supported modes of operation affects the modem controller, from which the modem control words are issued.

Puncturing pattern | X | Y
HIPERLAN/2 rate independent puncturing patterns | 1111110111111 | 1111111111110
HIPERLAN/2 9/16 puncturing pattern | 111111110 | 111101111
Common 3/4 puncturing pattern | 110 | 101
IEEE 802.11a 2/3 puncturing pattern | 11 | 10

Figure 2-4. Puncturing patterns used by IEEE 802.11a and HIPERLAN/2

The MAC frame duration of HIPERLAN/2 is fixed to 2 ms. The HIPERLAN/2 MAC frame structure described in Figure 2-5 comprises time
slots for broadcast control (BCH), frame control (FCH), access feedback control (ACH) and data transmission in downlink (DL), uplink (UL) and direct link (DiL) phases, which are allocated dynamically depending on the need for transmission resources. A mobile terminal (MT) first has to request capacity from the access point (AP) in order to send data. This can be done in the random access channel (RCH), where contention for the same time slot is allowed.

Downlink, uplink and direct link phases consist of two types of PDUs. The long PDUs have a size of 54 bytes and contain control or user data. The payload is 49.5 bytes and the remaining 4.5 bytes are used for the PDU type (2 bits), a sequence number (SN, 10 bits) and a cyclic redundancy check (CRC-24). Long PDUs are referred to as the long transport channel (LCH). Short PDUs contain only control data and have a size of 9 bytes. They may contain resource requests, ARQ messages etc., and they are referred to as the short transport channel (SCH). A physical burst is composed of the PDU train payload and a preamble, and is the unit transmitted via the physical layer.

Table 2-3. Physical modes of operation of IEEE 802.11a and HIPERLAN/2

Modulation | Coding rate R | Nominal bit rate (Mbit/s) | Coded bits per OFDM symbol | Remark
BPSK | 1/2 | 6 | 48 |
BPSK | 3/4 | 9 | 48 |
QPSK | 1/2 | 12 | 96 |
QPSK | 3/4 | 18 | 96 |
16QAM | 9/16 | 27 | 192 | HL/2 only
16QAM | 1/2 | 24 | 192 | IEEE 802.11a only
16QAM | 3/4 | 36 | 192 |
64QAM | 3/4 | 54 | 288 |
64QAM | 2/3 | 48 | 288 | IEEE 802.11a only

The structure of the IEEE 802.11a PPDU frame is described in Figure 2-6. The header contains information about the length of the exchanged data and the transmission rate. The RATE field conveys information about the type of modulation and the coding rate used in the rest of the packet. The LENGTH field takes a value between 1 and 4095 and specifies the number of bytes to be exchanged (PSDU). The six tail bits are used to reset the convolutional encoder and to terminate the code trellis in the decoder. The first 7 bits of the service field are set to zero and are used to initialise the (de)scrambler. The remaining 9 bits are reserved for future use.
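The columns of Table 2-3 are tied together by a simple relation: the nominal bit rate is the number of data bits carried per OFDM symbol (coded bits times coding rate) divided by the symbol duration. The sketch below checks every row of the table under the assumption of a 4 µs OFDM symbol, a value that is not stated in the text above; the names used are illustrative.

```python
from fractions import Fraction

# OFDM symbol duration in microseconds. Assumption: 4 us, which is not
# given in the text; it is the value that reproduces Table 2-3.
SYMBOL_DURATION_US = 4

# (modulation, coding rate, coded bits per OFDM symbol, nominal Mbit/s),
# copied row by row from Table 2-3.
PHY_MODES = [
    ("BPSK", Fraction(1, 2), 48, 6),
    ("BPSK", Fraction(3, 4), 48, 9),
    ("QPSK", Fraction(1, 2), 96, 12),
    ("QPSK", Fraction(3, 4), 96, 18),
    ("16QAM", Fraction(9, 16), 192, 27),  # HIPERLAN/2 only
    ("16QAM", Fraction(1, 2), 192, 24),   # IEEE 802.11a only
    ("16QAM", Fraction(3, 4), 192, 36),
    ("64QAM", Fraction(3, 4), 288, 54),
    ("64QAM", Fraction(2, 3), 288, 48),   # IEEE 802.11a only
]


def nominal_bit_rate_mbps(coding_rate, coded_bits_per_symbol):
    """Data bits per OFDM symbol divided by the symbol duration (bits/us = Mbit/s)."""
    return coding_rate * coded_bits_per_symbol / SYMBOL_DURATION_US


if __name__ == "__main__":
    for modulation, rate, coded_bits, table_value in PHY_MODES:
        computed = nominal_bit_rate_mbps(rate, coded_bits)
        print(f"{modulation:>5} R={rate}: {computed} Mbit/s (table: {table_value})")
```

Exact fractions are used instead of floating point so that rates such as 9/16 and 2/3 compare exactly against the table values.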
The pad bits are used to ensure that the number of bits in the PPDU frame maps to an integer number of OFDM symbols. A cyclic redundancy check (CRC-32) is included in the IEEE 802.11a PSDU.

[Figure 2-5. HIPERLAN/2 MAC frame, long PDU and physical burst format. The 2 ms MAC frame consists of BCH, FCH, ACH, DL phase, DiL phase, UL phase and RCH; the phases carry long PDUs (LCH) and short PDUs (SCH). A long PDU (54 bytes) consists of PDU type (2 bits), SN (10 bits), payload (49.5 bytes) and CRC (3 bytes). A physical burst consists of a preamble followed by a PDU train.]

An important issue is that the transmission duration (TXTIME) of a PPDU frame in IEEE 802.11a is not fixed but is a function of the LENGTH field, as shown in the following equation:

$$\mathrm{TXTIME} = T_{\mathrm{PREAMBLE}} + T_{\mathrm{SIGNAL}} + T_{\mathrm{SYM}} \times \left\lceil \frac{16 + 8 \times \mathrm{LENGTH} + 6}{N_{\mathrm{DBPS}}} \right\rceil \qquad (1)$$

where N_DBPS is the number of data bits per symbol and can be derived from the DATARATE parameter. From an implementation point of view this imposes a strict timing requirement on the MAC/PHY interface: the SIGNAL symbol must be decoded in time to determine the number of OFDM symbols to be exchanged.

[Figure 2-6. IEEE 802.11a PPDU frame format. Fields: RATE (4 bits), Reserved (1 bit), LENGTH (12 bits), Parity (1 bit), Tail (6 bits), SERVICE (16 bits), PSDU, Tail (6 bits), Pad bits. The PREAMBLE (12 symbols) is followed by the SIGNAL field (one OFDM symbol, BPSK, rate 1/2) and by the DATA field (variable number of OFDM symbols, at the rate and mode indicated by RATE).]
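Equation (1) can be evaluated directly once the timing constants are fixed. The sketch below assumes the IEEE 802.11a values of 16 µs for the preamble and 4 µs each for the SIGNAL symbol and for a data symbol; these constants are not given in the text above. The N_DBPS values are the coded bits per OFDM symbol of Table 2-3 multiplied by the coding rate, and the function name is illustrative.

```python
# Timing constants in microseconds. Assumption: IEEE 802.11a values,
# which the text above does not list.
T_PREAMBLE_US = 16
T_SIGNAL_US = 4
T_SYM_US = 4

# Data bits per OFDM symbol for each IEEE 802.11a data rate (Mbit/s):
# coded bits per symbol times coding rate, following Table 2-3.
N_DBPS = {6: 24, 9: 36, 12: 48, 18: 72, 24: 96, 36: 144, 48: 192, 54: 216}


def txtime_us(length_bytes, data_rate_mbps):
    """Evaluate equation (1): duration of a PPDU frame in microseconds."""
    if not 1 <= length_bytes <= 4095:
        raise ValueError("LENGTH must be between 1 and 4095 bytes")
    # 16 SERVICE bits + 8 bits per PSDU byte + 6 tail bits.
    data_bits = 16 + 8 * length_bytes + 6
    n_dbps = N_DBPS[data_rate_mbps]
    # Integer ceiling division: number of OFDM symbols in the DATA field.
    n_symbols = -(-data_bits // n_dbps)
    return T_PREAMBLE_US + T_SIGNAL_US + T_SYM_US * n_symbols


if __name__ == "__main__":
    for length, rate in [(100, 6), (1500, 54), (4095, 6)]:
        print(f"LENGTH={length:4d} bytes at {rate:2d} Mbit/s: {txtime_us(length, rate)} us")
```

Because the number of DATA symbols is only known after LENGTH and RATE have been read from the SIGNAL symbol, a receiver cannot schedule the end of the frame before that symbol is decoded, which is the timing requirement mentioned above.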
The major differences between the IEEE 802.11a and HIPERLAN/2 systems occur in the MAC sublayer. In HIPERLAN/2 the medium access is based on a TDD/TDMA approach. The control is centralized in an AP, which informs the MTs at which point in time in the MAC frame they are allowed to transmit their data. IEEE 802.11a uses a distributed MAC protocol based on Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA).

2.2 WLAN Reconfiguration scenarios

Some reconfiguration scenarios for the MAC and baseband parts of the HIPERLAN/2 and IEEE 802.11a WLAN systems are described in this section.

HIPERLAN/2 and IEEE 802.11a baseband processing algorithms are quite simple as far as control flow is concerned, and their functionality does not depend in principle on the physical layer mode that is used in transmission or reception. The computational complexity of baseband processing, however, depends strongly on the physical layer mode used in transmission or reception.

[Figure 2-7. Realization on a highly flexible platform. Algorithm: complex task 1 to complex task N. Architecture: ISP, reconfigurable hardware, distributed shared memory, interconnect network, I/O.]

The most computationally complex tasks are Viterbi decoding and the FFT on the receiver side and the IFFT on the transmitter side. Assuming a highly flexible implementation using instruction set processors (ISP) and reconfigurable hardware (alongside interconnect, memory, I/Os etc.), these tasks should be assigned to reconfigurable hardware (for increased speed and reduced power). This scenario is illustrated in Figure 2-7. However, almost no flexibility is required for these tasks on a stand-alone basis (no different candidate implementation choices exist). If ASIC blocks were included in the target implementation platform, these tasks should preferably be moved to them.
Reconfigurable hardware resources can be shared among baseband processing tasks that are not active simultaneously. This may lead to silicon area optimization (taking into consideration reconfiguration related overheads). This scenario is described in Figure 2-8. For example, under a half-duplex scenario the transmitter and the receiver will not be active simultaneously. In this case, tasks of the transmitter and the receiver may share the same reconfigurable hardware resources.

[Figure 2-8. Reconfigurable hardware sharing among tasks with non-overlapping lifetimes. Algorithm: group of tasks with non-overlapping lifetimes. Architecture: ISP, reconfigurable hardware, dedicated hardware, distributed shared memory, interconnect network, I/O.]

[Figure 2-9. Realization of different algorithmic instances of the same task on reconfigurable hardware. Algorithm: task instance 1 to task instance N. Architecture: ISP, reconfigurable hardware, dedicated hardware, distributed shared memory, interconnect network, I/O.]

Certain tasks in the receiver chain of the baseband processing allow different algorithmic implementations with different trade-offs between algorithmic performance and computational complexity (e.g. channel estimation). Lower algorithmic performance requirements (e.g. SNR, BER) may allow the use of less sophisticated and less computationally complex algorithmic instances, leading to improved implementation efficiency (speed,
power). Furthermore, realization of different algorithmic instances for the same task in a given system may be beneficial, e.g. allowing adaptation to different operating conditions. Such tasks are good candidates for implementation on reconfigurable hardware (with their different instances sharing the same reconfigurable hardware resources) if their complexity is high (preventing efficient realization on instruction set processors). This scenario is described in Figure 2-9.

[Figure 2-10. Post shipment modification scenario. Algorithm: task 1 to task N, each a candidate for post-fabrication modification. Architecture: ISP, reconfigurable hardware, dedicated hardware, distributed shared memory, interconnect network, I/O.]

[Figure 2-11. Multi-standard realization scenario. Algorithm: tasks of standard 1 and standard 2. Architecture: ISP, reconfigurable hardware, dedicated hardware, distributed shared memory, interconnect network, I/O.]

Another opportunity for reconfigurable hardware exploitation is post-shipment modification or enhancement of the system's functionality (e.g. with more sophisticated realizations of certain tasks). Baseband processing tasks that are candidates for being upgraded are those that are left open by the standard. This scenario is described in Figure 2-10.

More opportunities for reconfiguration and reconfigurable hardware sharing exist when multiple standards are realized on the same reconfigurable implementation platform. This scenario is described in Figure 2-11. Let us assume a HIPERLAN/2 / IEEE 802.11a dual standard
realization with the two systems not being active simultaneously. Given that the major differences between the two standards are in the MAC layers, reconfigurable hardware can be used for the realization of the most complex and performance demanding parts of the MAC layers (and the MAC to baseband interfaces) of the two systems.

3. RECONFIGURATION SCENARIOS AT THE APPLICATION LAYER

As portable devices become more powerful, it also becomes possible to run more computationally intensive services on these appliances. Due to the increasing flexibility requirements imposed by these applications, the devices need to be highly adaptable to the running applications. On the other hand, efficient realizations of these applications are required, especially in the resources they use during deployment, where power consumption must be traded against the perceived quality of the application.

To be able to realize a variety of applications or services, the implementation platform needs to be highly adaptable. Assume a wireless communication terminal as shown in Figure 2-12, which consists of instruction set processors (ISP) and reconfigurable hardware that are connected to a common interconnect network and to memory. This device is powerful enough to support various applications, including video. Because of the high computational demand of such a video application, it will be run on the reconfigurable hardware (see Figure 2-12), as that part can be configured for optimal performance for a given application.

When the user decides to view the video in a small window and to start up a 3D game, the situation changes. The video application can then be run with far fewer resources, while the game becomes the most computationally intensive application. This means that the 3D game will need to be run on the reconfigurable hardware. To enable that, the video application is moved to software and continues on an instruction set processor (ISP). The hardware is then reconfigured for the 3D game and that application is started (see Figure 2-13). Moving the video application to software and running it in a smaller window also implies that a lower data rate can be used on the wireless terminal interconnect. This means that the wireless appliance should signal to the server that a lower resolution (and thus a lower bit-rate) is allowed for the video application. The application quality as perceived by the user is still satisfactory.

[Figure 2-12. A video application is running on the reconfigurable hardware]

[Figure 2-13. A 3D application is running on the reconfigurable hardware, while the video application continues in a reduced window and on a software processor]

From the application scenario above, it is clear that it must be possible to run many different applications on the reconfigurable hardware. This means that general reconfigurable hardware is needed, in contrast to incorporating dedicated hardware blocks such as an FFT processor, a FIR filter etc. We also notice that applications are very different in nature, as already described in the case of video streaming and interactive 3D applications. The selection of the reconfiguration characteristics is also based on general characteristics of multimedia applications and on the usage scenario above.

Requirements on reconfiguration time are modest: because reconfiguration is user-initiated, fast reconfiguration times (< 1 msec) are not needed. When, for example, a video application is switched from hardware to software, it is not important that a number of frames are not decoded. As soon as the application is running in software, it decodes the next incoming frame.

Requirements on the reconfiguration granularity are complicated by the unknown nature of the application: the granularity should be fine enough that an optimal implementation in reconfigurable hardware is possible for each application. However, due to power requirements, word level coarse grain reconfiguration is more appropriate than bit-level reconfiguration. This is especially the case when the word-lengths are matched to the application at hand.

Table 2-4. Operational power requirements for MPEG2 video decoding (MPEG-2 MP@ML decoder)

Function | MOPS | Input | Output
Bitstream parsing and VLD | 12 | 4 | 40
Dequantization and IDCT | 105 | 40 | 70
Motion compensation | 273 | 70 | 70
YUV to RGB color conversion | 299 | 70 | 35
Total | 689 | 184 | 215

Table 2-5. Operational power requirements for a 3D application

Quality | CPU time | #triangles | #pixels | Architecture
31 dB | 40 ms | 5000 | 5% | SW
31 dB | 2 ms | 5000 | 5% | HW
25 dB | 70 ms | 5000 | 19% | SW
30 dB | 80 ms | 8000 | 19% | SW
43 dB | 118 ms | 17500 | 19% | SW
43 dB | 21 ms | 17500 | 19% | HW

To summarize the requirements on applications: not only must different applications be able to run on the wireless LAN platform, but they can also have huge computational demands for which dedicated or reconfigurable hardware is needed. For an indication of the required operational power we refer to the literature [4, 5], the results of which are summarized in Table 2-4 for MPEG2 and in Table 2-5 for a 3D application. In the latter application the CPU time, and thus the frame rate, is closely related to the required quality (application QoS) but also depends on the architecture, be it a hardware or a software realization.

REFERENCES

1. ETSI (2000) Broadband Radio Access Networks (BRAN); HIPERLAN type 2; Physical (PHY) layer, v1.2.1
2. Van Nee R, Prasad R (1999) OFDM for Mobile Multimedia Communications. Boston: Artech House
3. IEEE Std 802.11a/D7.0 (1999) Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) specifications: High Speed Physical Layer in the 5 GHz Band
4. Zhou CG, Kabir I, Kohn L, Jabbi A, Rice D, Hu XP (1995) MPEG video decoding with the UltraSPARC visual instruction set. In: Proceedings of the 40th IEEE Computer Society International Conference, pp. 470-477
5. Lafruit G, Nachtergaele L, Denolf K, Bormans J (2000) 3D Computational Graceful Degradation. In: Proceedings of ISCAS Workshop and Exhibition on MPEG-4, vol. 3, pp. 547-550
Chapter 3

RECONFIGURABLE HARDWARE TECHNOLOGIES

Konstantinos Masselos (1,2) and Nikolaos S. Voros (1)
(1) INTRACOM S.A., Hellenic Telecommunications and Electronics Industry, Greece
(2) Currently with Imperial College of Science Technology and Medicine, United Kingdom

Abstract: A large number of reconfigurable hardware technologies have been proposed both in academia and commercially (some of them in their first market steps). They can be roughly classified in three major categories: a) Field Programmable Gate Arrays (FPGAs), b) integrated circuit devices with embedded reconfigurable resources and c) embedded reconfigurable cores for Systems-on-Chip (SoCs). In this chapter representative commercial technologies are discussed and their main features are presented (note 1).

Keywords: Field Programmable Gate Arrays (FPGAs), embedded reconfigurable cores, fine grain reconfigurable architecture, coarse grain reconfigurable architecture

Note 1: The information included in this chapter is up-to-date until November 2004.

N.S. Voros and K. Masselos (eds.), System Level Design of Reconfigurable Systems-on-Chips, 43-83. © 2005 Springer. Printed in the Netherlands.

1. FIELD PROGRAMMABLE GATE ARRAYS (FPGAS)

Field programmable gate arrays currently represent the most popular and mature segment of reconfigurable hardware technologies. Technology advances keep increasing the gate counts and memory densities of FPGAs, while they also allow the integration of functions ranging from hardwired multipliers through high speed transceivers and all the way up to (hard or soft) CPU cores with associated peripherals. These advances make it possible to realize complete systems on a single FPGA chip, improving end-system size, power consumption, performance, reliability and cost. Equally
important, FPGAs can be reconfigured in seconds, either statically or dynamically/partially. Reconfiguration can take place in the workstation, in the assembly line or even at the end user premises. These capabilities provide flexibility:
• to react to last minute design changes
• to prototype ideas before implementation
• to meet time-to-market deadlines
• to correct errors and upgrade functions once the end system is in users' hands
• or even to implement reconfigurable computing, i.e. using a fixed number of logic gates to time-division-multiplex multiple functions.

Because of all these advantages, FPGAs have been making significant inroads into ASIC territory. Whether FPGAs can replace ASICs depends on how far the per-gate cost decreases and the number of gates per device increases.

Mapping of applications on FPGAs has been based on the VHDL and Verilog languages for input descriptions. C based approaches are also currently under consideration. The integration of CPUs on FPGAs introduced design flows and tools supporting hardware/software codesign and software development.

A number of companies build FPGAs, including Actel, Altera, Atmel, Lattice Semiconductor, Quicklogic and Xilinx, with Xilinx and Altera currently being the market leaders. In order to differentiate, FPGA vendors have introduced devices that address different intersections of performance, power, integration and cost targets. Some representative FPGA devices are briefly discussed in the following subsections.

1.1 ALTERA Stratix II

Altera claims that Stratix II devices [11] are the industry's fastest and highest density FPGAs. Stratix II devices extend the possibilities of FPGA design, allowing designers to meet the high-performance requirements of today's advanced systems and avoid developing with costly ASICs.

1.1.1 Architecture

The Stratix II architecture has been designed primarily to optimize performance, but also logic density in a given silicon area. Its logic structure is constructed with Altera's new adaptive logic modules (ALMs). The Stratix II architecture significantly reduces the logic resources required to implement any given function and the number of logic levels in a given critical path. The architecture accomplishes this by permitting inputs to be
shared by adjacent look-up tables in the same ALM. Multiple, independent functions can also be packed into a single ALM, further reducing interconnect delays and logic resource requirements. The structure of a Stratix II ALM is shown in Figure 3-1.

Stratix II FPGAs utilize the TriMatrix memory structure. TriMatrix memory includes the 512-bit M512 blocks, the 4-Kbit M4K blocks, and the 512-Kbit M-RAM blocks, each of which can be configured to support a wide range of features. Each embedded RAM block in the TriMatrix memory structure targets a different class of applications: the M512 blocks can be used for small functions such as first-in first-out (FIFO) applications, the M4K blocks can be used to store incoming data from multi-channel I/O protocols, and the M-RAM blocks can be used for storage-intensive applications such as Internet protocol packet buffering or program/data memory for an on-chip Nios embedded processor. All memory blocks include extra parity bits for error control, embedded shift register functionality, mixed-width mode, and mixed-clock mode support. Additionally, the M4K and M-RAM blocks support true dual-port mode and byte masking for advanced write operations.

[Figure 3-1. Stratix II adaptive logic module structure]

Stratix II DSP blocks are optimized to implement processing intensive functions such as filtering, transforms, and modulation. Capable of running at 370 MHz, Stratix II DSP blocks provide maximum DSP throughput (up to 284 GMACs) that is orders of magnitude higher than leading-edge digital signal processors available today. Each DSP block can support a variety of multiplier bit sizes (9x9, 18x18, 36x36) and operation modes (multiplication, complex multiplication, multiply-accumulate and multiply-add) and can generate DSP throughput of 3.0 GMACs per DSP block. In addition, rounding and saturation support has been added to the DSP block.
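The throughput figures quoted above follow from simple arithmetic: multiply-accumulate operations per clock cycle, times clock frequency, times number of blocks. The sketch below reproduces them under two assumptions that are not stated in the text: eight 9x9 multiply-accumulates per DSP block per cycle, and 96 DSP blocks in the largest device. Both values are chosen because they reproduce the quoted 3.0 and 284 GMACs; the function name is illustrative.

```python
def dsp_throughput_gmacs(clock_mhz, macs_per_block_per_cycle, n_blocks=1):
    """Peak multiply-accumulate throughput in GMACs (1e9 MAC operations per second)."""
    # MHz * MACs per cycle gives millions of MACs per second; divide by 1000 for GMACs.
    return clock_mhz * macs_per_block_per_cycle * n_blocks / 1000


# Assumptions (not from the text): 8 MACs per block per cycle in 9x9 mode,
# and 96 DSP blocks in the largest device.
ASSUMED_MACS_PER_BLOCK = 8
ASSUMED_BLOCKS = 96

if __name__ == "__main__":
    per_block = dsp_throughput_gmacs(370, ASSUMED_MACS_PER_BLOCK)
    total = dsp_throughput_gmacs(370, ASSUMED_MACS_PER_BLOCK, ASSUMED_BLOCKS)
    print(f"per block: {per_block:.2f} GMACs, device: {total:.2f} GMACs")
```

Such peak numbers assume that every multiplier is busy in every cycle at the maximum clock rate, so they are an upper bound rather than a prediction for a particular design.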
Stratix II FPGAs support many high-speed I/O standards and high-speed interfaces such as 10 Gigabit Ethernet (XSBI), SFI-4, SPI-4.2, HyperTransport™, RapidIO™, and UTOPIA Level 4 interfaces at up to 1 Gbps. These allow interfacing with anything from backplanes, host processors, buses and memory devices to 3D graphics controllers.

Stratix II devices support internal clock frequency rates of up to 500 MHz and typical design performance at over 250 MHz. Logic densities of Stratix II devices range from 15,600 to 179,400 equivalent logic elements. Total memory densities can be up to 9 Mbits of RAM, which can be clocked at a 370 MHz maximum clock speed. Stratix II FPGAs may include up to 12 PLLs and up to 48 system clocks per device.

1.1.2 Granularity

The Stratix II architecture is a fine grain architecture with embedded hardwired word level modules.

1.1.3 Technology

Stratix II FPGAs are manufactured on 300-mm wafers using TSMC's 90-nm, 1.2-V, all-layer copper SRAM, low-k dielectric process technology.

1.1.4 Reconfiguration

Stratix II devices are configured at system power-up with data stored in an Altera configuration device or provided by an external controller. The Stratix II device's optimized interface allows microprocessors to configure it serially or in parallel, and synchronously or asynchronously. The interface also enables microprocessors to treat Stratix II devices as memory and configure them by writing to a virtual memory location, making reconfiguration easy. Remote system upgrades can be transmitted through any communications network to Stratix II devices.

1.1.5 Other issues

• Nios embedded processors allow designers to integrate embedded processors on Stratix II devices for complete system-on-a-programmable-chip (SOPC) designs. The Nios soft embedded processor has been optimized for the advanced architectural features of the Stratix II device family.
• The Stratix II family enables design security through non-volatile, 128-bit AES design encryption technology for preventing intellectual property theft.
• A seamless, cost-reduction migration path to low-cost HardCopy structured ASICs exists for Stratix II devices.

1.1.6 Design flow

The design flow for Stratix II FPGAs is based on the Quartus II software for high-density FPGAs, which provides a comprehensive suite of synthesis, optimization, and verification tools in a single, unified design environment. Quartus II includes an integrated development environment for Nios II embedded processors. Using the SOPC Builder design tool in the Quartus II software, designers select from a wide array of IP components (including memory, interface, control, and user-created functions), customize them for the particular application, and connect them, automatically generating hardware, software, and simulation models for the custom implementation.

1.1.7 Application area

Stratix II FPGAs are very flexible, allowing realization of different applications. Due to their high memory density, Stratix II devices are an ideal choice for memory intensive applications. Using DSP blocks, Stratix II FPGAs can easily meet the DSP throughput requirements of emerging standards and protocols such as JPEG2000, MPEG-4, 802.11x, code-division multiple access 2000 (CDMA2000), HSDPA and W-CDMA.

1.2 ALTERA Cyclone II

Cyclone II FPGAs [3] have been designed from the ground up for the lowest cost. The Cyclone II FPGA family offers a customer-defined feature set, high performance and low power consumption combined with high density. Altera claims that Cyclone II FPGAs offer the lowest cost per logic element among all commercially available devices and thus can support complex digital systems on a single chip at a cost that rivals that of ASICs.

1.2.1 Architecture

Cyclone II devices contain a two-dimensional row- and column-based architecture to implement custom logic. Column and row interconnects of varying speeds provide signal interconnects between logic array blocks (LABs), embedded memory blocks and embedded multipliers. The logic
array consists of LABs, with 16 logic elements (LEs) in each LAB. A logic element (LE) is a small unit of logic providing efficient implementation of user logic functions. LABs are grouped into rows and columns across the device.

The smallest unit of logic in the Cyclone II architecture, the LE, is compact and provides advanced features with efficient logic utilization. Each LE features:
• a four-input look-up table (LUT), which is a function generator that can implement any function of four variables,
• a programmable register,
• a carry chain connection,
• a register chain connection,
• and the ability to drive all types of interconnects.

Each LE operates either in normal or in arithmetic mode (each one using LE resources differently). The architecture of the LE is shown in Figure 3-2.

[Figure 3-2. Cyclone II logic element structure]

The Cyclone II embedded memory consists of columns of M4K memory blocks. The M4K memory blocks include input registers that synchronize writes and output registers to pipeline designs and improve system performance. Each M4K block can implement various types of memory with or without parity, including true dual-port, simple dual-port, and single-port
RAM, ROM, and first-in first-out (FIFO) buffers. Each M4K block has a size of 4,608 RAM bits.

Cyclone II devices have up to 150 embedded multiplier blocks optimized for multiplier-intensive digital signal processing (DSP) functions. Designers can use the embedded multiplier either as one 18-bit multiplier or as two independent 9-bit multipliers. Embedded multipliers can operate at up to 250 MHz (for the fastest speed grade) for 18×18 and 9×9 multiplications when using both input and output registers. Each Cyclone II device has one to three columns of embedded multipliers that efficiently implement multiplication functions.

Cyclone II devices support differential and single-ended I/O standards, including LVDS at data rates up to 805 megabits per second (Mbps) for the receiver and 622 Mbps for the transmitter, and 64-bit, 66-MHz PCI and PCI-X for interfacing with processors and ASSP and ASIC devices.

Cyclone II devices range in density from 4,608 to 68,416 LEs. Cyclone II devices offer between 119 and 1,152 Kbits of embedded memory with a maximum clock speed of 250 MHz. Cyclone II devices provide a global clock network and up to four phase locked loops (PLLs). The global clock network consists of up to 16 global clock lines that drive throughout the entire device.

1.2.2 Granularity

The Cyclone II architecture is a fine grain architecture with embedded hardwired word level modules.

1.2.3 Technology

Cyclone II devices are manufactured on 300-mm wafers using TSMC's 90-nm, 1.2-V, all-layer copper SRAM, low-k dielectric process technology, the same proven process used with Altera's Stratix II devices.

1.2.4 Reconfiguration

Cyclone II FPGAs are statically reconfigurable. Cyclone II devices are configured at system power-up with data stored in an Altera configuration device or provided by a system controller. Serial configuration allows configuration times of 100 ms. After a Cyclone II device has been configured, it can be reconfigured in-circuit by resetting the device and loading new configuration data.
1.2.5 Other issues

The Cyclone II FPGA family is fully supported by Altera's recently introduced Nios II family of soft processors. A Nios II design in a Cyclone II FPGA offers more than 100 DMIPS performance. With a Nios II processor, a designer can build a complete system on a programmable chip (SOPC) on any Cyclone II device, providing new alternatives to low- and mid-density ASICs.

1.2.6 Design flow

All Cyclone II devices are supported by the no-cost Quartus II Web Edition software. Quartus II software provides a comprehensive suite of synthesis, optimization and verification tools in a single, unified design environment. Designers can select from a large portfolio of intellectual property (IP) cores and download Altera's unique OpenCore Plus version of the chosen core(s). The Quartus II software is used to integrate and evaluate the cores in Cyclone II devices. Quartus II includes an integrated development environment for Nios II embedded processors.

1.2.7 Application area

Cyclone II FPGAs are ideal for cost sensitive applications.

1.3 Xilinx Virtex-4

The Virtex-4 family [12] is the newest generation FPGA from Xilinx. Virtex-4 FPGAs include three families (platforms): LX, FX and SX. Choice and feature combinations are offered for all complex applications. The basic Virtex-4 building blocks are an enhancement of those found in the popular Virtex devices, allowing upward compatibility of existing designs. Combining a wide variety of flexible features, the Virtex-4 family enhances programmable logic design capabilities and is a powerful alternative to ASIC technology.

1.3.1 Architecture

The configurable logic block (CLB) resource of Xilinx Virtex-4 is made up of four slices. Each slice is equivalent and contains: two function generators, two storage elements, arithmetic logic gates, large multiplexers, a fast carry look-ahead chain and a horizontal cascade chain. The function generators are configurable as 4-input look-up tables (LUTs). Two slices in a
CLB can have their LUTs configured as 16-bit shift registers, or as 16-bit distributed RAM. In addition, the two storage elements are either edge-triggered D-type flip-flops or level sensitive latches. Each CLB has internal fast interconnect and connects to a switch matrix to access general routing resources.

The general routing matrix (GRM) provides an array of routing switches between each component. Each programmable element is tied to a switch matrix, allowing multiple connections to the general routing matrix. The overall programmable interconnection is hierarchical and designed to support high-speed designs. All programmable elements, including the routing resources, are controlled by values stored in static memory cells. These values are loaded in the memory cells during configuration and can be reloaded to change the functions of the programmable elements.

The block RAM resources are 18 Kbit true dual-port RAM blocks, programmable from 16K x 1 to 512 x 36, in various depth and width configurations. Each port is totally synchronous and independent, offering three "read-during-write" modes. Block RAM is cascadable to implement large embedded storage blocks. Additionally, back-end pipeline registers, clock control circuitry, built-in FIFO support and byte write enable are new features supported in the Virtex-4 FPGA.

The XtremeDSP slices contain a dedicated 18 x 18-bit 2's complement signed multiplier, adder logic and a 48-bit accumulator. Each multiplier or accumulator can be used independently. These blocks are designed to implement extremely efficient and high-speed DSP applications.

Most popular and leading-edge I/O standards (both single ended and differential) are supported by programmable I/O blocks (IOBs). In larger devices a 10-bit, 200 kSPS analog-to-digital converter is included in the built-in system monitor block. Additionally, FX devices support integrated hardwired high-speed serial transceivers that enable data rates up to 11.1 Gb/s per channel, and 10/100/1000 Ethernet media-access control (EMAC) cores.

Virtex-4 FX devices support one or two hardwired IBM PowerPC 405 RISC CPUs (up to 450 MHz) with the auxiliary processor unit interface, which allows optimized FPGA based coprocessor connection. The PowerPC 405 CPU is based on a 32-bit Harvard architecture with a five-stage execution pipeline supporting a CoreConnect bus architecture. Instruction and data L1 caches of 16 KB each are integrated.

Virtex-4 devices achieve clock rates of 500 MHz. Virtex-4 devices have logic densities of up to 200000 logic cells. Memory densities of up to 9935 kbits for block RAM and up to 1392 kbits of distributed RAM are supported. Up to 512 DSP slices may be included, leading to a DSP performance of 256 GMACs.

1.3.2 Granularity

The Virtex-4 architecture is a fine grain architecture with embedded hardwired word level modules and complete PowerPC CPUs.

1.3.3 Technology

Virtex-4 devices are produced on a state-of-the-art 90 nm triple oxide (for low power consumption) copper process, using 300 mm (12 inch) wafer technology. The core voltage of the devices is 1.2 V.

1.3.4 Reconfiguration

Virtex-4 FPGAs are dynamically (partially) reconfigurable devices.

1.3.5 Other issues

Optional 256-bit AES decryption is supported on-chip (with software bitstream encryption), providing intellectual property security.

1.3.6 Design flow

The Xilinx ISE development system is used to map applications on the logic part of Virtex-4 devices. Advanced verification and real-time debugging is offered by the ChipScope Pro tools. More than 200 pre-verified IP cores are available for Virtex-4 devices. The EDK PowerPC development kit is used for the realization of functionality on the PowerPC CPUs.

1.3.7 Application area

Virtex-4 LX FPGAs are suitable for high-performance logic applications. Virtex-4 FX devices are well suited as a high-performance, full-featured solution for embedded platform applications. Virtex-4 SX devices are a good solution for high-performance Digital Signal Processing (DSP) applications.

1.4 Xilinx Spartan-3

The Spartan-3 family of Field-Programmable Gate Arrays [10] is specifically designed to meet the needs of high volume, cost-sensitive consumer electronic applications. The Spartan-3 family builds on the success
of the earlier Spartan-IIE family by increasing the amount of resources, the use of the state-of-the-art Virtex-II technology and the advanced process technology.

1.4.1 Architecture
Each Configurable Logic Block (CLB) comprises four interconnected slices, as shown in Figure 3-3. These slices are grouped in pairs. Each pair is organized as a column with an independent carry chain. All four slices have the following elements in common: two logic function generators, two storage elements, wide-function multiplexers, carry logic, and arithmetic gates. Both the left-hand and right-hand slice pairs use these elements to provide logic, arithmetic, and ROM functions. Besides these, the left-hand pair supports two additional functions: storing data using Distributed RAM and shifting data with 16-bit registers. The RAM-based function generator (Look-Up Table) is the main resource for implementing logic functions.

Figure 3-3. Spartan-3 CLB structure

Spartan-3 devices support block RAM, which is organized as configurable, synchronous 18 Kbit blocks. Block RAM efficiently stores relatively large amounts of data. The aspect ratio, i.e., width vs. depth, of each block RAM is configurable. Furthermore, multiple blocks can be cascaded to create still wider and/or deeper memories. The blocks of RAM are equally distributed in 1 to 4 columns.
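The effect of a configurable aspect ratio and of cascading can be illustrated with a small sketch that counts how many 18 Kbit blocks a memory of a given depth and width would occupy. The list of aspect ratios below is an assumption made for illustration: only the end points 16K x 1 and 512 x 36 are quoted in this chapter (for Virtex-4), and the function name is hypothetical.

```python
import math

# Assumed (depth, width) organisations of one 18 Kbit block. Only the end
# points 16K x 1 and 512 x 36 are quoted in the chapter; the intermediate
# entries are an assumption made for this illustration.
ASPECT_RATIOS = [(16384, 1), (8192, 2), (4096, 4), (2048, 9), (1024, 18), (512, 36)]

def blocks_needed(depth, width):
    """Return (block_count, (block_depth, block_width)) for the cheapest tiling.

    Blocks are cascaded in depth and in width, so the count for one aspect
    ratio is ceil(depth / block_depth) * ceil(width / block_width).
    """
    best = None
    for block_depth, block_width in ASPECT_RATIOS:
        count = math.ceil(depth / block_depth) * math.ceil(width / block_width)
        if best is None or count < best[0]:
            best = (count, (block_depth, block_width))
    return best

# A 4096-word, 32-bit memory and a 512-word, 36-bit memory.
print(blocks_needed(4096, 32))
print(blocks_needed(512, 36))
```

Several aspect ratios can give the same block count; the sketch simply reports the first one found, whereas a real mapping tool would also weigh routing and timing.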
There are four kinds of interconnect: Long lines, Hex lines, Double lines, and Direct lines. Long lines connect to one out of every six CLBs; hex lines connect to one out of every three CLBs; double lines connect to every other CLB. Direct lines afford any CLB direct access to neighboring CLBs.

Spartan-3 devices provide embedded multipliers that accept two 18-bit words as inputs to produce a 36-bit product. The input buses to the multiplier accept data in two's-complement form (either 18-bit signed or 17-bit unsigned). One such multiplier is matched to each block RAM on the die. The close physical proximity of the two ensures efficient data handling. Cascading multipliers permits multiplicands more than three in number as well as wider than 18 bits. Two multiplier versions are possible: one asynchronous and one with registered output.

Spartan-3 devices have logic densities of up to 74880 logic cells (corresponding to 5 million system gates). A system clock rate of up to 326 MHz is supported. Memory densities range from 72 to 1872 kbits of block RAM and 12 to 520 kbits of distributed RAM. The number of hardwired multipliers can be up to 104. Spartan devices include up to 784 I/O pins with 622 Mb/s data transfer rate per I/O. Seventeen single-ended signal standards and seven differential signal standards including LVDS are supported.

1.4.2 Granularity
Spartan-3 architecture is a fine grain architecture with embedded hardwired word level modules.

1.4.3 Technology
Spartan-3 FPGAs are manufactured on a 90 nm process technology. Three power rails are included in the devices: for core (1.2 V), I/Os (1.2 V to 3.3 V) and auxiliary purposes (2.5 V).

1.4.4 Reconfiguration
Spartan-3 FPGAs are dynamically (partially) reconfigurable devices.

1.4.5 Other issues
Spartan-3 devices allow integration of the MicroBlaze soft processor, PCI, and other cores.

1.4.6 Design flow
Implementation of applications on Spartan-3 devices is fully supported by Xilinx ISE development system, which includes tools for synthesis, mapping, placement and routing. The EDK MicroBlaze development kit is used for the realization of functionality on MicroBlaze cores.

1.4.7 Application area
Because of their low cost, Spartan-3 FPGAs are ideally suited to a wide range of consumer electronics applications, including broadband access, home networking, display/projection and digital television equipment.

2. INTEGRATED CIRCUIT DEVICES WITH EMBEDDED RECONFIGURABLE RESOURCES
Integrated circuits with embedded reconfigurable resources represent an alternative to FPGA ICs. These architectures are in principle based on a combination of a programmable CPU and a reconfigurable array of word level (coarse grain) data path units. Such devices mainly target DSP applications and are competitors of conventional DSP instruction set processors as well. The technology is less mature than FPGAs; however, it promises important advantages over FPGAs such as power and silicon area efficiency. The major issue is the efficient compilation on the coarse grain reconfigurable resources.

2.1 ATMEL Field Programmable System Level Integrated Circuits (FPSLICs)
Atmel's AT94 Series of Field Programmable System-Level Integrated Circuits (FPSLICs) [2] are combinations of the Atmel AT40K SRAM FPGAs and the Atmel AVR 8-bit RISC microcontroller with standard peripherals.

2.1.1 Architecture
The architecture of the AT94K family is shown in Figure 3-4. The embedded AVR core is based on an enhanced, C code optimized, RISC architecture that combines a rich instruction set (more than 120 instructions) with 32 general-purpose working registers. All 32 registers are directly connected to
the ALU, allowing two independent registers to be accessed in one single instruction executed in one cycle. AVR includes the full complement of peripherals such as SPI, UART, timer/counters and a hardware multiplier. SRAM delivers one-cycle operation at up to 40 MHz, which translates into about 30 MIPS for the AVR's pipelined RISC design. For flexibility, the 36 KB of dynamically allocated AVR SRAM can be partitioned between x16 program store and x8 data RAM. For example, one setup might dedicate 20 and 16 KB for program and data respectively, another 32 and 4 KB.

Figure 3-4. Atmel FPSLIC AT94K Architecture

The AVR core and FPGA connection is based on a simple approach that treats the FPGA much like another onboard 8-bit peripheral. There is an address decoder for generating up to 16 pseudo chip selects into the FPGA and, going the other way, 16 interrupt lines that are fed from the FPGA into the AVR. The MCU has access to the FPGA's eight global clocks and can drive two of them relying on its own combination of internal and external oscillators, clock dividers, timer/counters and so on.

The FPGA core is based on a high-performance DSP optimized cell. FPSLIC devices include 5,000 to 40,000 gates of SRAM-based AT40K FPGA and 2-18.4 Kbits of distributed single/dual port FPGA user SRAM.
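The program/data partitioning of the 36 KB AVR SRAM described above can be made concrete with a short calculation. This is a minimal sketch under stated assumptions: 1 KB is taken as 1024 bytes, any whole-KB split is treated as legal (the actual device may restrict the permitted partitions), and the function name is hypothetical.

```python
TOTAL_SRAM_KB = 36  # total dynamically allocated AVR SRAM quoted in the text

def partition_sram(program_kb):
    """Split the SRAM into x16 program words and x8 data bytes.

    Assumption: 1 KB = 1024 bytes and any whole-KB split is allowed; the
    real device may only support specific partitions.
    """
    if not 0 <= program_kb <= TOTAL_SRAM_KB:
        raise ValueError("program store must fit in the 36 KB SRAM")
    data_kb = TOTAL_SRAM_KB - program_kb
    program_words = program_kb * 1024 // 2  # x16 store: two bytes per word
    data_bytes = data_kb * 1024             # x8 store: one byte per location
    return program_words, data_bytes

# The two example setups mentioned in the text: 20/16 KB and 32/4 KB.
for program_kb in (20, 32):
    print(program_kb, partition_sram(program_kb))
```

The point of the example is that, because the program store is 16 bits wide, a 20 KB program partition holds 10K instruction words rather than 20K.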

2.1.2 Granularity
The architecture of AT94 devices represents fine-grained architecture as far as programmable logic is concerned.

2.1.3 Technology
FPSLIC devices are fabricated on a high-performance, low-power, 3.0 V to 3.6 V, 0.35 µm CMOS five-layer metal process.

2.1.4 Reconfiguration
The AT40K SRAM FPGA family is capable of implementing Cache Logic (dynamic full/partial logic reconfiguration, without loss of data, on-the-fly) for building adaptive logic and systems. As new logic functions are required, they can be loaded into the logic cache without losing the data already there or disrupting the operation of the rest of the chip, replacing or complementing the active logic.

Figure 3-5. System Designer design flow

2.1.5 Design flow
Atmel provides the System Designer tool suite (see Figure 3-5) that coordinates microcontroller and FPGA development with source-level debug and full hardware visibility. For implementation, the package includes place-and-route, floor planning, macro generators and bitstream utilities.

2.1.6 Application area
Atmel's AT94K series FPSLIC device provides the logic, processing, control, memory and I/O functions required for low-power, high-performance applications including among others: PDA and cell phone after-market products, GPS, portable test equipment, point-of-sale and security or wireless Internet appliances.

2.2 QuickSilver ADAPT2000 Adaptive Computing Machine System IC Platform
QuickSilver Technology Adapt2000 system platform [1], based on adaptive computing technology, attempts to integrate the silicon capability of ASIC, DSP, FPGA and microprocessor technologies within a single IC, an Adaptive Computing Machine (ACM). Adapt2000 platform aims at achieving custom-silicon capability designed in software (in weeks or months instead of years) with faster time to market, reduced development costs and the ability for designers to focus on innovating and developing IP. The Adapt2000 ACM system platform comprises the Adapt2400 ACM architecture, the InSpire Node Control Kernel and the InSpire SDK tool set.

2.2.1 Architecture
Adapt2400 architecture consists of two major types of components: Nodes and Matrix Interconnect Network (MIN). A generic view of Adapt2400 architecture is shown in Figure 3-6. Nodes are the computing resources in the ACM architecture that perform the processing tasks. Nodes are heterogeneous by design, each being optimized for a given class of problems. Each node is self-contained with its own controller, memory, and computational resources. As such, a node is capable of independently executing algorithms that are downloaded in the form of binary files. Nodes are constructed of three basic components: the Node Wrapper, Nodal Memory and the Algorithmic Engine. The Node Wrapper has two major functions: a) to provide a common interface to the MIN for the heterogeneous Algorithmic Engines and b) to make available a common set of services associated with inter-node communication and task management. Each node is nominally equipped with 16 kilobytes of nodal memory organized as four 1k x 32 bit blocks. When building an ACM, memories can be adjusted in size, larger or smaller, to optimize cost or increase the flexibility of a specific node. Each heterogeneous node type is distinguished by its Algorithmic Engine. The computational resources of each node type are closely matched and optimized to satisfy a finite range of algorithms.

Figure 3-6. Generic view of Adapt2400 architecture

There are three classes of nodes in adaptive computing:
• Adaptive nodes support the heavy algorithmic elements that require complex control. They have a high degree of programmability and computational unit adaptability.
• Domain nodes are designed for the really complex pieces of the algorithms. Domain Nodes perform at speeds comparable to pure ASIC designs. Their control mechanisms are finite state machines.
• Programmable nodes are designed to support large code bases that do not demand much processing power.
Designers are also able to build their own fully customized Algorithmic Engines and memory structures, and place them inside the Node Wrapper.

The Matrix Interconnect Network (MIN) ties the heterogeneous nodes together, and carries data, configuration binary files, and control information between ACM nodes, as well as between nodes and the outside world. This network is hierarchical in structure, providing high bandwidth between adjacent nodes for close coupling of related algorithms, while facilitating
easy scaling of the ACM at low silicon overhead. Each connection between blocks within the MIN structure simultaneously supports 32 bits of data payload in each direction. Data within the MIN is transported in single 32-bit word packets, with addressing carried separately. Each 32-bit transfer within the MIN can be routed to any other node or external interface, with the MIN bandwidth fully shared between all the nodes in the system.

An Adapt2400 ACM has a built-in System Controller connected to the MIN Root. The System Controller is responsible for the management of tasks within an ACM. In this role, the System Controller sets up the individual Node Hardware Task Managers (HTMs), and once set up, the HTMs are given control of the tasks on the node without the need for intervention by the System Controller to perform a task swap.

2.2.2 Granularity
Adapt2400 architecture is a (task level) coarse grain architecture.

2.2.3 Technology
ADAPT2000 platform instances have been realized on 0.13 µm technologies.

2.2.4 Reconfiguration
Adapt2400 ACM architecture dynamically reconfigures during operation. ACM nodes are configured/programmed using a binary file (SilverWare), which is much smaller than that of a typical FPGA configuration file, and is comparable to the program size of a DSP or RISC processor. The smaller binary file size, combined with hardware specifically designed to adapt on the fly, allows the function of a node to change in as little as a few clock cycles.

2.2.5 Design flow
The InSpire SDK Tool Set by QuickSilver is a complete development system for the Adapt2400 ACM Architecture that provides a unified design environment that enables realization of an ACM within a single IC. The InSpire SDK comprises the SilverC development language (ANSI-C derivative), module linker, assembler for each node type and the InSpire Simulation Platform, including the ACM Verification SwitchBoard. The latter provides multi-mode verification of ACM designs using any combination of the C Virtual Node (CVN), InSpire Simulation Platform,
InSpire Emulator, and an actual ACM device. The InSpire SDK is completely software-based and supports all phases of development, from high-level system simulation to compiled binaries running on an emulator or target IC. Its Adapt2400 SilverStream Design Flow enables developers to freely express system functionality without the need to consider hardware partitioning, task threading, or memory allocation. The InSpire SDK also enables engineers to create custom Adapt2400 architecture cores in simulation and assemble new nodal combinations for exploring a wide variety of ACM hardware configurations.

Figure 3-7. ACM design flow

The development flow for the Adapt2400 ACM Architecture is based on the use of a dataflow model of the system under development. In this methodology the system is represented in a series of top-down dataflow models that use successive refinement techniques to build up to a final hardware implementation. The ACM SilverStream Design Flow supports the task-based “execute when ready” asynchronous nature of the Adapt2400 ACM Architecture without requiring expert hardware knowledge on the part of the developer. The design flow consists of up to six steps as shown in Figure 3-7:
• The first step consists of: (a) modeling the data flow of the system under development by using SilverC to define tasks, and pipes between the tasks, (b) assigning a cycle budget to each task and (c) simulating the data throughput of the system.
• The second step is to define the function of each task using ANSI-C, and then to verify the behavioral integrity of the system using C Virtual Nodes (CVN).
• The third step is node type and node instance assignment.
• The fourth step is hardware optimization with node verification using the node-type compilers or assemblers, and the appropriate node simulators. Step four provides an I/O accurate model of the system operation. Each node can be simulated using the ACM Verification SwitchBoard. This module in the InSpire Simulation Platform allows developers to model the hardware system as CVNs on the InSpire Adapt2400 Platform Emulator, InSpire Development Board, or a target device. Any of these models can be used in combination or individually at any time.
• The fifth step is run-time optimization, which consists of assignment of multiple tasks to nodes. The InSpire Simulation Platform and Performance Analyzer are used to determine which tasks can be assigned to the same node without affecting system operation. In this step, performance and hardware-size trade-offs can easily be made and analyzed to provide the best fit for system requirements.
• The sixth step is final system simulation and verification using the InSpire Simulation Platform to ensure overall system compliance with design specifications. The final system models contain SystemC APIs for inclusion into ESL modeling environments.

2.2.6 Application area
QuickSilver claims that ACM-enabled devices provide high performance, small silicon area, low power consumption, low cost and architecture flexibility and scalability, the ideal attributes for handheld, mobile and wireless products that span multiple generations. They particularly target signal and image processing applications.

2.3 IPflex DAPDNA-2 processor
The DAPDNA Dynamically Reconfigurable Processor [4] developed by IPFlex Inc. aims at providing “hardware performance” while maintaining “software flexibility”.

2.3.1 Architecture
The DAPDNA-2 dynamically reconfigurable processor is a dual-core processor, comprised of IPFlex's own DAP high-performance RISC core, paired with the DNA two-dimensional processing matrix. The DAPDNA-2 processor can operate at 166 MHz. The DAP RISC core (32 bit with 8 kbytes data cache and 8 kbytes instruction cache) controls the processor's dynamic reconfiguration, while portions of an application that require high-speed processing are handled by the DNA matrix, which provides both parallel and pipelined processing. The DNA matrix is an array of 376 Processing Elements (PE) comprised of computation units, memory, synchronizers, and counters. The total RAM of the DNA array is 576 kbytes. The DNA matrix circuitry can be reconfigured freely into the structure that is optimal for meeting the needs of the application in demand. One foreground and three background banks are available on-chip to store different configurations. Additional banks can be loaded from external memory on demand. The architecture of the DAPDNA-2 processor is shown in Figure 3-8.

Figure 3-8. DAPDNA-2 processor architecture

Large on-chip memory reduces the need to access off-chip memory, a process that often becomes a performance bottleneck. This feature allows the DNA to provide the maximum possible parallel processing performance. Since the memory is distributed throughout the processing array, there is plenty of available memory bandwidth.
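The arrangement of one foreground and three background configuration banks, with further configurations fetched from external memory on demand, can be modelled with a small sketch. Everything here is hypothetical and for illustration only: the class and method names, the configuration names, and in particular the policy of evicting the oldest background bank, which the text does not specify.

```python
from collections import deque

class ConfigBanks:
    """Toy model of one foreground bank plus three on-chip background banks."""

    def __init__(self, foreground, background=()):
        self.foreground = foreground
        # Three background banks; appending to a full deque drops the oldest
        # entry (the eviction policy is an assumption, not from the text).
        self.background = deque(background, maxlen=3)
        self.external_loads = 0

    def activate(self, name):
        """Make `name` the active configuration and report how it was reached."""
        if name == self.foreground:
            return "already active"
        if name in self.background:
            # Resident on-chip: swap it with the current foreground bank.
            self.background.remove(name)
            self.background.append(self.foreground)
            self.foreground = name
            return "bank switch"
        # Not resident: fetch from external memory, then make it active.
        self.external_loads += 1
        self.background.append(self.foreground)
        self.foreground = name
        return "external load"

banks = ConfigBanks("fir", ["fft", "viterbi", "dct"])
print(banks.activate("fft"))   # resident in a background bank
print(banks.activate("aes"))   # must be fetched from external memory
```

The sketch only distinguishes the two cases that matter for performance: switching to a resident configuration, and paying for an external load when the requested configuration is not on-chip.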
The DAPDNA-2 has six channels of DNA Direct I/O, which provides the interface for transferring data directly onto or out of the DNA matrix. Each channel of DNA Direct I/O is 32-bit wide and operates at the maximum DAPDNA-2 system clock frequency of 166 MHz. The DNA Direct I/O can also be used to communicate directly with external devices, bringing data in for processing on the DNA matrix, bypassing the Bus Switch and memory interface.

2.3.2 Granularity
The DNA matrix architecture is a coarse grain reconfigurable architecture.

2.3.3 Technology
The DAPDNA-2 processor comes in a 156-pin FCBGA package. The power supply for the core is 1.2 V while for the I/Os it is 2.5 V.

2.3.4 Reconfiguration
The DAPDNA processor is dynamically reconfigurable and can change its hardware configuration in one clock cycle according to the application on demand.

2.3.5 Design flow
The integrated development environment for the DAPDNA dynamically reconfigurable processor is designed around the concept of “Software to Silicon”. The Software to Silicon concept means that even someone who doesn't know how to design hardware can develop a product by designing an application using a high-level language, and having that application seamlessly implemented as hardware. The DAPDNA processor series is provided with the DAPDNA-FW II Integrated Development Environment, a full-featured tool set covering everything from algorithm design to validation of an application running on the actual hardware. DAPDNA-FW II provides compilers for algorithms written in MATLAB/Simulink and C with dataflow extension. The DAPDNA-FW II environment supports three different design methodologies, giving the designer the flexibility to choose the most appropriate design method. The first option is to use the Data Flow C (DFC) Compiler. In this case it is possible to use the C programming language to directly create code for the dynamically reconfigurable processor. In a
development process built around the DFC compiler, the designer can create code directly using the C programming language, which reduces the development time. The second option is to use the DNA Blockset, which allows algorithm design and verification using MATLAB and Simulink (from The MathWorks Inc). DNA Blockset enables a seamless design flow from algorithm design to implementation in the DAPDNA-2 processor, all within the MATLAB/Simulink environment. The third option is the DNA designer, which is a GUI-based development environment allowing the designer to drag-and-drop representations of the DAPDNA Processing Elements (PEs), supporting graphical construction of processing algorithms.

2.3.6 Application area
IPflex claims that the DAPDNA-2 is the world's first general-purpose dynamically reconfigurable processor. It is suitable for applications that demand high performance and support for a wide range of processing tasks. It also provides a solution that is optimal for today's marketplace, with its demand for short-run, mixed-model production. Target applications include industrial performance image processing (for factory automation, inspection systems), broadcast and medical equipment, high precision high speed image processing (multi-function peripherals, laser printers etc), base stations (cellular, PHS, etc), accelerators for image processing, data processing and technical computation, security equipment, encryption accelerators and software defined radio.

2.4 Motorola MRC6011 Reconfigurable fabric device
The MRC6011 device is the first reconfigurable compute fabric (RCF) device from Freescale Semiconductor [7]. It is a highly integrated system on a chip (SoC) that combines six reconfigurable compute fabric (RCF) cores into a homogeneous compute node. The programmable MRC6011 device aims at offering system-level flexibility and scalability similar to a programmable DSP while achieving the cost, power consumption and processing capability of a traditional ASIC-based approach.

2.4.1 Architecture
The MRC6011 RCF cores are accessible in two scalable modules, each containing three RCF cores, via two multiplexed data input (MDI) interfaces and two slave I/O Interfaces. Each MDI interface can communicate with up to 12 channels (antennas for example), and each RC controller can manipulate the data from two channels. The data processed by the RCF cores goes either to one of the two slave I/O bus interfaces (compatible with industry-wide DSPs) or to another core within the same module or the adjacent module. External interfaces include the MDI interfaces and slave I/O bus interfaces (supporting DSP bootstrapping) operating at up to 100 MHz, and a JTAG port for real-time debugging. The architecture of the MRC6011 device is shown in Figure 3-9.

Figure 3-9. Architecture of MRC6011 device

Each RCF core includes an optimized 32-bit RISC processor (allowing efficient C code compilation) with instruction (4 kbytes) and data caches (4 kbytes). The reconfigurable computing (RC) array includes 16 reconfigurable processing units with 16 bit data paths including a pipelined MAC unit. The RCF core also includes a two-channel input buffer (8 kbytes), a large frame buffer (40 kbytes) with eight address generation units (AGUs), a special-purpose complex correlation unit that supports spreading, complex scrambling and complex correlation on 8-bit and 4-bit samples, and a single and burst transfer DMA controller. At 250 MHz, the six-core MRC6011 device delivers a peak performance of 24.0 Giga complex correlations per second with a sample resolution of 8 bits for I and Q inputs each, or even 48.0 Giga complex correlations per second at 4-bit resolution.
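The peak figures quoted above can be cross-checked against the core count and clock rate with a short calculation. The constants are taken from the text; the function name is hypothetical, and reading the result as one correlation per processing unit per cycle is an interpretation, not a statement from the source.

```python
CORES = 6                          # RCF cores per MRC6011 device (from the text)
CLOCK_HZ = 250_000_000             # core clock, 250 MHz (from the text)
PEAK_8BIT = 24_000_000_000         # complex correlations per second, 8-bit I/Q
PEAK_4BIT = 48_000_000_000         # complex correlations per second, 4-bit I/Q

def correlations_per_core_cycle(peak, cores=CORES, clock_hz=CLOCK_HZ):
    """Peak complex correlations completed by one core in one clock cycle."""
    return peak / (cores * clock_hz)

print(correlations_per_core_cycle(PEAK_8BIT))
print(correlations_per_core_cycle(PEAK_4BIT))
```

The 8-bit figure works out to 16 correlations per core per cycle, which matches the 16 reconfigurable processing units in each RC array; the 4-bit figure is exactly twice that, which would be consistent with two 4-bit correlations sharing one data path, although the text does not say so explicitly.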

2.4.2 Granularity
The architecture of the MRC6011 is a coarse grain architecture based on the word level reconfigurable data paths of the RC arrays.

2.4.3 Technology
MRC6011 devices are manufactured on a 0.13 µm process technology. The internal logic voltage is 1.2 V while the input/output voltage is 3.3 V. The core maximum operating frequency is 250 MHz while the maximum operating frequency for all off-core buses is 100 MHz.

2.4.4 Reconfiguration
MRC6011 is a dynamically reconfigurable multi-context device.

2.4.5 Design flow
Design flow for MRC6011 is based on C and assembly programming. The CodeWarrior Development Studio for Freescale RCF Baseband Signal Processors is a complete development environment for Freescale Reconfigurable Compute Fabric (RCF) based devices. The CodeWarrior Development Studio is a complete code development studio and includes: a) the Project Manager that provides anything required for configuring and managing complex projects, b) the Editor and Code Navigation System that allows creation and modification of source code and c) the graphical level debuggers. CodeWarrior Development Studio, in concert with the PowerTAP Pro hardware target interface, provides a multi-core debugging environment that allows for quick single stepping as well as fast downloads of very large target files. In case of multiple MRC6011 products, it is possible to connect the JTAG connections in a way that allows talking to any of the MRC6011s through a single PowerTAP device. Since PowerTAP has Ethernet as its connection method to CodeWarrior, debugging can be done remotely, and a single resource can be shared among several engineers. Functional testing effort can be minimized through utilization of CodeWarrior Development Studio's full scripting capability.

2.4.6 Application area
Highly flexible and programmable, the MRC6011 processor provides an efficient solution for computationally intensive applications, such as
wideband code division multiple access (WCDMA), CDMA2000 and TD-SCDMA baseband processing, including chip rate, symbol rate and advanced 3G functions such as adaptive antenna (AA) and multi-user detection (MUD).

2.5 picoChip PC102 picoArray processor
The PC102 is the 2nd generation of the picoArray highly parallel processing architecture developed by picoChip [9]. picoChip's PC102 picoArray processor is a signal processing device optimised for next generation wireless infrastructure. The solution can be described as a “Software System on Chip” (SSoC): fast enough to replace FPGAs or ASICs but with the flexibility and ease of programming of a processor. PC102 picoArray processor offers scalability allowing extremely large systems to be built by connecting dozens of processors.

2.5.1 Architecture
The architecture emphasises ease of design/verification and deterministic performance for embedded signal processing, especially wireless. The picoArray combines hundreds of array elements, each with a versatile 16 bit RISC processor (3 way LIW with Harvard architecture) with local data and program memory connected by a high-speed interconnect fabric. The architecture is heterogeneous with four types of element optimised for different tasks such as DSP or wireless specific functions. As well as the standard array elements, others handle control functions, memory intensive and DSP-oriented operations. Multiple array elements can be programmed together as a group to perform particular functions ranging from fast processing such as filters and correlators, through to the most complex control tasks.

Within the picoArray core, array elements are organised in a two dimensional grid, and communicate over a network of 32 bit buses (the picoBus) and programmable bus switches. Array elements are connected to the picoBus by ports. The ports act as nodes on the picoBus and provide a simple interface to the bus based on put and get instructions in the instruction set. The inter-processor communication protocol is based on a time division multiplexing (TDM) scheme, where data transfers between processor ports occur during time slots, scheduled in software, and controlled using the bus switches. The bus switch programming and the scheduling of data transfers are fixed at compile time.

Around the picoArray core are system interface peripherals including a host interface and an SRAM interface. Four high speed I/O interfaces connect to external systems or link picoArray devices together to build scalable systems. The basic concept of picoArray architecture is shown in Figure 3-10.

Figure 3-10. Basic concept of picoArray architecture

PC102 picoArray has huge processing resources for compute intensive data path. It also has enormous amounts of general-purpose MIPS to handle the ever more complex control operations. The PC102 uses 348 array elements running at 160 MHz, and with peak use can handle over 197,100 million instructions per second (MIPS), 147,800 million operations per second (MOPS) or 38,400 million multiply accumulate (MMAC) instructions per second, over 10 times the performance of other programmable solutions.

The microprocessor interface is used to configure the PC102 device and to transfer data to and from the PC102 device using either a register transfer method or a DMA mechanism. The interface has a number of ports mapped into the external microprocessor memory area. Two ports are connected to the configuration bus within the PC102 and the others are connected to the picoBus. These enable the external microprocessor to communicate with the array elements using signals. Alternatively, the PC102 can self-configure (or boot) in standalone mode from a supported memory.

2.5.2 Granularity
PC102 processor's picoArray architecture is a (CPU level) coarse grain reconfigurable architecture based on 16 bit CPUs.
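The compile-time TDM scheduling of the picoBus described in Section 2.5.1 can be illustrated with a minimal sketch. It assumes a single shared bus with a repeating frame of time slots and one slot per transfer per frame; the function names and the array element names are hypothetical, and the real picoBus schedule (many buses, programmable switches) is considerably richer.

```python
def build_schedule(transfers, frame_slots):
    """Assign each (source, destination) transfer a fixed slot in a repeating frame.

    This stands in for the compile-time step: the slot table never changes
    while the design is running.
    """
    if len(transfers) > frame_slots:
        raise ValueError("not enough time slots for all transfers")
    return {slot: transfer for slot, transfer in enumerate(transfers)}

def bus_owner(schedule, frame_slots, cycle):
    """Return the transfer that owns the bus in a given cycle, or None if idle."""
    return schedule.get(cycle % frame_slots)

# Two transfers sharing a four-slot frame (element names are made up).
schedule = build_schedule([("ae0", "ae5"), ("ae1", "ae2")], frame_slots=4)
for cycle in range(6):
    print(cycle, bus_owner(schedule, 4, cycle))
```

Because the slot table is fixed before the design runs, the cycle in which each transfer occurs is known in advance, which is the basis of the deterministic performance the architecture emphasises.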

2.5.3 Reconfiguration
The picoArray architecture is totally programmable and can be configured at run time (single context device).

2.5.4 Technology
PC102 devices have been manufactured on a 0.13 µm process technology. High performance flip chip BGA packages have been used for packaging. The core voltage is 1.2 V while the input/output voltage is 2.5 V.

2.5.5 Design flow
picoChip's picoTools is a fully-integrated homogeneous (over the whole system) development environment for the picoArray which includes C compiler, assembler, debugger and cycle-accurate simulator, in which system performance is guaranteed by design (with complete predictability). picoChip also supplies a Library of Example Designs and a range of Development platforms. The developer defines the structure and relationships between processes, completely specifying signal flows and timings. The individual processors are then programmed in standard C or assembler as blocks to be embedded within the structure. The entire design (structure, data-path and control) is debugged at the source level. This allows engineers to work on the whole system in an integrated way, rather than having to debug different technologies separately. The programming of the array is completely automatic, and the designer is abstracted from these implementation details. The output is a hardware configuration file containing the design and the timing information to run in the simulation. This creates a seamless “closed loop” flow from the simulator to the development kit through to the system. The picoChip architecture is extremely scalable, and applications can be run across multiple linked devices. The tools allow large designs to be simulated, placed and verified as easily as small ones. The architecture gives high levels of confidence in using multiple pre-verified blocks in a series of static software architectures that can be implemented at different times on the same hardware to give a truly reconfigurable system.

2.5.6 Application area
The PC102 is a communications processor, optimized for high capacity wireless digital signal processing applications. The device enables all layer 1 (physical layer) signal processing and layer 1 control to be implemented in
software. The device is able to run any wireless protocols including WCDMA (FDD and TDD), cdma2000 and TD-SCDMA, or emerging standards such as 802.16 (WiMAX).

2.6 Leopard Logic Gladiator Configurable Logic Device
The Gladiator configurable logic device (CLD) [6] family represents the only digital logic device that combines Field Programmable Gate Array (FPGA) technology with hardwired Application Specific Integrated Circuit (ASIC) logic. Gladiator CLD aims at achieving much lower NRE charges than ASICs in combination with dramatically lower unit cost than complex FPGAs. In its first steps Leopard Logic provided embedded FPGA IP cores for ASIC/SoC and foundry suppliers but industry's interest with respect to this approach was limited. Then Leopard Logic reinvented itself as a silicon supplier.

2.6.1 Architecture
The architecture of Gladiator CLD is shown in Figure 3-11. The basic building blocks of Gladiator CLD are the HyperBlox FP (Field Programmable) and the MP (Mask Programmable) fabrics, which are combined with optimized memories, Multiply-Accumulate units (MACs) and flexible high-speed I/Os. Gladiator CLD is available in densities ranging from 1.6M up to 25M system gates with up to 10 Mbits of embedded memory. It supports system speeds up to 500 MHz. Gladiator CLD includes high speed MAC units for fast arithmetic and DSP, up to 16 PLL controlled clock domains with frequency synthesis and division and up to 16 DLLs for phase shifting to support interface timing adjustment. Gladiator CLD offers flexible I/O options and supports several general purpose I/O standards. Gladiator CLD also supports DDR/QDR.
722Chapter3Figure3-11.ArchitectureofGladiatorCLD2.6.2GranularityThearchitectureofGladiatorCLDrepresentsafinegrainarchitecture.2.6.3TechnologyTheHyperBloxFPfabricisbasedonLeopardLogic’sproprietaryHyperRouteFPGAtechnologythatutilizestheindustrysfirstfully’hierarchical,multiplexer-based,point-to-pointinterconnect.Thistechnologyenablessuperiorspeed,utilization,predictabilityandreliabilitycomparedtolegacyFPGAarchitectures.TheHyperBloxMPfabricusesthesamelogiccorecellarchitectureasHyperBloxFPbutreplacestheSRAMconfigurationwithasingle-layervia-maskconfiguration,calledHyperVia.Thistechnologyprovidessignificantlyhigherdensity,aswellasincreasedperformanceandlowerpower.2.6.4ReconfigurationTheGladiatorCLDisstaticallyfield-upgradeablethroughembeddedSRAM-basedFPGA. 3.ReconfigurableHardwareTechnologies732.6.5DesignflowTheGladiatorCLDdesignflowisbasedonleadingindustrystandarddesigntoolsandflowscombinedwithLeopardLogicshighlyoptimized’ToolBloxbackendtools.PartitioningbetweentheHyperBloxMPandFPsectionsofthedeviceisdoneintuitively.FixedandstableblocksofthedesignaremappedintotheHyperBloxMPfabric,whilehigh-riskblocksthatarestillinfluxaremappedintotheFPfabric.DesignsarequicklyandeasilysynthesizedfromRTLintoaCLDdevice.Fulltimingclosureisachievedbasedonaccuratetimingextractionperformedbytheuser.BitstreamsfortheFPGAsectionsofthedevicearegeneratedautomaticallyandcanbedownloadedintothedeviceinstantly.Partitioningbetweenhard(MP)andsoft(FP)functionsisasnapwiththeToolBloxdesignflowandtheunifiedhardwarearchitectureallowstheallocationofdesignblocksevenpost-synthesis.Startingfrompre-processedwafers,userscanimplementsubstantialamountsofhighspeedlogicinthemask-programmable(MP)sectionofthedevice.AftersendingthegeneratedconfigurationdatatoLeopardLogic,firstsamplesaredeliveredwithinweeks.Thisprocessisreferredtoas“marketizationbecauseittransformsthegenericdeviceintoauseror”marketsegmentspecificdevice.Duetominimummaskandprocessingrequirements,theNon-RecurringEngineering(NRE)costsforthisprocessareanorderofm
agnitude lower than for a traditional cell-based ASIC. The marketized devices can be further customized and differentiated by programming the HyperBlox FP fabric. Like any other SRAM-based FPGA, this fabric allows for an unlimited number of reconfigurations by simply downloading a new bitstream into the device, thus offering optimal in-field programmability.

2.6.6 Application area
Gladiator Configurable Logic Device is suitable for areas that today use a combination of Application Specific Standard Product (ASSP)/ASIC with standalone FPGAs such as networking (edge, access, aggregation, framers, communications controllers, backplane interfaces), storage (bridges, controllers, interfaces, glue logic) and wireless (DSP acceleration, chip rate processing, smart antenna, bridges, backplanes, glue logic). Across all markets, Gladiator is an ideal fit for the fast and cost-effective implementation of flexible format converters, protocol bridges, bus interfaces and glue logic functions.

3. EMBEDDED RECONFIGURABLE CORES
As the System-on-Chip (SoC) world began to develop at the end of the 1990s, it was recognised that, to make the devices more useful, some form of programmable fabric would be needed. ASIC developers also considered embedded reconfigurable logic as one way to bring some form of field programmability to an otherwise dedicated product. The industry responded in an enthusiastic fashion and a number of reconfigurable hardware cores that can be embedded in SoCs/ASICs have been proposed since the late 1990s. Two major architectures have been mainly considered: embedded FPGAs (fine grain) and reconfigurable arrays of word level data paths (coarse grain). Despite the initial enthusiasm, several of these attempts failed commercially (Adaptive Silicon disappeared while Actel stopped their embedded FPGA technology activities). The major reasons were the high silicon area (it could require half the chip area to put a decent amount of programmable logic on it) and the power overheads of embedded FPGAs, and the immature compilation techniques for the coarse grain reconfigurable arrays. In October 2004, during the EDA Tech Forum in San Jose, it was projected that by the first quarter of 2005 two embedded FPGA cores for ASICs/SoCs would be put on the market, one by a combination of IBM and Xilinx and the other by STMicroelectronics. The major reason that could lead these attempts to commercial success is the use of 90 nm technologies.

3.1 Morpho Technologies MS1 Reconfigurable DSP cores
Morpho Technologies reconfigurable DSP (rDSP) cores MS1-16 and MS1-64 [8] aim at providing hardware flexibility in implementing multiple applications, minimized levels of obsolescence, and low power consumption while lowering hardware costs. The cores are available as is, or may be custom designed and/or quickly integrated into any SoC, to fit the needs of the customer and application(s).

3.1.1 Architecture
The MS1 family of rDSPs consists of fully autonomous IP (soft, firm or hard) cores that function as co-processors to a host processor in a system. The MS1 rDSP architecture consists of a 32-bit RISC with 5 pipeline stages and built-in direct-mapped data and instruction cache, an RC Array with 8 to 64 Reconfigurable Cells (each having an ALU, MAC and optional complex correlator unit), Context memory with 32 to 512 context planes, a Frame Buffer of up to 2048 Kbytes in size, and three optional blocks specific to 3G-WCDMA base station applications, namely a Sequence Generator, an Interleaver and an IQ Buffer (16 bytes to 4 Kbytes per antenna). A multi-master 128-bit DMA bus controller supporting burst transfers with both synchronous and asynchronous memory interfaces is also included in the MS1 architecture. The architecture of the RC array is shown in Figure 3-12.

Figure 3-12. Architecture of Reconfigurable Cells array

3.1.2 Granularity
The reconfigurable cells (RC) array of Morpho Technologies rDSP cores is a reconfigurable array of coarse grain data paths.

3.1.3 Technology
Evaluation devices are available in 0.18 µm and 0.13 µm process technologies with core voltages at 1.8 V/1.2 V and 3.3 V digital I/O voltage.

3.1.4 Reconfiguration
Morpho Technologies reconfigurable DSP (rDSP) cores are dynamically reconfigurable and can adapt on the fly to realize different applications. Switching from one application specific set of instructions to another is done in a single clock cycle.
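The relation between the context memory of Section 3.1.1 and the single-cycle switch described above can be illustrated with a small behavioural sketch. This is not Morpho Technologies code: the class, the method names and the load cost of 100 cycles are assumptions made only for illustration. The only facts taken from the text are the range of 32 to 512 context planes and the one-cycle switch between already loaded contexts.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ContextMemory:
    """Toy model of a multi-plane context memory (all names hypothetical)."""
    num_planes: int                       # the text quotes 32 to 512 planes for MS1
    planes: Dict[int, str] = field(default_factory=dict)
    active: Optional[int] = None          # index of the currently active plane
    cycles: int = 0                       # running cycle count of this toy model

    def load(self, plane: int, config: str, load_cycles: int) -> None:
        """Store a configuration in a plane; load_cycles is an assumed cost."""
        if not 0 <= plane < self.num_planes:
            raise IndexError(f"plane {plane} outside 0..{self.num_planes - 1}")
        self.planes[plane] = config
        self.cycles += load_cycles

    def switch(self, plane: int) -> str:
        """Activate an already loaded plane; modelled as one clock cycle."""
        if plane not in self.planes:
            raise KeyError(f"plane {plane} has not been loaded")
        self.active = plane
        self.cycles += 1
        return self.planes[plane]


cm = ContextMemory(num_planes=32)
cm.load(0, "fir_filter", load_cycles=100)   # assumed load cost, not from the text
cm.load(1, "correlator", load_cycles=100)
cm.switch(0)
cm.switch(1)
print(cm.active, cm.cycles)
```

The sketch separates the (assumed) cost of filling a plane from the cost of switching to it, which is the property that makes a multi-context store attractive: the expensive load can happen ahead of time while the switch itself stays cheap.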
3.1.5 Design flow
The MS1 rDSP cores and associated evaluation devices are accompanied by a complete tool chain that includes software development tools such as a compiler and translator, a simulator and a debug tool. Morpho Technologies developed an extension to the C programming language called "MorphoC", allowing for fast and simple programming of the MS1 rDSP cores. MorphoC is designed to describe the Single Instruction Multiple Data (SIMD) execution model of the MS1 rDSP architecture. MorphoTrans reads the MorphoC program and kernel library mapping information and generates a standard C program that is recognizable by the compiler (gcc). The output of MorphoTrans is compiled and linked with the kernel library object files to generate an executable file. The outcome of this process may be executed in the MorphoSim software simulator and debugged by the debugger (gdb). In addition, the same executable code can also be run on the MS1 development board. MorphoSim provides an environment for behavioral simulation of the MS1 rDSP cores. To bring the latest wired, wireless and imaging standards into production applications, the debugger is used in conjunction with MorphoSim to debug application programs that use kernels from the extensive Morpho Technologies list or from customer specific kernel libraries.

3.1.6 Application area
Morpho Technologies reconfigurable DSP cores are capable of implementing the baseband processing of air interfaces such as WCDMA in addition to source processing such as MPEG4 and vocoders. In general, Morpho Technologies reconfigurable DSP cores are suitable for signal processing based products including communications equipment for wireless and wireline terminals and infrastructure, home entertainment and computer graphics/image processing.

3.2 PACT XPP IP cores
A PACT XPP processor or coprocessor [13] can be integrated in a System-on-Chip (SoC) and can be designed from a small set of macro blocks of which the largest is in the range of 90 k gates. The homogeneous architecture of XPP allows synthesizing each of the blocks separately and, in the second step, arranging the synthesized blocks hierarchically to the final array.
3.2.1 Architecture
An array of configurable processing elements is the heart of the XPP. The array is built from a very small number of different processing elements (PEs). ALU-PEs perform the basic computations. RAM-PEs are used for storage of data. The I/O elements connect the internal elements to external RAMs or data ports. The configuration manager loads programs onto the array. The architecture of the array is shown in Figure 3-13.

The ALU is a two input two output ALU providing typical DSP functions such as multiplication, addition, comparison, sort, shift and boolean. All operations are performed within one clock cycle. The ALU can be utilized for addition, barrel shift and normalization tasks. The Forward Register is a specialized ALU that provides data stream control such as multiplexing and swapping. It always introduces one cycle of pipeline delay. The Communication Network allows point to point and point to multipoint connections from outputs to inputs of other elements. Up to 8 data channels are available for each horizontal direction. Switches at the end of the lines can connect the channel to the channel of the neighboring element.

Figure 3-13. Architecture of XPP's array of configurable processing elements

The RAM Elements are arranged at the edges of the array and are nearly identical to the ALU PEs; however, the ALU is replaced by a memory. The dual ported RAM has two separate ports for independent read and write operations. The RAM can be configured to FIFO mode (no address inputs needed) or RAM mode with 9 or more address inputs. The IP model allows the storage capacity to be defined. Typical values range from 512 to 2k words.
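The two RAM-PE modes and the link between address inputs and capacity (9 address inputs give 2^9 = 512 words, 11 give 2k words) can be sketched as follows. This is a minimal behavioural model written for this text, not PACT's IP model; the class name, method names and error handling are assumptions.

```python
from collections import deque


class RamPE:
    """Toy RAM-PE: FIFO mode needs no addresses, RAM mode is addressed."""

    def __init__(self, address_bits: int = 9, mode: str = "ram"):
        if mode not in ("ram", "fifo"):
            raise ValueError("mode must be 'ram' or 'fifo'")
        self.mode = mode
        self.depth = 1 << address_bits        # 9 address inputs -> 512 words
        self._mem = [0] * self.depth          # storage used in RAM mode
        self._fifo = deque()                  # storage used in FIFO mode

    def write(self, data: int, addr: int = None) -> None:
        if self.mode == "fifo":
            if len(self._fifo) >= self.depth:
                raise OverflowError("FIFO full")
            self._fifo.append(data)
        else:
            self._mem[self._check(addr)] = data

    def read(self, addr: int = None) -> int:
        if self.mode == "fifo":
            return self._fifo.popleft()       # oldest word leaves first
        return self._mem[self._check(addr)]

    def _check(self, addr):
        # RAM mode requires a valid address; FIFO mode never gets here.
        if addr is None or not 0 <= addr < self.depth:
            raise IndexError("address out of range")
        return addr


ram = RamPE(address_bits=9, mode="ram")
ram.write(42, addr=511)                       # last word of a 512-word RAM
fifo = RamPE(address_bits=11, mode="fifo")    # 2k-word FIFO
fifo.write(1)
fifo.write(2)
print(ram.depth, ram.read(addr=511), fifo.depth, fifo.read())
```

Keeping both modes behind the same read/write interface mirrors the statement that RAM-PEs are nearly identical to ALU-PEs from the point of view of the array; only the configuration decides how the storage is accessed.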
Back Register and Forward Register can be configured to build a linear address generator. Thereby DMA to or from RAM can be done within one RAM-PE. Several RAM-PEs can be combined to a larger RAM with a contiguous address space.

I/O Elements are connected to horizontal channels. The standard I/O Element provides two modes:
• Streaming: Two ports per I/O Element are configured to input or output mode. The XPP packet handling is performed by a Ready-Acknowledge handshake protocol. Thus external data streams (e.g. from an A/D converter) need not be synchronous to the XPP clock.
• RAM: One output provides the addresses to the external RAM, the other is the bi-directional data port. External Synchronous Static RAMs are directly connected to the address ports, data ports and control signals. The maximum size of external RAMs depends on the data bus width of the XPP (e.g. 16 M words for the 24-bit architecture).

The Configuration Manager (CM) microcontroller handles all configuration tasks of the array. Initially it reads configurations through an external interface directly from SRAMs into its internal cache. Then it loads the configuration (i.e. opcodes, routing channels and constants) to the array. As soon as a PE is configured, it starts its operation if data is available. Further on, the CM loads subsequent configurations to the array. The local operating system ensures that the sequential order of configuration is maintained without deadlocks.

The structure of the XPP array of configurable elements is very simple, making the array homogeneous and simplifying programming and placing of algorithms. The IP model of XPP allows the size and arrangement of the processing elements to be defined according to the needs of the applications. In addition, the width of the data paths and ALUs can be defined between 8 and 32 bit.

XPP is designed to simplify the programming task and to allow high level compilers to tap the full parallel potential of the XPP. The most important XPP feature to support this is the packet handling. Data packets contain one processor word (e.g. 24-bit) and are created at the outputs of objects as soon as data is available. From there, they propagate to the connected inputs. If more than one input is connected to the output, the packet is duplicated. On the other hand, an XPP object starts its calculation only when all required input packets are available. If a packet cannot be processed, the pipeline stalls until the packet is processed. This mechanism ensures correct operation of the algorithm under all circumstances, and the programmer does not need to care about pipeline delays in the array or about how to synchronize to asynchronous external data streams.

3.2.2 Granularity
The PACT XPP array architecture is a coarse grain reconfigurable architecture.

3.2.3 Technology
XPP cores are technology independent. PACT provides XPP cores as synthesizable Verilog RTL code.

3.2.4 Reconfiguration
XPP arrays allow fast dynamic reconfiguration. In contrast to FPGAs, XPP needs only Kbits for a full configuration; internal RAMs buffer data between the configurations. For optimal performance, the amount of data calculated in one configuration should be as high as possible to minimize the effect of the reconfiguration latency. Small parts of the array can be reconfigured without the need to stop calculations of other configurations on the same array.

3.2.5 Design flow
The XDS development suite supports co-development and co-simulation of systems with the XPP array. The XDS is a complete set of tools for application development. Since in most applications XPP is used as a coprocessor to micro-controllers, the XDS provides a seamless design flow for both the micro-controller and the XPP. Derived from a data flow graph, algorithms are directly mapped onto the array. The graph's nodes directly define the functionality and operation of the ALU or other elements, whereas the edges define the connections between the elements. Such a configuration remains statically on the array and a set of data packets flows through this net of operators.

Applications are written in C or C++. In an environment with a micro-controller and the XPP as coprocessor, the software tasks are divided into two sections. The control-flow tasks are processed with the standard tools for the micro-controller, and the high bandwidth data-flow tasks that need support by the XPP are compiled by the XPP-VC. This vectorizing C compiler maps a subset of C to the XPP, and allows integrating optimized modules. These modules originate from a library, or are written for the application in the Native Mapping Language, NML. API functions for loading and starting of configurations, configuration sequencing, data exchange via DMA and task synchronization provide a comfortable environment for C programmers who are familiar with embedded designs. The linker combines code of both sections, which can either be simulated by software, or uploaded to the target hardware. The integrated debugging tool for the micro-controller and the XPP allows interactive test and verification of the simulation results or the hardware. The configuration and the data flow in the XPP are visualized in a graphical tool.

3.3 Elixent DFA1000
The Elixent DFA1000 accelerator [5] was designed from the ground up to deliver on the promise of Reconfigurable Signal Processing (RSP). Utilizing the D-Fabrix processing array, it aims at delivering large benefits in performance, power consumption and silicon area. These attributes make it well suited for integration with RISC processors in mobile/consumer/communications applications with demanding signal or media processing. These advantages are delivered through silicon reuse. The DFA1000 accelerator implements "virtual hardware": hardware accelerators for specific algorithms, implemented as simple configurations on the D-Fabrix processing array. When one algorithm completes, a new "virtual hardware" accelerator is loaded, performing the next task in the system's data flow.

3.3.1 Architecture
The basis for Elixent's DFA1000 is the D-Fabrix processing array, a platform that realises the potential of Reconfigurable Algorithm Processing. The structure of D-Fabrix is simple: the components are 4-bit ALUs, registers and the switchbox. Two of each are combined into a building block, the tile. Hundreds or thousands of tiles are combined to create the D-Fabrix array. Special functions can be distributed through the array; for example, memory is always distributed to give fast, local storage with massive bandwidth. Creating wider execution units is simply a matter of combining ALUs, typically into 8, 12 or 16-bit units, but occasionally into far larger units. Much of the task of linking the ALUs together in this way is performed by the array's routing switchboxes. The architecture of the D-Fabrix array is shown in Figure 3-14.

The DFA1000 accelerator integrates several banks of local high-speed RAM next to the array. These are for often-used data; for example, they may be used as image line stores, or as audio buffers. These RAMs eliminate many high bandwidth accesses off-chip, improving power consumption while at the same time enhancing performance.

Figure 3-14. Architecture of D-Fabrix array

The DFA1000 also includes a peripheral set to facilitate its integration into SoC designs. The architecture offers high-speed data interfaces to the D-Fabrix core array. This allows high-speed data to be driven into the array directly, with low latency and no overhead on the system bus. These high-speed data interfaces are supplemented by the AMBA bus interface, used for programming the array and transferring data to and from the host processor. This is typically a much lower bandwidth control and configuration path. The architecture also integrates local high-speed RAMs, directly accessible by the array or by the RISC, and of course the D-Fabrix array itself.

3.3.2 Granularity
The DFA1000 architecture is a medium granularity architecture based on 4-bit ALUs.

3.3.3 Technology
DFA1000 will be made available in different industry standard processes. The first realization was on a 0.18 µm technology.

3.3.4 Reconfiguration
DFA1000 can be dynamically reconfigured in microseconds.
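The 4-bit granularity noted in Sections 3.3.1 and 3.3.2 means that wider execution units are built by linking several 4-bit ALUs. A minimal sketch of that idea, assuming a plain ripple-carry chain for addition, is given below. The function names are hypothetical, and the sketch says nothing about how the D-Fabrix switchboxes actually route carries; it only shows that 8, 12 or 16-bit additions can be composed from 4-bit slices.

```python
def alu4_add(a: int, b: int, carry_in: int = 0):
    """One 4-bit ALU slice: returns (4-bit sum, carry out)."""
    total = (a & 0xF) + (b & 0xF) + carry_in
    return total & 0xF, total >> 4


def wide_add(a: int, b: int, width: int = 16):
    """Add two width-bit values by chaining width/4 slices through the carry."""
    if width % 4:
        raise ValueError("width must be a multiple of 4")
    result, carry = 0, 0
    for i in range(width // 4):
        # Feed nibble i of each operand, plus the carry of the previous slice.
        nibble, carry = alu4_add(a >> (4 * i), b >> (4 * i), carry)
        result |= nibble << (4 * i)
    return result, carry


print(wide_add(0xABCD, 0x1234, 16))   # four chained 4-bit slices
print(wide_add(0xFF, 0x01, 8))        # two slices, carry out of the top slice
```

Each call to `alu4_add` stands for one ALU in a tile; the loop plays the role of the interconnect that passes the carry from one slice to the next.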
3.3.5 Design flow
The key to using the DFA1000 accelerator is creating the high-performance virtual hardware configurations. D-Sign, the D-Fabrix algorithm processor's toolset, offers three main design styles for this purpose:
• HDL entry, using either Verilog or VHDL
• C-style entry, using Celoxica's Handel-C
• Matlab entry, using Accelchip's Accel-FPGA
All the design entry tools feed a common back-end. This performs optimisations to the code, before mapping resources to the D-Fabrix array. The entire process is automatic. Once the array description has been "compiled" for the architecture, it is placed and routed. This stage is analogous to the resource allocation phases that a compiler uses for a VLIW processor, allocating array resource to the functions within the algorithm. The output of the "place and route" tool is the final program.

3.3.6 Application area
D-Fabrix is suitable for several applications from networked multimedia (MPEG-4, JPEG, camera, graphics, rendering) to wireless (3G, CDMA, OFDM etc.) or even security (RSA, DES, AES...).

REFERENCES
1. Adapt2000 QuickSilver Technologies (2004) Available at: http://www.qstech.com/default.htm
2. AT40K Atmel (2004) Available at: http://www.atmel.com/atmel/products/prod39.htm
3. Cyclone II Altera (2004) Available at: http://www.altera.com/products/devices/cyclone2/cy2-index.jsp
4. DAPDNA IPFlex Inc (2004) Available at: http://www.ipflex.com/en
5. DFA1000 Elixent (2004) Available at: http://www.elixent.com/products
6. Gladiator Leopard Logic (2004) Available at: http://www.leopardlogic.com/products/index.php
7. MRC6011 Freescale (2004) Available at: http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=MRC6011&nodeId=01279LCWs
8. MS1 Morpho Technologies (2004) Available at: http://www.morphotech.com/
9. picoArray picoChip (2004) Available at: http://www.picochip.com/technology/picoarray
10. Spartan-3 Xilinx (2004) Available at: http://www.xilinx.com/xlnx/xil_prodcat_landingpage.jsp?title=Spartan-3
11. Stratix II Altera (2004) Available at: http://www.altera.com/products/devices/stratix2/st2-index.jsp
12. Virtex-4 Xilinx (2004) Available at: http://www.xilinx.com/xlnx/xil_prodcat_landingpage.jsp?title=Virtex-4
13. XPP IP cores PACT (2004) Available at: http://www.pactcorp.com/

PART B
SYSTEM LEVEL DESIGN METHODOLOGY

Chapter 4
DESIGN FLOW FOR RECONFIGURABLE SYSTEMS-ON-CHIP

Konstantinos Masselos (1,2) and Nikolaos S. Voros (1)
(1) INTRACOM S.A., Hellenic Telecommunications and Electronics Industry, Greece
(2) Currently with Imperial College of Science Technology and Medicine, United Kingdom

Abstract: A top down design flow for heterogeneous reconfigurable Systems-on-Chip is presented in this chapter. The design flow covers issues from system level design down to back end, technology dependent design stages. Emphasis is placed on issues related to reconfiguration, especially at system level, where existing flows do not cover such aspects.

Keywords: Design flow, system level, reconfiguration, reconfigurable Systems-on-Chip

N.S. Voros and K. Masselos (eds.), System Level Design of Reconfigurable Systems-on-Chips, 87-105. © 2005 Springer. Printed in the Netherlands.

1. INTRODUCTION
Heterogeneous Systems-on-Chip (SoCs) with embedded reconfigurable resources form an interesting option for the implementation of wireless communications and multimedia systems. This is because they offer the advantages of reconfigurable hardware combined with the advantages of other architectural styles such as general purpose instruction set processors and application specific integrated circuits (ASICs). Furthermore, such SoCs allow customization of the way reconfigurable resources are used (type and density of resources) depending on the targeted application or set of applications.

A generic view of a heterogeneous reconfigurable System-on-Chip is shown in Figure 4-1. Such a SoC will normally include instruction set processors (general purpose, DSPs, ASIPs), custom hardware blocks (ASICs) and reconfigurable hardware blocks. The embedded reconfigurable blocks can be either coarse grained (word level granularity) or FPGA like (bit level granularity). The different processing elements may communicate through a bus; however, current trends are more towards communication networks on chip (for scalability, flexibility and power consumption reasons).

[Figure 4-1. Abstract view of targeted implementation platform. Blocks shown: instruction set processors, direct mapped hardware (ASIC), distributed shared memory organization, communication network, fine grain reconfigurable hardware, coarse grain reconfigurable hardware.]

The design of a SoC with reconfigurable hardware is not a trivial task. To obtain an efficient implementation, an extended design flow is needed in order to cope with the reconfiguration aspects on a wide range of commercially available platforms. In addition, a high abstraction level methodology needs to be developed to help in deciding the instances of the implementation technologies, both for fine grained and coarse grained reconfigurable hardware. The requirements and the principles of such a design methodology are further discussed in the rest of this chapter. It must be noted that the design flow and high level design methods described in the rest of this chapter can equally be applied to off-the-shelf system level FPGAs that include embedded hardwired blocks (including software processors and ASIC blocks).

2. DESIGN FLOW REQUIREMENTS FOR RECONFIGURABLE SYSTEMS-ON-CHIP
The introduction of reconfigurable resources in Systems-on-Chip creates the need for modifications and extensions to conventional design flows with emphasis on the higher abstraction levels, where the most important design decisions are made. In this section, conventional system level design flows
are briefly presented and then system level design flow requirements for reconfigurable Systems-on-Chip are discussed.

2.1 Overview of conventional system level design flows
Driven by the SoC design growth, the demand for system level co-design methodologies is also increasing [6]. Academic and commercial sources have provided co-design methodologies/tools for a variety of application domains, with many hardware/software partitioning opportunities, synthesis, simulation and validation mechanisms, at different degrees of automation and levels of maturity.

As far as system specification is concerned, a variety of languages (HDL, object oriented, proprietary) are being used for system level specification. Some methodologies exploit a combination of languages in order to properly describe the hardware or software parts of the design. The trend is however to unify the system design specification in one description language capable of representing the system at a high level of abstraction [6].

The goal of hardware/software partitioning is the optimized distribution of system functions among software and hardware components. In this respect, the most beneficial methodologies are those that provide partitioning at different levels of modeling without the necessity of rewriting the hardware or software specifications. This not only reduces the design iteration steps, but also enables easy inclusion of predefined library elements or IP blocks.

An important feature that should be taken into account during co-synthesis is the possibility of interface synthesis. The different possible inter-process communication primitives are covered in different methodologies. They are either fixed to the particular methodology or come with the option of creating new primitives based on the existing ones.

Co-simulation techniques range from commercial simulation based on methodology specific simulation engines to combinations of multiple simulation engines. Most of the methodology dependent co-simulators are based on event driven simulation, while some of them come with an option for co-simulation with other simulators [8].

Co-verification is mainly simulation based, meaning that the results of the HDL, ISS or proprietary simulations at different levels of the co-design flow are compared with the initial specifications for correct functionality and timing. Debugging is enabled in some methodologies by exploiting a graphical tool or a proprietary user interface environment.

The main features of representative system level hardware/software co-design methodologies are summarized in Table 4-1.

As a natural consequence of what has been mentioned in the previous paragraphs, it is concluded that a generic traditional system level design flow usually involves the following key phases:
• System specification
• Hardware/software partitioning and mapping
• Architecture design
• System level (usually bus cycle accurate) simulation and
• Fabrication of hardware and software using tools provided by technology vendors.

Table 4-1. Summary of the main features of system level hardware/software co-design methodologies

OCAPI-XL
  System specification: Using C++ for functionality and architectural properties
  HW/SW partitioning: Concurrent processes; partitioning on these processes can be made anywhere in the design flow
  Co-synthesis: Interface synthesis and industrial tools for RTL synthesis
  Co-simulation/co-verification: Unified co-simulation environment, performance estimation, co-simulation with other simulation engines
  Remark: Future versions build on top of SystemC

SystemC
  System specification: Using SystemC (based on C++) for functionality and architecture
  HW/SW partitioning: System specifications can be refined to mixed SW and HW implementations
  Co-synthesis: Channels, interfaces and events make it possible to model communication and synchronization
  Co-simulation/co-verification: Simulation engine included, performance estimation
  Remark: Becoming industry standard

VCC
  System specification: Using C/VHDL for architecture and functionality
  HW/SW partitioning: Template models of architecture onto which software and hardware are mapped
  Co-simulation/co-verification: Unified co-simulation environment with emphasis on performance estimation
  Remark: Perhaps the most complete toolset

Chess/Checkers
  System specification: Using the proprietary nML language for processor architecture, C for application
  Co-simulation/co-verification: Retargetable instruction set simulator simulates the execution of code on the target processor
  Remark: Useful for the design of embedded processors

CHINOOK
  System specification: Using Verilog for functionality and pre-defined components for architecture
  HW/SW partitioning: Allocate functionality to processors
  Co-synthesis: Interface synthesis and industrial tools for RTL synthesis
  Co-simulation/co-verification: HW/SW co-simulation engine included
  Remark: Includes interface synthesis but requires tool specific models of processors and buses

COOL
  System specification: Using a subset of VHDL for architecture and functionality
  HW/SW partitioning: Synthesis and compilation tools used to compute the value of the cost metrics; specific algorithms to solve the HW/SW partitioning
  Co-synthesis: Netlist & controllers for communication between HW and SW generated in VHDL
  Co-simulation/co-verification: Commercial VHDL simulator to simulate functionality of the system specification and its implementation after co-synthesis
  Remark: Precise modeling of cost and performance metrics

COSYMA
  System specification: One processor and Verilog functionality
  HW/SW partitioning: Allocate all to SW, then move slowest parts to HW
  Co-synthesis: Interface synthesis, and included tools for RTL synthesis
  Co-simulation/co-verification: HW/SW co-simulation engine included
  Remark: Applicable only to one processor architecture with hardware co-processor

N2C
  System specification: C/C++, SystemC for system level description, extended C for hardware
  HW/SW partitioning: Manual
  Co-synthesis: Automatic interface synthesis and industrial tools for RTL synthesis
  Co-simulation/co-verification: Simulation environment; co-simulation with commercially available instruction set simulators
  Remark: Support for IP cores

Esterel
  System specification: Esterel language
  Co-simulation/co-verification: BDD and temporal logic based verification techniques
  Remark: Compilation of Esterel programs into FSM, HW or C programs

LYCOS
  System specification: Subset of C for SW, subset of VHDL for HW
  HW/SW partitioning: Different partitioning models and algorithms available
  Co-synthesis: HW/SW communication through memory mapped I/O
  Remark: Experimental co-synthesis environment

MESH
  System specification: Textual
  HW/SW partitioning: Modeling three independent layers for SW, scheduler/protocol and HW resource
  Remark: Project in early research phase

Ptolemy
  System specification: Many models of computation that can be used in a single design
  Co-synthesis: Some code-generation tools
  Co-simulation/co-verification: Powerful co-simulation engine for different models of computation
  Remark: Features vary with models of computation

2.2 System level design flow requirements for reconfigurable Systems-on-Chip
The way in which the presence of embedded reconfigurable resources affects the major stages of a system level design flow, and the additional requirements it creates, are discussed in this subsection.

2.2.1 System specification
In the system specification phase, the requirements, restrictions and specifications are gathered in the same way as when reconfigurable resources are not used, but extra effort must be spent on identifying parts of the applications that serve as candidates for implementation with reconfigurable hardware. The incorporation of reconfigurable hardware brings new aspects to the architecture design task and to the partitioning and mapping task. In the architecture design task, a new type of architectural element is introduced. In the architectural design space, the reconfigurable hardware can be viewed as a time slice scheduled application specific hardware block. One way of incorporating reconfigurable parts into an architecture is to replace some
hardware accelerators with a single reconfigurable block. The effects of reconfigurable blocks on the area, speed and power consumption should be completely understood before they can be efficiently used.

2.2.2 Hardware/software partitioning and mapping
During this phase, a new dimension is added to the problem. The parts of the targeted system that will be realized on reconfigurable hardware must be identified. There are some rules of thumb that can be followed to give a simple solution to this problem:
• If the application has several hardware accelerators of roughly the same size that are not used at the same time or at their full capacity, a dynamically reconfigurable block may be a more optimized solution than a hardwired logic block.
• If the application has some parts in which specification changes are foreseeable, the implementation choice may be reconfigurable hardware.
• If there are foreseeable plans for new generations of the application, the parts that will change should be implemented with reconfigurable hardware.

Furthermore, for the design of reconfigurable hardware, instead of considering just area, speed and power consumption as happens in traditional hardware design, the temporal allocation and scheduling problem must also be addressed. This is achieved in a way similar to the policies followed for software tasks running on a single processor. This leads to increased complexity in the design flow, since the cost functions of the functionality implemented with reconfigurable technology include the problems of both hardware and software design.

There are basically two partitioning/mapping approaches supported by the existing commercial design flows: (a) the tool oriented design flow, and (b) the language oriented design flow. Examples of tool oriented design flows are N2C by CoWare [7] and VCC by Cadence [5]. The design flows supported by these tools work well on traditional hardware/software solutions. Nevertheless, the refinement process of a design from a unified and un-timed model towards RTL is tool specific, and the incorporation of new reconfigurable parts is not possible without unconventional workarounds. Examples of language oriented design flows are OCAPI-XL [12] and SystemC [13]. Especially for the latter, since it promotes the openness of the
language and the standard, the addition of a new domain can be made to the core language itself. However, the mostly preferred method is to model the basic constructs required for modeling and simulation of reconfigurable hardware using basic constructs of the language. In this way, the language compatibility with existing tools and designs is preserved. SystemC extensions for reconfigurable hardware design and OCAPI-XL are thoroughly covered in Chapters 5 and 6 respectively.

2.2.3 Architecture design
A design flow that supports system descriptions at a high abstraction level must also support the reconfigurable technologies of different types and vendors. The main question that must be answered, even at the highest level of abstraction, is: What to implement with reconfigurable technology and which reconfigurable technology to use? (A brief introduction to existing reconfigurable hardware technologies is presented in Chapter 3.)

The design flow may answer these questions by using different techniques. First, analysis based tools compile the unified representation of the application functionality and produce information on which parts of the application are never run in parallel. This information can be used to determine what functionality can be implemented in different contexts of a reconfigurable block. An alternative method is the use of cost functions for each implementation technology. Cost functions help in making quick design decisions using several parameters and optimization criteria at the same time. Another category of tools uses profiling information gathered in simulations in order to partition the application and to produce a context scheduler to be used in the final implementation. An example of this approach is a toolset for the MorphoSys [14] reconfigurable architecture.

Finally, the most realistic alternative for industrial applications is the simulation based approach. In this approach, the partitioning, mapping and scheduling are accomplished manually by the designer, while the results and the efficiency are verified through simulations. This approach is also the easiest to incorporate into an existing flow, since the required tool support is limited compared to the previous approaches. It also leaves all the design decisions to the designer, which is preferred in many industrially used design flows.

When considering designing additions to a language or a tool that can support modeling and simulation of reconfigurable technologies, a set of parameters that differentiate the implementation technologies needs to be identified: (a) the reconfigurable block capacity in gates, (b) the amount of context memory required to hold configurations, (c) the reconfiguration time and support for partial reconfiguration, (d) typical clock or transaction speed, and (e) power consumption information. The aforementioned parameters are adequate for modeling any type of homogeneous reconfigurable technology. The simulation accuracy resulting from using these parameters is not optimal, but it is sufficient for giving the designer an idea of how each different reconfigurable technology affects the total system performance.

The results needed for steering the design space exploration, and for verifying that the design decisions fulfill the total system performance, are:
• Spatial utilization, which is needed to validate the correct size of the block and also the granularity of the contexts.
• Temporal utilization, which is measured to compare the time spent in configuring the block, waiting for activation and actively doing the computation.
• Context memory bus load, which is measured to analyze the effects of the reconfiguration memory bus traffic on the performance of system buses.
• Area and power consumption, which are compared against a hardware or software implementation.

The aforementioned results should be used as additional information in order to decide which reconfigurable technology to use and which parts of the application will be implemented with it. When comparing the requirements pertaining to reconfigurability with existing design flows, it can be seen that the existing design flows and tools do not support any of the requirements directly. Either the tools and languages should be improved or company specific modifications are needed [1, 2, 3].

3. THE PROPOSED DESIGN FLOW FOR RECONFIGURABLE SoCs
This section provides the general framework of the proposed design flow for designing complex SoCs that contain reconfigurable parts. The flow aims to improve the design process of a SoC in order to use the available tools in an optimal way [11]. The main idea of the proposed design flow is to identify the parts of a co-design methodology where the inclusion of reconfigurable technologies has the greatest effect. This is very important since there are no commercial tools or methodologies to support reconfigurable technologies yet. The design flow is divided in three parts as shown in Figure 4-2. The System-Level Design (SLD) refers to the high level part of the proposed flow, while the Detailed Design (DD) and Implementation Design (ID) refer to the back end part of the methodology.

[Figure 4-2. The proposed Design Flow. System-Level Design: system requirements/specification capture, architecture definition, system partitioning, mapping, system-level simulation. Detailed Design: specification refinement; hardware design, software design and reconfigurable hardware design; external IP; integration; co-verification. Implementation Design: FPGA/ASIC implementation design, software implementation design, verification, FPGA downloading/silicon manufacturing, product qualification.]

Details on the formalisms used are thoroughly covered in Chapters 5 and 6, while Chapters 7, 8 and 9 provide information on how the proposed framework can be applied for the design of real world case studies.

3.1 System Level Design (SLD)
At the SLD phase, the main targets are:
• to develop a specification of the application associated with the requirements captured (and analyzed),
• to design the architecture of the SoC,
• to select major implementation technologies,
• to partition the application for implementation in hardware, software or reconfigurable hardware, and
• to evaluate the performance of the partitioned system.

The requirements are captured and analyzed in the specification phase and the results are fed to the next phases of the design flow. Architecture templates can be used to derive an initial architecture. They can be based on previous versions of the same product, a different product in the same product family, a design/implementation platform provided by the design tool or semiconductor vendor, or even on information about a similar system by a competitor. At the architecture definition phase, bus cycle accurate models of the architectural units are created, so that the performance of the architecture can be evaluated using system level simulations. In the partitioning phase, the functional model of the application is partitioned into software, hardware and reconfigurable hardware. These partitions are then mapped onto the architecture, annotated with estimations of timing and other characteristics needed in the mapping phase.

At the SLD, the reconfiguration issues emerge in the following forms:
• The goals for reconfiguration (e.g. flexibility for specification changes and performance scalability) with associated constraints are identified at the requirements and specification step.
• At the design space exploration step, the reconfigurable hardware manifests itself as a computing resource in a similar way as an instruction set processor or a block of fixed hardware, thus bringing a new dimension to the design space exploration.

3.2 Detailed Design (DD)
At the DD phase, the specifications are refined and verification is planned according to targeted implementation technologies, processors etc. The design tools used are fixed according to the selected processors and the chosen reconfigurable and fixed hardware technologies. Additionally, the verification and testing strategy is planned. After this, the individual partitions of hardware, software and reconfigurable hardware are designed and verified. When all parts are finished, the designed modules of hardware, software and reconfigurable hardware are integrated into a single model. In the co-verification step, the functionality of the integrated model is checked against the reference implementation or the executable specification. Moreover, implementation related issues like timing and power consumption are modeled. If the results are satisfactory, the design is moved to the Implementation Design phase; otherwise iterations to Detailed Design or even to System Level Design phases are required.

At the DD, the reconfiguration issues emerge in the following ways:
• At the specification refinement and technology specific design, the reconfigurable hardware requires communication mechanisms to software and/or fixed hardware to be added; in case of dynamic reconfiguration, mechanisms to handle context multiplexing are also needed.
• The integration and co-verification combines the reconfigurable hardware components with other hardware and software components onto a single platform that also accommodates external IP (e.g. processor, memory and I/O sub-system models) and provides co-verification of the overall design. The reconfigurable hardware is simulated in an HDL simulator or emulated in an FPGA emulator.
• Specific HDL modeling rules need to be followed for multiple dynamically reconfigurable contexts [2, 3].
• The reconfigurable hardware modules must be implemented using the selected technology, including the required control and support functions for reconfiguration.
• In the integration and verification phases, the vendor specific design and simulation/emulation tools must be used.

3.3 Implementation Design (ID)
At the ID, the reconfiguration issues emerge in the following forms:
• Dynamic reconfiguration requires configuration bitstreams of multiple contexts to be managed.
• Specific design rules and constraints must be followed for multiple dynamically reconfigurable contexts [2, 3].

4. RECONFIGURATION ISSUES IN THE PROPOSED DESIGN FLOW
As indicated in the previous section, there are several issues regarding reconfiguration. The next sections emphasize how these aspects are addressed in the context of the proposed design framework. The focus is on system level design issues, although detailed and implementation design aspects are briefly discussed to complete the picture.

4.1 Reconfiguration issues at System Level Design

4.1.1 Needs and Requirements for Reconfiguration
The requirements and specification capture identifies the required functionality, performance, critical physical specifications (e.g. area, power) and the development time required for the system. All the aforementioned characteristics are described in the form of an executable model, where the goals for reconfiguration (e.g. flexibility for specification changes and performance scalability) are identified as well. In general, simultaneous flexibility and performance requirements form the basic motivation for using reconfiguration in System-on-Chip designs. Otherwise either pure software or fixed hardware solutions could be more competitive. Reconfigurable technologies are a promising solution for adding flexibility, while not sacrificing performance and implementation efficiency. They combine the c
apabilityofpostfabricationfunctionalitymodificationwiththespatial/parallelcomputationstyle.Theinclusionofreconfigurablehardwaretoatelecommunicationsystemmayintroducesignificantadvantagesbothfrommarketandimplementationpointsofview:•Upgradability−Needtoconformtomultipleormigratinginternationalstandards−Emergingimprovementsandenhancementstostandards−Desiretoaddfeaturesandfunctionalitytoexistingequipment−Serviceprovidersarenotsurewhattypesofdataserviceswillgeneraterevenueinthewirelesscommunicationsworld−Introductionofbugfixingcapabilityforhardwaresystems.•Adaptivity−Changingchannel,trafficandapplications−Powersavingmodes.Althoughthereconfigurablehardwareisbeneficialinmanycases,significantoverheadsmayalsobeintroduced.Thesearemainlyrelatedto 1000Chapter4thetimerequiredforthereconfigurationandtothepowerconsumedforreconfiguringasystem.Areaimplicationsarealsointroduced(memoriesstoringconfigurations,circuitsrequiredtocontrolthereconfigurationprocedure).Therequirementscaptureshouldidentifyanddefinethefollowingreconfigurationaspects:•Typeofreconfigurationwantedinthesystem−Staticordynamic(singleormultiplecontexts)−Levelofgranularity(fromcoarsetofine)−Styleofcoupling(fromlooselytocloselycoupled).•Requirementsandconstraintsonsystemproperties(performance,power,cost,etc)•Requirementsandconstraintsondesignmethodology(pre-definedarchitecture,pre-selectedtechnologiesandIPs,tools,etc)Theinformationoutlinedaboveisneededinthelaterstagesofthedesignflow.However,thetechniquesforidentificationofneedsandcaptureofrequirementsarecompanyspecific.4.1.2ExecutableSpecificationThespecificationcaptureissimilartothecaseofsystemsthatemployonlytraditionalhardware.ThefunctionalityofthesystemisdescribedusingaC-likeformalisme.g.SystemC,OCAPI-XL.Theexecutablespecificationcanbeusedforseveralpurposes:•Thetestbenchusedinallphasesofthedesignflowcanbederivedfromtheexecutablespecification.•Thecompilertoolsandprofilinginformationmaybeusedtodeterminewhichpartsofanapplicationaremostsuitableforimplementingwithdynamica
llyreconfigurablehardware.Thisisachievedinthepartitioningphaseofthedesignflow.•Theabilitytoimplementexecutablespecificationvalidatesthatthedesignteamhassufficientexpertiseontheapplication.Executablespecificationisamustinordertobeabletotacklereconfigurabilityissuesatthesystemleveldesign.4.1.3DesignSpaceExplorationThedesignspaceexplorationphaseanalysesthefunctionalblocksoftheexecutablemodelwithrespecttoreconfigurablehardwareimplementations.Morespecifically:•Itdefinesarchitecturemodelscontainingreconfigurableresourcesbasedontemplates. 4.DesignFlowforReconfigurableSystems-on-Chip101•Itdecidesthesystempartitioningontoreconfigurableresources(inadditiontohardwareandsoftware)basedontheanalysisresults.•Itmapsthepartitionedmodelontoselectedarchitecturemodels.•Itperformssystemlevelsimulationtoestimatetheperformanceandresourceusageoftheresultingsystem.Thearchitectureofthedeviceisdefinedpartlyinparallelandpartlyusingthesystemspecificationasinput.Theinitialarchitecturedependsonmanyfactorsinadditiontotherequirementsoftheapplication.Forexamplesacompanymayhaveexperienceandtoolsforcertainprocessorcoreorsemiconductortechnology,whichrestrictsthedesignspace.Moreover,thedesignofmanytelecomproductsdoesnotstartfromscratch,sincetheyimplementadvancedversionsofexistingdevices.Thereforetheinitialarchitectureandthehardware/softwarepartitioningisoftengivenatthebeginningofthesystemleveldesign.Therearealsocaseswherethereusepolicyofeachcompanymandatesdesignerstoreusearchitecturesandcodemodulesdevelopedinpreviousproducts.Theoldmodelsofanarchitecturearecalledarchitecturetemplates.Asfarasdynamicreconfigurationisconcerned,itrequirespartitioningtoaddressbothtemporalandspatialdimensions.Automaticpartitioningisstillanunsolvedproblem,butinspecificcasessolutionsfortemporalpartitioning[4],taskschedulingandcontextmanagement[10]havebeenproposed.InthecontextofindustrialSoCdesign,however,thesystempartitioningismostlyamanualeffort.Basedontheneedsandrequirementsforreconfiguration,theexecutablespecificationisanalyz
edinordertoidentifypartsthatcouldgainbenefitsfromimplementationonreconfigurableresources.Thisanalysiscanbesupportedbyestimationsofperformanceandareadonewithrespecttopre-selectedtechnologies,architecturesandIPs,e.g.specificISPandreconfigurabletechnology.Duringthemappingphase,thefunctionalitydefinedinexecutablespecificationisrefinedaccordingtothepartitioningdecisionssothatitcanbemappedontothedefinedarchitecture.Inordertoincludeinthesystemlevelsimulationtheeffectsofthechosenimplementationtechnology,differentestimationtechniquescanbeused:•Softwarepartsmaybecompiledforgettingrunningtimeandmemoryusageestimates.•Hardwarepartsmaybesynthesizedathighleveltogetestimatesofgatecountsandrunningspeed.•Thefunctionalblocksimplementedwithreconfigurablehardwarearealsomodelledsothattheeffectsofreconfigurationcanbeestimated.Finallysimulationsarerunatthesystemlevel,togetinformationconcerningtheperformanceandresourceusageofallarchitecturalunitsofthedevice. 1022Chapter4Efficientdesignspaceexplorationisthecoreoftheproposeddesignframework.Withrespecttothedesignofreconfigurablesystemsparts,itsupports:•Earlyestimationoffunctionblocks/processesforperformance(hardware,softwareandreconfigurable),cost(area)etc.•Systempartitioning,especiallymulticontextpartitioningandscheduling•Architecturedefinition•Mapping•Performanceevaluation.4.2ReconfigurationissuesatDetailedDesignThespecificationrefinementandtechnologyspecificdesigntransformthefunctionalblocksoftheexecutablemodeltodesigncomponentstargetingreconfigurablehardware(inadditiontohardwareandsoftware)accordingtothepartitioningdecisions.Importantissuesatthisstageincludeiterativeimprovementsinhardware,softwareandreconfigurablehardwarespecification.Thedesignerstakeintoaccountnotonlydesign(modelinglanguage,targetedplatform,co-simulationandtestingstrategy),butalsoeconomicalandproductsupportaspectsofthedesign,exploitingthespecificreconfigurablehardwarefeatures.Theintegrationphasecombinesthehardware,softwareandreconfigurablehardwarecomponentsintoasingleplat
formthataccommodatesalsoexternalIPe.g.processor,memory,I/Osub-systemmodels.Theintegrationphaseconsiderstwodifferentapproaches:languagebasedapproach(SystemC,OCAPI-XL)andtoolsorientedapproach(CoWareN2C)tocombinetheheterogeneouscomponentsofthetargetsystemonasingleplatform.Thereconfigurablehardwarerequirescommunicationmechanismstosoftwareand/orfixedhardwaretobeadded.Differenttypesofmechanismscanbechosentohandlecommunicationbetweenthecomponents:memorybasedcommunication,busbased,coprocessorstyleandevendatapathintegratedreconfigurablefunctionalunits.Busbasedcommunicationbetweenthecomponentsrequiresspecificinterfacesforboththereconfigurablefabricandhardware/softwaresidesofthesystem.Onthesoftwareside,driversarerequiredtoturnsoftwareoperationsintosignalsonthehardware.OntheFPGAfabricandhardwareside,interfacestothesystembusmustbebuilt.TheFPGAfabricandCPUcanalsocommunicatedirectlybysharedmemory.Regardingthesoftwareandfixedhardwaredesignflows,theydonotdifferfromtraditionalones.Forstaticallyreconfigurablehardwarethe 
design flow is similar to that of fixed hardware. For dynamically reconfigurable hardware, the module interfaces, communication and synchronization are designed according to the principles of a context scheduler. Specific HDL modeling rules need to be followed for multiple dynamically reconfigurable contexts [3, 9]. In the case of dynamic reconfiguration, mechanisms to handle context multiplexing are also needed. A high level scheme for describing dynamic reconfiguration should address how dynamically reconfigurable circuits compose with other circuits over a bus structure.

4.3 Reconfiguration issues at Implementation Design

Reconfiguration partitions the application temporally and multiplexes the programmable logic in time to meet the hardware resource constraints. When reconfiguration takes place at run time, the reconfiguration time is part of the run time overhead and has to be minimized. Also, multiple reconfiguration bitstreams need to be stored for the different contexts being multiplexed onto the programmable logic. This problem is exacerbated for System-on-Chip implementations where the entire application needs to be stored in on-chip memory.

When multiple context reconfigurable techniques are considered [3, 9], dedicated partitioning and mapping techniques are applied during the System Level Design phase. Later, during the Implementation Design step, an inter-context communication scheme has to be provided. Inter-context communication refers to how data or control information is transferred among different contexts. Usually, transfer registers are used for interconnecting between the previous (last) and the current (next) context. Backup registers are also used to store the status values when the context switches out and later switches in. When bulk buffers are more practical for inter-context communication, memory regions can be allocated anywhere in the chip by using the memory mode of the reconfigurable cells. These memory regions can be accessed from all the contexts as shared buffers. It is instructive to compare this high bandwidth for inter-context communication with a multiple FPGA situation, where bandwidth is inherently limited by the external pins. The huge bandwidth makes multi-context partitioning much easier than multi-FPGA partitioning.

5. CONCLUSIONS

The design flow for reconfigurable SoCs presented in the previous sections is divided into three phases. In the system level design phase, the requirements and specifications are captured; functionality in the form of an executable specification is analyzed, partitioned and mapped onto the architecture; and the performance of the system is validated. In the detailed design phase, the communication and modules are refined and transformed, integrated and co-verified through co-simulation or co-emulation. The implementation design maps the design onto the selected implementation platform.

The implementation technologies treated in this methodology are software executed in an instruction set processor, traditional fixed hardware and dynamically reconfigurable hardware. Emphasis is given to the system level part of the design flow, where methods for the modeling and simulation of the reconfigurable hardware parts of a reconfigurable SoC are required. Methods and tools towards this direction are presented in Chapters 5 and 6 respectively.

REFERENCES

1. ADRIATIC Project IST-2000-30049 (2002) Deliverable D2.2: Definition of ADRIATIC High-Level Hardware/Software Co-Design Methodology for Reconfigurable SoCs. Available at: http://www.imec.be/adriatic
2. ADRIATIC Project IST-2000-30049 (2003) Deliverable D3.2: ADRIATIC back-end design tools for the reconfigurable logic blocks. Available at: http://www.imec.be/adriatic
3. ADRIATIC Project IST-2000-30049 (2004) Addendum to Deliverable D3.2: ADRIATIC back-end design tools for the reconfigurable logic blocks. Available at: http://www.imec.be/adriatic
4. Bobda C (2003) Synthesis of Dataflow Graphs for Reconfigurable Systems using Temporal Partitioning and Temporal Placement. PhD Dissertation, University of Paderborn
5. Cadence (2004) http://www.cadence.com/datasheets/vcc_environment.html
6. Cavalloro P, Gendarme C, Kronlof K, Mermett J, Van Sas J, Tiensyrja K, Voros NS (2003) System Level Design Model with Reuse of System IP. Kluwer Academic Publishers
7. CoWare Inc (2004) Available at: http://www.coware.com
8. Gioulekas F, Birbas M, Voros NS, Kouklaras G, Birbas A (2005) Heterogeneous System Level Co-Simulation for the Design of Telecommunication Systems. Journal of Systems Architecture (to appear), Elsevier
9. Keating M, Bricaud P (1999) Reuse Methodology Manual. Second Edition, Kluwer Academic Publishers
10. Maestre R, Kurdahi FJ, Fernandez M, Hermida R, Bagherzadeh N, Singh H (2001) A framework for reconfigurable computing: task scheduling and context management. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 9, issue 6, pp. 858–873
11. Masselos K, Pelkonen A, Cupak M, Blionas S (2003) Realization of wireless multimedia communication systems on reconfigurable platforms. Journal of Systems Architecture, vol. 49, no. 4–6, pp. 155–175
12. OCAPI-XL (2004) Available at: http://www.imec.be/ocapi/welcome.html
13. SystemC (2004) Available at: http://www.systemc.org
14. Tiwari V, Malik S, Wolfe A, Lee MTC (1996) Instruction level power analysis and optimization of software. Journal of VLSI Signal Processing, Kluwer Academic Publishers, pp. 223–238

Chapter 5
SYSTEMC BASED APPROACH

Yang Qu and Kari Tiensyrjä
VTT Electronics, P.O. Box 1100, FIN-90571 Oulu, Finland

Abstract: This chapter describes the SystemC based modelling techniques and tools that support the design of reconfigurable systems-on-chip (SoC). For the design of reconfigurable parts at system level, we developed: 1) an estimation method and tool for estimating the execution time and the resource consumption of function blocks on dynamically reconfigurable logic to support system partitioning, 2) a SystemC based modeling method and tool for reconfigurable parts to allow fast design space exploration through 3) system-level simulation using transaction-level models of the system.

Keywords: Configuration overhead; context switching; design space exploration; dynamic reconfiguration; estimation; mapping; partitioning; reconfigurable; reconfigurability; SystemC; system-on-chip; workload model.

1. INTRODUCTION

Reconfigurability does not appear as an isolated phenomenon, but as a tightly connected part of the overall SoC design flow. The SystemC-based approach is therefore not intended to be a universal solution to support the design of any type of reconfigurability. Instead, we focus on a case where the reconfigurable components are mainly used as co-processors in SoCs.

SystemC 2.0 is selected as the backbone of the approach since it is a
standard language that provides designers with basic mechanisms like channels, interfaces and events to model the wide range of communication and synchronization found in system designs. More sophisticated mechanisms for the system-level design can be built on top of the basic constructs. Due to the standard language and open source reference implementation, SystemC 2.0 has become a language of choice for a growing number of system architects and system designers.

N.S. Voros and K. Masselos (eds.), System Level Design of Reconfigurable Systems-on-Chips, 107-131. © 2005 Springer. Printed in the Netherlands.

The SystemC based approach covers the reconfiguration extension and the related methods and tools that can be easily embedded into a SoC design flow. The system-level design part of the design flow presented in Chapter 4 is shown in Figure 5-1.

[Figure 5-1: flow diagram linking System Requirements/Specification Capture, Architecture Template, Architecture Definition, System Partitioning, IP, Mapping and System-Level Simulation within System-Level Design]
Figure 5-1. System-level design part of proposed design flow.

The following new features are identified in each phase of system-level design when reconfigurability is taken into account:
• System Requirements and Specification Capture needs to identify requirements and goals of reconfigurability.
• Architecture Definition needs to treat the reconfigurable resources as abstract models and include them in the architecture models.
• System Partitioning needs to analyze and estimate the functions of the application for software, fixed hardware and reconfigurable hardware.
• Mapping needs to map functions allocated to reconfigurable hardware onto the respective architecture model.
• System-Level Simulation needs to observe the performance impacts of architecture and reconfigurable resources.

In the SystemC based approach, we assume that the design does not start from scratch, but that it is a more advanced version of an existing device. The new architecture is defined partly based on the existing architecture and partly using the system specification as input. The initial architecture is often dependent on many things not directly resulting from the requirements of the application. The company may have experience and tools for a certain processor core or semiconductor technology, which restricts the design space and may produce an initial hardware/software (HW/SW) partition. Therefore, the initial architecture and the HW/SW partition are often given at the beginning of the system-level design.

The SystemC extension is designed to work with a SystemC model of the existing device to suit the design considering dynamically reconfigurable hardware. Figure 5-2 (a) gives a graphical view of the initial architecture, and Figure 5-2 (b) shows the modified architecture obtained using the SystemC based extensions. The way the SystemC based approach incorporates dynamically reconfigurable parts into the architecture is to replace the SystemC models of some hardware accelerators with a single SystemC model of a reconfigurable block. The objective of the SystemC based extensions is to provide a mechanism that allows designers to easily test the effects of implementing some components in the dynamically reconfigurable hardware. The support provided in the SystemC based approach includes:
• Analysis support for design space exploration and system partitioning.
• Reconfigurability modelling by using standard mechanisms of SystemC.
• System-level simulation using transaction-level models of the application workload and the architecture.

[Figure 5-2: block diagrams of (a) SW functions, CPU, DMA, MEM and HW accelerators, and (b) SW functions, CPU, DMA, MEM, one HW accelerator and a reconfigurable fabric hosting HW accelerator functionality]
Figure 5-2. (a) Typical SoC architecture and (b) modified architecture using dynamically reconfigurable hardware.
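The replacement of several accelerator models by a single reconfigurable block can be illustrated with a minimal sketch in plain C++ (deliberately not SystemC, so that it builds with a standard compiler only). The names Slave, Accelerator, ReconfigurableBlock and Bus are assumptions made for illustration and are not part of the SystemC extension; the sketch shows only the address-decoding view, not timing or reconfiguration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical address-mapped slave, as seen from the system bus.
struct Slave {
    virtual ~Slave() = default;
    virtual bool covers(std::uint32_t addr) const = 0;
    virtual std::string name() const = 0;
};

// Fixed hardware accelerator owning one contiguous address range.
class Accelerator : public Slave {
public:
    Accelerator(std::string name, std::uint32_t lo, std::uint32_t hi)
        : name_(std::move(name)), lo_(lo), hi_(hi) { assert(lo <= hi); }
    bool covers(std::uint32_t addr) const override { return addr >= lo_ && addr <= hi_; }
    std::string name() const override { return name_; }
private:
    std::string name_;
    std::uint32_t lo_, hi_;
};

// Single block that answers for every accelerator it hosts.
class ReconfigurableBlock : public Slave {
public:
    explicit ReconfigurableBlock(std::vector<Accelerator> hosted)
        : hosted_(std::move(hosted)) { assert(!hosted_.empty()); }
    bool covers(std::uint32_t addr) const override {
        for (const Accelerator& a : hosted_)
            if (a.covers(addr)) return true;
        return false;
    }
    std::string name() const override { return "reconfigurable_block"; }
private:
    std::vector<Accelerator> hosted_;
};

// Minimal address decoder: returns the slave owning addr, or nullptr.
class Bus {
public:
    void attach(const Slave* s) { slaves_.push_back(s); }
    const Slave* decode(std::uint32_t addr) const {
        for (const Slave* s : slaves_)
            if (s->covers(addr)) return s;
        return nullptr;
    }
private:
    std::vector<const Slave*> slaves_;
};
```

With two accelerators attached directly, an address decodes to the accelerator that owns it; once the same accelerators are hosted by one ReconfigurableBlock, the same address decodes to the block, and the rest of the bus model is unchanged.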
2. SYSTEMC 2.0 OVERVIEW

SystemC is a standard modelling language based on C++. Its version 1 provides a class library that implements objects like processes, modules, ports, signals and data types for hardware modelling. The model is compiled by a standard C++ compiler for execution on an event based simulation kernel. Version 2 introduces a language architecture shown in Figure 5-3 [1]. It provides core language constructs like channels, interfaces and events for system-level modelling. Elementary and more sophisticated channels can be built using the core language to support various communication, synchronization and model of computation paradigms. The basic system-level constructs of the language are introduced in the following sections, but for more complete information it is advisable to read the Functional Specification for SystemC 2.0 [2].

[Figure 5-3: layered diagram, from top to bottom]
Standard Channels for Various MOC's (Kahn Process Networks, Static Dataflow, etc.) | Methodology-Specific Channels (Master/Slave Library, etc.)
Elementary Channels (Signal, Timer, Mutex, Semaphore, Fifo, etc.)
Core Language (Modules, Ports, Processes, Interfaces, Channels, Events) | Data Types (Logic Type (01XZ), Logic Vectors, Bits and Bit Vectors, Arbitrary Precision Integers, Fixed Point Integers)
C++ Language Standard
Figure 5-3. SystemC language architecture.

2.1 Channels

SystemC 2.0 channels implement one or many interfaces and they contain the functionality of the communication. Channels are used especially in designing and simulating the functionality of buses. Functionality such as addresses, addressing schemes, priorities, buffer sizes etc. can be configured
at run time, and therefore the effect of these design decisions can be simulated easily without large modifications to the code. Also, since it is possible to attach multiple ports to an interface, the number of bus masters or slaves can be chosen at compile time without modifying the bus code. When system level modules are implemented correctly for the use of parameters and a variable number of connected ports, design space exploration becomes an easy task.

2.2 Ports and Interfaces

The model of communication in SystemC 2.0 can be more abstract than in a register-transfer level (RTL) description. The user can define a set of interface methods that modules use for communication. For example, a system level model of a memory controller can contain three interface methods: a read method, a write method and a burst read method. The actual behavioural implementation of a method is left to the module that provides the interface. The module that uses an interface does this via a port. This way the detailed implementation of an interface can be separated from the object that is using the interface. Using interfaces also makes it simpler to simulate and measure the effect of, for example, burst reading on the performance of a system. This is called transaction level modelling (TLM).

2.3 Events and Dynamic Sensitivity

Events are low-level synchronization mechanisms. They can be used to transfer control from one process to another. The effect can occur immediately, after the next delta cycle or after some defined time. Dynamic sensitivity in SystemC 2.0 means that a process can alter its sensitivity list during run time. A process can wait for any set of events or for a time, which makes, for example, the design and simulation of state machines easy, and errors are reduced since the sensitivity list can be reduced to a minimum in each state.

3. OVERVIEW OF SYSTEMC BASED EXTENSIONS

Since SystemC promotes the openness of the language and the standard, the addition of a new domain can be made to the core language itself. However, a preferred method is to model the basic constructs required for modelling and simulation of reconfigurable hardware (RHW) using basic constructs of the language, therefore preserving the compatibility with
existing tools and designs. For this reason, the extension does not intend to extend the SystemC 2.0 language itself. The terms and concepts specific to the SystemC based approach used in the following sections are defined as follows:
• Candidate Component: Candidate components denote those application functions that are considered to gain benefits from their implementation on a reconfigurable hardware resource. The decision whether a task should be a candidate component is clearly application dependent. The criterion is that the task should have two features in combination: flexibility (that would exclude an ASIC implementation) and high computational complexity (that would exclude a software implementation). Flexibility may come either from the fact that the task will be upgraded in the future, or from sharing hardware resources with other tasks with non-overlapping lifetimes for global area optimization.
• Dynamically reconfigurable fabric (DRCF): The dynamically reconfigurable fabric is a system-level concept that represents a set of candidate components and the required reconfiguration support functions, which later on in the design process can be implemented on a reconfigurable hardware resource.
• DRCF component: The DRCF component is a transaction-level SystemC module of the DRCF. It consists of functions, which mimic the reconfiguration process, and the instances of SystemC modules of the candidate components to present their functionality during system-level simulation. It can automatically detect a reconfiguration request and trigger the reconfiguration process when necessary.
• DRCF template: The DRCF template is an incomplete SystemC module, from which the DRCF component is created.

The SystemC based extensions [3] are highlighted in the modified version of the System-Level Design diagram shown in Figure 5-4. The three focuses are estimation support, the DRCF modelling method and system simulation.
• The estimation approach [4] is based on a prototype tool that can produce the estimates of software execution time on an instruction-set processor (ISP) and the estimates of hardware execution time and resource consumption on an FPGA. The estimates provide information for system partitioning and selection of candidate components. When a full SW/HW/RHW system partitioning is considered, traditional analysis methods and tools are still required.
• The DRCF modelling method [5, 6] focuses on the modelling of the reconfiguration overhead. Modelling the functionality of the candidate components that are mapped onto the reconfigurable resources is not affected by the extension. Different features associated with the reconfiguration technology are not directly modelled. Instead, the model describes the behaviour of the reconfiguration process and relates the performance impact of the reconfiguration process to a set of parameters that are extracted and annotated from the reconfiguration technology. Thus, by tuning the parameters, designers can easily evaluate trade-offs among different technology alternatives and perform fast design space exploration at the system level.
• The system-level simulation is based on the transaction-level SystemC model and uses abstract workload and capacity models of application and architecture for performance evaluation and the study of alternative architectures and mappings.

[Figure 5-4: System-Level Design diagram with ISA and FPGA Technology Models, Estimation, C/C++ Algorithm Specification, System Partitioning (analysis and decomposition), Architecture Template, DRCF Modelling, DRCF Template, System-Level Simulation and Transaction-level SystemC model]
Figure 5-4. SystemC reconfigurability extensions for system-level design.

4. ESTIMATION APPROACH TO SUPPORT SYSTEM ANALYSIS

System analysis is applied in two phases in the SystemC based approach. In the first phase, it focuses on HW/SW partitioning and helps designers to create the initial architecture based on an agreed partitioning decision. The initial architecture sets the starting point from which the SystemC based approach produces the system-level model for the architecture including the DRCF component, that is, a corresponding SystemC model of the dynamically reconfigurable hardware with the modules to be implemented in
it. In the second phase, system analysis focuses on studying the trade-off of performance and flexibility and helps designers to identify candidate components to be implemented in the dynamically reconfigurable hardware. System analysis is performed by designers mainly based on their experience, which may not produce reliable results in all cases, especially if designers have to carry out system analysis from scratch. In this section, an estimation approach to support the work of system analysis is presented.

The estimation approach focuses on a reconfigurable architecture in which there is a RISC processor, an embedded FPGA, and a system bus as a communication channel. It starts from function blocks represented in the C language and produces the following estimates for each function block: software execution time in terms of running the function on the RISC core, mappability of the function and the RISC core, hardware execution time in terms of running the function on the embedded FPGA, and resource utilization of the embedded FPGA. The framework of the estimation approach is shown in Figure 5-5.

[Figure 5-5: function block (C code) passed through SUIF to a CDFG, which feeds the high-level synthesis-based HW estimator and the estimation outputs (HW resource utilization, mappability, speedup) as supporting attributes for system analysis]
Figure 5-5. Estimation framework.

Blocks inside the shaded area are the functions performed by the estimation approach and the data representations used by the estimation approach. Detailed explanations are given in the following sections. Outside the shaded area, the block named “Function block” serves as input to the estimation approach. These function blocks can either be the results of system decomposition, with the granularity decided by designers, or they can be the corresponding SystemC modules from the initial architecture. In the former case, the estimation approach is meant for the first phase of system analysis, which is to help designers to make the trade-off between hardware implementation and software implementation. In the latter
case, the estimation approach is meant for the second phase of system analysis, which is to help designers to evaluate the trade-off between performance and flexibility when comparing a fixed hardware implementation and a dynamically reconfigurable hardware implementation. Estimates of hardware resource utilization of the modules are fed into the SystemC extension as separate parameters.

4.1 Creation of Control/Data Flow Graph from C Code

A control/data flow graph (CDFG) is a combined representation of the data flow graph (DFG), which exposes the data dependence of algorithms, and the control flow graph (CFG), which captures the control relation of DFGs. A C-based function block is used as the starting point and the CDFG is used as the intermediate representation of the estimation approach. The SUIF compiler [7] is used as a front-end tool to analyze the C code, and a purpose-specific code converter is used to transform the SUIF intermediate representation into a CDFG. The main process in the conversion is to find basic blocks, which contain only sequential executions without any jump in between, and to map each of them onto a single DFG and the jump statements between the basic blocks onto the control relation of DFGs. The characteristics of the C functions are studied through profiling, and the profiling data are attributes in the target CDFG.

4.2 High-Level Synthesis-Based Hardware Estimation

A graphical view of the hardware estimation is shown in Figure 5-6. Taking the CDFG with corresponding profiling information and a model of the embedded FPGA as inputs, the hardware estimator carries out a high-level synthesis-based approach to produce the estimates. The main tasks performed in the hardware estimator, as well as in a real high-level synthesis tool, are scheduling and allocation. Scheduling is the process in which each operator is scheduled in a certain control step, which is usually a single clock cycle, or across several control steps if it is a multi-cycle operator. Allocation is the process in which each representative in the CDFG is mapped to a physical unit, e.g. variables to registers, and the interconnection of physical units is established.

The embedded FPGA is viewed as a co-processing unit, which can independently perform a large amount of computation without constant supervision of the RISC processor. The basic construction units of the embedded FPGA are static random access memory (SRAM)-based look-up tables (LUT) and certain types of specialized function units, e.g. a custom-designed multiplier. Routing resources and their capacity are not taken into account. The model of the embedded FPGA is in the form of a mapping table. The index of the table is the type of the function unit, e.g. adder. The value mapped to each index is the hardware resources, in terms of the number of LUTs and the number of specialized units, required for this type of function unit.

[Figure 5-6: CDFG and embedded FPGA model as inputs to ASAP, ALAP, modified FDS and allocation, producing HW execution time and resource utilization]
Figure 5-6. High-level synthesis-based hardware estimation.

As-soon-as-possible (ASAP) scheduling and as-late-as-possible (ALAP) scheduling [8] determine the critical paths of the DFGs, which together with the control relation of the CFGs are used to produce the estimate of hardware execution time. For each operator, the ASAP and ALAP scheduling processes also set the range of clock cycles within which it could be legally scheduled without delaying the critical path. These results are required in the next scheduling process, a modified version of force-directed scheduling (FDS) [9], which intends to reduce the number of function units, registers and buses required by balancing the concurrency of the operations assigned to them without lengthening the total execution time. The modified FDS is used to estimate the hardware resources required for function units.

Finally, allocation is used to estimate the hardware resources required for the interconnection of function units. The work of allocation is divided into 3 parts: register allocation, operation assignment and interconnection binding. In register allocation, each variable is assigned to a certain register. In operation assignment, each operator is assigned to a certain function unit. Both are solved using the weighted-bipartite algorithm, and the common objective is that each assignment should introduce the least number of interconnection units that
will be determined in the last phase, the interconnection binding. In this approach, the multiplexer is the only type of interconnection unit, which eases the work of interconnection binding. The number and type of multiplexers can be easily determined by simply counting the number of different inputs to each register and each function unit.

4.3 Mappability Based Software Estimation

The software estimator produces two estimates: software execution time, and mappability of an architecture-algorithm pair. A profile-directed, operation-counting based static technique is used to estimate software execution time. The architecture of the target processor core is not taken into account in the timing analysis. The main idea of estimating the software execution time is as follows. Firstly, the number of operations of each type is counted from the CDFG. Then, each type of operation node in the CDFG is mapped to one or a set of instructions of the target processor in a pre-defined manner. Then the total number of instructions is calculated from the results of the first two steps simply using multiplication and addition. Finally, with the assumption that these instructions are performed with an ideal pipeline, the software execution time is the product of the total number of instructions and the period of the clock cycle.

Mappability of an architecture-algorithm pair means the degree of matching between the resources provided by the processor architecture and the requirements described by the algorithm [10]. The mappability estimate is calculated via a set of correlation functions, which take into account the instruction set, register structure, bus efficiency, branch effect, pipeline efficiency and parallelism. CAMALA is a prototype tool to study mappability of an architecture-algorithm pair. It takes the CDFG as input and produces an estimate of mappability within the range from 0 to 1. An optimal mapping is an exact mapping with a value of one, and both over-required resources and under-utilized resources are reflected as poor mapping results with values near zero.

4.4 Candidate Component Selection

Candidate component selection is an application-dependent procedure. When global resource saving is an issue, the resource estimates are important inputs. However, to make justified decisions, other information, such as power consumption, should be included as inputs. More importantly, control/data dependence between candidate components should be analyzed. Obviously, there should be control dependence between candidate components that are mapped to different contexts. The current approach does not include automated tools to support the analysis. Other tools and manual analysis are the solutions for now.

5. MODELLING RECONFIGURATION OVERHEAD

The modelling method of the DRCF focuses on how to represent the reconfiguration overhead and how to reveal its performance impact during system simulation. The candidate components that are mapped onto the reconfigurable resources are hardware accelerator tasks. Reconfiguration is required when a called task is not loaded in the reconfigurable resources. The difference in handling incoming messages between tasks mapped to a fixed accelerator and tasks mapped to reconfigurable resources is shown in Figure 5-7.

[Figure 5-7: flow charts; (a) incoming message, then execute functionality; (b) incoming message, then the test "is the access targeted to an active context?": if yes, execute functionality; if no, reconfiguration request, reconfiguration done, then execute functionality]
Figure 5-7. (a) Handling incoming messages as a fixed hardware accelerator. (b) Handling incoming messages as a reconfigurable task.

The idea of the DRCF is to automatically capture the reconfiguration request and trigger the reconfiguration. In addition, a tool to automate the process that replaces candidate components by a DRCF component is developed, so system designers can easily perform the test-and-try and the design space exploration process is easier. In order to let the DRCF component be able to capture and understand incoming messages, the SystemC modules of the candidate components must implement the read(),
write(), get_low_addr() and get_high_addr() interface methods shown in the code below.

    class bus_slv_if : public virtual sc_interface
    {
    public:
        virtual sc_uint<...> get_low_addr() = 0;
        virtual sc_uint<...> get_high_addr() = 0;
        virtual bool read(...) = 0;
        virtual bool write(...) = 0;
    };

The DRCF component implements the same interface methods and conditionally calls the interface methods of target modules. In fact, these interface methods are very common for bus slave modules in transaction-level models.

5.1 Parameterized DRCF Template

The performance impact of using dynamically reconfigurable hardware depends on the underlying reconfigurable technology. Products from different companies, or different product families from the same company, have very different characteristics, e.g. size of reconfigurable logic and granularity of reconfigurable logic. Different features associated with the reconfigurable technology are not directly modelled in the DRCF component. Instead, the DRCF component contains the functions that describe the behaviour of the reconfiguration process and relate the performance impact of the reconfiguration process to a set of parameters. Thus, by tuning the parameters, designers can easily evaluate the trade-offs between different technologies without going into implementation details. In the SystemC extension, a parameterized DRCF template is used. At the moment, the following parameters are available for designers:
• The memory address where the context is allocated in the extra DRCF memory.
• The length of the required memory space, which represents the size of the context.
• Delays associated with the reconfiguration process in addition to delays of memory transfers.
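As a rough illustration of how the three parameters above could combine into a reconfiguration latency, the following plain C++ sketch adds a memory-transfer term to the extra delay. The names ContextParams, MemoryModel and reconfiguration_cycles, and the fixed cost per word, are assumptions made for this example only; in the DRCF model the transfer delay results from simulated memory traffic on the bus, not from a closed formula.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical parameter set mirroring the three template parameters.
struct ContextParams {
    std::uint32_t base_addr;           // where the context starts in the DRCF memory
    std::uint32_t length_words;        // size of the context, in memory words
    std::uint32_t extra_delay_cycles;  // delays in addition to the memory transfers
};

// Assumed memory model: every word costs a fixed number of cycles to fetch.
struct MemoryModel {
    std::uint32_t cycles_per_word;
};

// Estimated reconfiguration latency in clock cycles (simplified, no bus contention).
std::uint64_t reconfiguration_cycles(const ContextParams& ctx, const MemoryModel& mem) {
    assert(mem.cycles_per_word > 0);
    const std::uint64_t transfer =
        static_cast<std::uint64_t>(ctx.length_words) * mem.cycles_per_word;
    return transfer + ctx.extra_delay_cycles;
}

// Same estimate expressed in nanoseconds for a given clock period.
double reconfiguration_time_ns(const ContextParams& ctx, const MemoryModel& mem,
                               double clock_period_ns) {
    return static_cast<double>(reconfiguration_cycles(ctx, mem)) * clock_period_ns;
}
```

Changing length_words or cycles_per_word in such a sketch corresponds to comparing technologies with different context sizes or configuration memory speeds.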
5.2 DRCF Component and RSoC Model

A general model of a reconfigurable system-on-chip (RSoC) is shown in Figure 5-8. The left-hand side depicts the architecture of the RSoC. The right-hand side shows the internal structure of the DRCF component. The DRCF component is a single hierarchical SystemC module, which implements the same bus interfaces in the same way as other HW/SW modules. A configuration memory is modelled, which could be an on-chip or off-chip memory that holds the configuration data. Each candidate component (F1 to Fn) is an individual SystemC module, which implements the top-level bus interfaces with a separate system address space, and is instantiated inside the DRCF component. Each candidate component has two extra ports. One is a DONE signal port routed to the Configuration Scheduler (CS). The port is used to notify the CS that this task can be safely swapped out. The other is connected to a shared memory that saves the data to be preserved during reconfiguration. The Input Splitter (IS) is an address decoder and it manages all incoming Interface-Method-Calls (IMCs). The CS monitors the operation states of the candidate components and controls the reconfiguration process.

Figure 5-8. System-level Modelling of Reconfigurable SoC.

The DRCF component works as follows. When the IS captures an IMC to a candidate component, it will hold the IMC and pass the control to the CS, which decides if reconfiguration is needed. If so, the CS will call a reconfiguration procedure that uses the parameters specified in step 1 to generate memory traffic and associated delays to mimic the reconfiguration
latency. After the CS finishes the reconfiguration loading, the IS will dispatch the IMC to the target module. If the module cannot be activated at the moment, a request to reconfigure the target module will be put into a FIFO queue and the IMC will return with the value FALSE. When a module finishes its operation, it will send a DONE signal to the CS, and the CS will check if there is any waiting message in the FIFO queue. If so, and if it is possible to activate the waiting module, the CS will call the reconfiguration procedure.

State Definitions:
NOT LOADED: module is only in the configuration memory
LOADING: module is being loaded
WAIT: module is waiting in a FIFO queue to be loaded
RUNNING: module is running
NOT RUNNING: module is loaded, but not running

State Transition Conditions:
1. IMC to the module occurs & not enough resources
2. IMC to the module occurs & enough resources
3. CS finishes the loading
4. Other modules finish & enough resources
5. IMC to the module occurs
6. Module finishes
7. CS flushes the module

Figure 5-9. Reconfiguration state diagram.

Context switching with pre-emption is a common approach in operating systems, where it is easy to implement due to the regularity of the register organization. In the DRCF component, the pre-emption of a running module is not supported, since it would require a very costly implementation of the hardware module in order to store the internal registers and states of the module. The modelling method is intended for non-blocking IMCs. The method supports the use of blocking IMCs, but the system bus will be blocked when a called candidate component is not loaded and unblocked when the reconfiguration is done. The reason is to maintain synchronization between the SW initiators and the candidate components.
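The state definitions and transition conditions listed for Figure 5-9 can be sketched as a small transition function. The enumerator names are hypothetical, and the edges (which state each condition leaves and enters) are an assumption inferred from the surrounding description, since the arrows of the original diagram are not reproduced here. Events that do not apply in a given state are ignored, which also reflects that a running module is never pre-empted.

```cpp
#include <cassert>

// States as defined for Figure 5-9.
enum class State { NotLoaded, Wait, Loading, Running, NotRunning };

// Transition conditions 1-7 as listed for Figure 5-9.
enum class Cond {
    ImcNoResources = 1,      // 1. IMC to the module occurs & not enough resources
    ImcEnoughResources,      // 2. IMC to the module occurs & enough resources
    LoadingFinished,         // 3. CS finishes the loading
    OthersFinishResources,   // 4. Other modules finish & enough resources
    Imc,                     // 5. IMC to the module occurs
    ModuleFinishes,          // 6. Module finishes
    Flushed                  // 7. CS flushes the module
};

// ASSUMED edges: the source/target state of each condition is inferred from
// the text, not copied from the original diagram.
State next_state(State s, Cond c) {
    switch (s) {
    case State::NotLoaded:
        if (c == Cond::ImcNoResources) return State::Wait;
        if (c == Cond::ImcEnoughResources) return State::Loading;
        break;
    case State::Wait:
        if (c == Cond::OthersFinishResources) return State::Loading;
        break;
    case State::Loading:
        if (c == Cond::LoadingFinished) return State::Running;
        break;
    case State::Running:
        if (c == Cond::ModuleFinishes) return State::NotRunning;
        break;
    case State::NotRunning:
        if (c == Cond::Imc) return State::Running;
        if (c == Cond::Flushed) return State::NotLoaded;
        break;
    }
    return s;  // condition not applicable in this state: stay where we are
}
```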
While this is a generic description of the context switching process, designers can use different CS models when candidate components are mapped to different types of reconfigurable devices, such as partially reconfigurable and single-context devices. The auto-transformer, which is presented in the following sections, uses a context switching mechanism for single-context devices. There is a state diagram common to each of the candidate components. Based on the state information, the CS makes reconfiguration decisions for all incoming IMCs and DONE signals. A state diagram of partial reconfiguration is presented in Figure 5-9. For single-context and multi-context reconfigurable resources, similar state diagrams can be used in the model. The main advantage of the modelling method is that the rest of the system and the candidate components need not be changed between a static approach and run-time reconfiguration approaches, which makes this method very useful for fast design space exploration.

5.3 Automatic Transformer for SystemC Based Extensions

The DRCF transformer is a tool that can automatically transform the SystemC code of a static system into the SystemC code of a reconfigurable system. It takes two inputs. One is the SystemC models of the initial architecture, and the other is a script file that specifies which modules should be moved into the DRCF component and all the other relevant information, e.g. parameters for the DRCF template. Outputs of the program are the modified architecture as well as SystemC models of the DRCF component and the memory associated with it. A Makefile for compilation is an optional output. A UML diagram of the static structure of the DRCF transformer is shown in Figure 5-10. For the sake of brevity, operations and attributes of classes are omitted from the diagram. The transformer uses Opencxx [11] as the basic C++ parser to analyse the SystemC code. The ClassHandle and SystemCClassHandle classes manage the analysed information. A lex & bison-based parser is developed to read the user script file and the results are stored using class DRCFReqInfo. The class DRCFTemplateHandle is responsible for generating the SystemC models of the DRCF component and the memory block associated with it. Finally, DRCF_driver is the kernel that controls the process of transformation. The flow of the transformation is shown in Figure 5-11. In the first phase, each module that is a candidate component to be implemented in reconfigurable hardware is analyzed. The used bus interface and the bus ports are analyzed so that the DRCF component can implement the same interfaces and ports. After the modules are analyzed, the transformer moves on to analyze each instance of the modules in the architecture. Firstly, the declaration of each instance is located, and then the constructors are located and copied to a temporary database. When all instances are analyzed, the DRCF component is created from a DRCF template.

Figure 5-10. Software specification of DRCF transformer in UML.

The ports and interfaces analyzed in the first phase are inserted into the DRCF template, and then the components to be implemented in dynamically reconfigurable hardware are instantiated according to the declarations and constructors located in the second phase. The DRCF template contains a context scheduler to mimic the context switching process, an input splitter that routes data transfers to correct instances, and instrumentation processes.
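The routing done by the input splitter amounts to an address decode over the address ranges of the instantiated modules. A minimal sketch in plain C++ follows; SlaveRange and decode are hypothetical names, the ranges stand in for the values that get_low_addr() and get_high_addr() would return, and treating the high address as inclusive is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the address range of one candidate component.
struct SlaveRange {
    int module_id;            // index of the module inside the DRCF component
    std::uint32_t low_addr;   // value get_low_addr() would return
    std::uint32_t high_addr;  // value get_high_addr() would return (assumed inclusive)
};

// Returns the id of the module whose range contains addr, or -1 if none does.
int decode(const std::vector<SlaveRange>& slaves, std::uint32_t addr) {
    for (const SlaveRange& s : slaves) {
        if (addr >= s.low_addr && addr <= s.high_addr) {
            return s.module_id;
        }
    }
    return -1;
}
```

In the DRCF component the decoded target would then be handed to the context scheduler, which decides whether the call can be dispatched immediately or a reconfiguration is needed first.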
Figure 5-11. Transformation flow.

During simulation, data related to reconfiguration latency will be automatically captured by the DRCF component and saved in a text file for analysis. A VCD (Value Change Dump) file will also be produced by the DRCF component, so the configuration effect can be analysed with standard waveform viewers that can read VCD files.

5.3.1 Example of the Transformation Process

A simple example of what will be done to the SystemC modules is shown next. The initial static system includes three hardware accelerators, hwacc1, hwacc2, and hwacc3. There is no direct control dependence among the three modules. The estimation results show that hwacc2 and hwacc3 consume about the same amount of resources as hwacc1. The decision is to assign hwacc1 to one context, and the other two to a second context. The code fragment below is a part of the hardware accelerator hwacc1, which is modelled using SystemC.

    class hwacc1 : public sc_module, public bus_slv_if
    {
    public:
        sc_in_clk clk;
        sc_port<...> mst_port;
        ...
    };

In the first phase of operation, the ports and interfaces of the module are analyzed. In this case, the module implements one interface, bus_slv_if, which is the slave interface of a bus; the module has two ports, clk and mst_port, which represent the clock input and the master interface of a bus. Next, the top-level module is analyzed to understand the structure of the system. The code below shows the instantiation of the modules in a hierarchical module named top.
    SC_MODULE(top)
    {
        sc_in_clk clk;
        hwacc1 *hwa;
        hwacc2 *hwb;
        hwacc3 *hwc;
        bus *system_bus;

        SC_CTOR(top)
        {
            system_bus = new bus("bus");
            system_bus->clk(clk);

            /* signal bindings for hwacc1 */
            hwa = new hwacc1("HWA", HWA_START, HWA_END);
            hwa->clk(clk);
            hwa->mst_port(*system_bus);
            system_bus->slv_port(*hwa);

            /* signal bindings for hwacc2 */
            hwb = new hwacc2("HWB", HWB_START, HWB_END);
            hwb->clk(clk);
            hwb->mst_port(*system_bus);
            system_bus->slv_port(*hwb);

            /* signal bindings for hwacc3 */
            hwc = new hwacc3("HWC", HWC_START, HWC_END);
            hwc->clk(clk);
            hwc->mst_port(*system_bus);
            system_bus->slv_port(*hwc);
        }
    };

After the analysis of the top-level module, the declarations, constructors, port bindings and interface bindings of the modules hwacc1, hwacc2, and hwacc3 are removed. This hierarchical module is then updated to use the DRCF component instead of the hardware accelerators. The modified code is shown below. Notice that the declaration, the constructor and the bindings are modified for a new instance of drcf.

    SC_MODULE(top)
    {
        sc_in_clk clk;
        drcf *drcf_inst_1;
        bus *system_bus;

        SC_CTOR(top)
        {
            system_bus = new bus("bus");
            system_bus->clk(clk);
            drcf_inst_1 = new drcf("DRCF1");
            drcf_inst_1->clk(clk);
            drcf_inst_1->mst_port(*system_bus);
            system_bus->slv_port(*drcf_inst_1);
        }
    };

The actual DRCF component created from the DRCF template is shown in the code below. In the code, the declarations, constructors and interface bindings of the hardware accelerators are copied from the original top-level module. The port bindings are automatically modified. The code that refers to the instances hwa, hwb and hwc was dynamically created from the information saved for the instances of the modules hwacc1, hwacc2, and hwacc3. What was already in the template is the arb_and_instr() method that handles the context scheduling and instrumentation. The instrumentation is a SystemC process that keeps track of the configuration status.

    class drcf : public sc_module, public bus_slv_if
    {
    public:
        sc_in_clk clk;
        sc_port<...> mst_port;
        hwacc1 *hwa;
        hwacc2 *hwb;
        hwacc3 *hwc;

        SC_HAS_PROCESS(drcf);
        void arb_and_instr();
        sc_uint<...> get_low_addr();
        sc_uint<...> get_high_addr();
        bool read(...);
        bool write(...);

        SC_CTOR(drcf)
        {
            SC_THREAD(arb_and_instr);
            sensitive_pos << clk;

            /* signal bindings for hwacc1 */
            hwa = new hwacc1("HWA", HWA_START, HWA_END);
            hwa->clk(clk);
            hwa->mst_port(*mst_port);

            /* signal bindings for hwacc2 */
            hwb = new hwacc2("HWB", HWB_START, HWB_END);
            hwb->clk(clk);
            hwb->mst_port(*mst_port);

            /* signal bindings for hwacc3 */
            hwc = new hwacc3("HWC", HWC_START, HWC_END);
            hwc->clk(clk);
            hwc->mst_port(*mst_port);

            ContextInfo *cont0 = new ContextInfo(...);
            cont0->insert_module(hwa);
            ContextInfo *cont1 = new ContextInfo(...);
            cont1->insert_module(hwb);
            cont1->insert_module(hwc);
            contexts.push_back(cont0);
            contexts.push_back(cont1);
        }
    };

6. USING WORKLOAD MODELS FOR DESIGN SPACE EXPLORATION

As reconfigurability adds a new dimension to the design space, a reliable method of analyzing the performance of the resulting system is needed. An architecture that fits the application at hand well avoids many design problems in the later detailed design stages, but it is difficult to find the bottlenecks in the architecture early enough. Traditionally, models of a system or sub-system start with a purely behavioural description which contains only the functionality to be performed. Then, the models are gradually refined towards a certain type of implementation, and concrete information is inserted into the models in each refinement. However, when designers have an initial architecture in mind at the beginning of design, its performance simulation cannot be performed until each model is iteratively refined to a level of abstraction that contains enough low-level information from the architecture point of view. The traditional modelling method not only delays the performance simulation of the architecture, but also makes it difficult to explore the design space for alternative architectures. Using SystemC as a system modelling language provides the opportunity to perform architecture-space exploration in the early phase of design. This is achieved using SystemC transaction-level workload operation models. The workload model separates the computation and communication. At the transaction level, the load of computation is represented using timing information, either cycle accurate or not, and the load of communication is represented using combined factors, such as type of transaction, bandwidth of bus, latency of accessing memory, behaviour of bus arbiter and so on. The timing information of computation could be an estimate from a supporting tool, such as the estimation approach introduced in the preceding section, or from designers’ experience. The factors related to the communication are
architecture-dependent and could be set as parameters. Thus, by tuning these parameters in performance simulation, the best architecture in terms of a certain performance aspect can be easily found. The following code is a simple example that models the computation latency of a hardware accelerator. The macro DELAY_CYCLES can be given a different value when the task is mapped to a different kind of processing unit.

    void accelerator::do_process()
    {
        remain_im = accu_im & mask_coeff;
        /* delay DELAY_CYCLES cycles until write */
        for (delay = 0; delay < ...
        ... >= 64 && accu_re > 0) {
            data_out_short_real = accu_re / 128 + 1;
        }
    }

The following code presents transaction-level communication from a master to a memory block through a system bus. The delay associated with the bus arbitration process is irrelevant to the model of the master block.

    void accelerator::do_process()
    {
        bus_port->read(mem_in_addr, &data_in);
        unpack(data_in, data_in_real, data_in_imag);
        rot_re[i1] = data_in_real.to_int();
        rot_im[i1] = data_in_imag.to_int();
    }

At the Architecture Definition phase, the best architecture has to be searched for using iterative design and modelling. System-level performance simulations can be performed by building workload models of the application in order to simulate them on candidate architectures. The developed SystemC architecture and workload modelling and simulation approach is depicted in Figure 5-12. It operates at the transaction level of abstraction. The simulation results are estimates of the computational complexity of each block, estimates of communication and data storage requirements, and characteristics of the architecture and the mapped workload. The workload model at transaction level contains information about how long each processing stage takes and how it communicates with other processes. The communication can be modelled using SystemC resources such as ports, interfaces and channels. The initial architecture is derived from the application analysis results. To get the full benefit of this modelling
scheme, the utilization of each resource of the architecture will be measured in terms of e.g. idle cycles, data waiting cycles and operation cycles.

Figure 5-12. Principle of transaction-level performance simulation.

Workload models are used to generate load on the architecture by mapping the functional operations to processing elements. Communication and synchronization of processes and processing elements have been implemented using part of the memory as communication registers. The workload and architecture models can be refined during simulations. Some system characteristics and load effects can easily be adjusted by modifying parameters such as clock frequencies, bus widths, latencies of memory operations, and speed-up factors that can be used to model various candidate implementations of parallelism or the speed-up of hardware accelerator implementations. These refinements are continued until resource utilization rates that are acceptable for the application are reached.

7. CONCLUSIONS

The extensions to SystemC for supporting the design of SoCs incorporating reconfigurable parts are described in this chapter. The extensions are based on standard features of SystemC 2.0. SystemC encapsulates C/C++ descriptions of algorithms into an implementation-neutral system model by exploiting either standard or user-defined communication mechanisms, e.g. different types of channels. One extension is the methods and prototype tool support for the estimation of software execution time on an ISP and hardware execution time and resource consumption on an FPGA, which provides information for system partitioning and selection of candidate components for
reconfigurable design. Traditional analysis methods and tools are required in a full SW/HW system partitioning. Another extension is the DRCF modelling method that can automatically detect the reconfiguration request and model the reconfiguration overhead. This technique allows for fast design space exploration, since explored modules can be easily switched between fixed and reconfigurable modules. A prototype transformation tool is provided to help generate the DRCF SystemC model. The reconfiguration latency is derived from a few parameters, which can be adjusted by designers in the design space exploration step. The system-level simulation is based on the transaction-level SystemC model and uses abstract workload and capacity models of application and architecture for performance evaluation and for studying alternative architectures and mappings. The main benefit of the extended SystemC based approach is that it enables modelling and performance evaluation of a system containing reconfigurable parts already at the system level, before devoting effort to the detailed design and implementation.

REFERENCES

1. S. Swan (2001) An Introduction to System Level Modeling in SystemC 2.0. http://www.systemc.org
2. SystemC (2002) Functional Specification for SystemC 2.0, Update for SystemC 2.0.1, version 2.0-Q. April 5. http://www.systemc.org/
3. K. Tiensyrjä, M. Cupak, K. Masselos, M. Pettissalo, K. Potamianos, Y. Qu, L. Rynders, G. Vanmeerbeeck and Y. Zhang (2004) SystemC and OCAPI-XL Based System-Level Design for Reconfigurable Systems-on-Chip. Forum on Specification & Design Languages (FDL 2004). 14-17 September 2004. ECSI, Grenoble, France, pp. 428-439
4. Y. Qu and J.-P. Soininen (2003) Estimating the utilization of embedded FPGA co-processor. Euromicro Symposium on Digital Systems Design, 2003 (DSD 2003), pp. 214-221
5. A. Pelkonen, K. Masselos and M. Cupak (2003) System-level modeling of dynamically reconfigurable hardware with SystemC. The 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), pp. 174-181
6. Y. Qu, K. Tiensyrjä and K. Masselos (2004) System-level modeling of dynamically reconfigurable co-processors. Proceedings of the 14th International Conference on FPL (LNCS 3203), pp. 881-885
7. R.P. Wilson, R.S. French, C.S. Wilson, S.P. Amarasinghe, J.M. Anderson, S.W.K. Tjiang, S.W. Liao, C.W. Tseng, M.W. Hall, M.S. Lam and J.L. Hennessy (1994) SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers. Proceedings of the 7th ACM SIGPLAN symposium on Principles and practice of parallel programming, pp. 37-48
8. D.D. Gajski, N. Dutt, A. Wu and S. Lin (1997) High-level synthesis: Introduction to chip and system design. Kluwer Academic Publishers, Boston
9. P.G. Paulin and J.P. Knight (1989) Force-Directed Scheduling for the Behavioral Synthesis of ASICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 6:661-679
10. J.-P. Soininen, J. Kreku, Y. Qu and M. Forsell (2002) Mappability estimation approach for processor architecture evaluation. Proceedings of the 20th IEEE Norchip Conference (NORCHIP 2002), pp. 171-176
11. S. Chiba (1998) OpenC++ Tutorial. http://opencxx.sourceforge.net

Chapter 6

OCAPI-XL BASED APPROACH

Miroslav Čupák and Luc Rijnders
IMEC, Kapeldreef 75, B-3001 Leuven, Belgium

Abstract: This chapter describes the OCAPI-XL based modelling techniques and tools that support the design of reconfigurable systems-on-chip (SoC). To allow modeling of reconfigurability features at system level, we developed: 1) a new software process type in OCAPI-XL, 2) coupling of OCAPI-XL to SystemC for co-simulation, and 3) context switching from one resource to another (software, reconfigurable hardware).

Keywords: Configuration overhead; context switching; design space exploration; dynamic reconfiguration; estimation; mapping; partitioning; reconfigurable; reconfigurability; SystemC; OCAPI-XL; system-on-chip.

1. INTRODUCTION

Heterogeneous HW/SW systems on a chip (SoC) present one of the vital challenges for design methodologies of today. OCAPI-XL (OXL) is a C++ based design environment for development of concurrent, heterogeneous HW/SW applications. It abstracts away the heterogeneity of the underlying platform through an intermediate-language layer that provides a unified view on SW and HW components. The language is directly embedded in C++ via a creatively designed set of classes and overloaded operators [1], and has an abstraction level between assembler and C. OXL’s design-flow, as depicted in Figure 6-1, starts at high (typically C/C++) level
and goes all the way down to the implementation in a sequence of incremental steps.

N.S. Voros and K. Masselos (eds.), System Level Design of Reconfigurable Systems-on-Chips, 133-151. © 2005 Springer. Printed in the Netherlands.

Figure 6-1. OCAPI-XL design-flow: [a] single-threaded C++ specification; [b] parallelisation; [c] modelling of architecture constraints, and [d] refinement to OCAPI-XL embedded language.

The OXL design flow can be divided as follows:
1. Identification of hot spots, i.e. heavily used parts of code, where parallelisation would be beneficial (Figure 6-1[a]). This can be done using C/C++ tools, such as quantify or gprof.
2. Partitioning of the single-threaded C/C++ code into parallel tasks using OXL’s concurrency and communication primitives (Figure 6-1[b]) based on the analysis from the previous step. The main goal is to get parallel C++/OXL code, functionally equivalent to the single-threaded original.
3. Mapping of the functional model from step 2 onto the architecture described via a set of constraints, like number of SW-processors, relative HW-clock speed, or communication resource sharing (Figure 6-1[c]). This step adheres to the Y-chart modelling approach [2, 3] (i.e. strict separation of functionality and architecture).
4. Complete refinement of selected processes to the OXL embedded language (Figure 6-1[d]), which can then be used in HW, as well as in SW scenarios.

At stages 2 to 4, OXL provides the designer with simulation results as well as quantitative figures of system throughput, activity, performance etc., that is, it supplies important feedback directing new refinement steps. Additionally, while it allows the designer to stay within the
same C++ based framework during the whole design process, it also provides hooks for coupling other simulation engines or environments according to the designers' needs.

2. THREADED PROCESS OCAPI-XL’S EXTENSION

The OXL scheme of embedding a language within C++ removes the need for a completely new language framework with supporting tools and environment, since one can reuse most existing C++ tools. On the other hand, it also creates two genuine problems: how to mix the new language with native C/C++ code and how to translate existing C/C++ code into the new language in an incremental way. These problems were addressed in OXL; the solution was, however, not general enough and failed for a significant category of applications, as will be shown later. This deficiency is addressed by the presented threaded-process extension. It closes the gap in the OXL design-flow, and provides a conceptually sound, generic link between the high-level C/C++ code and the OXL Embedded Language (OXL-EL), also in the cases that were not handled properly by the existing techniques. The threaded-process extension will be presented as follows: first, we will show the existing technique for integration of C/C++ and OXL code, pinpoint its weakness and indicate the proposed solution. Afterwards, we will describe the implementation of the co-simulation library.

2.1 OXL and C/C++ Code Integration

The distinguishing feature of OXL is the language providing a unified semantic model for HW and SW. The OXL language is embedded in C++, which allows easy integration of existing C++ code into OXL. Unfortunately, it also makes the boundary between C++ and OXL code somewhat blurred to the designer, which can be quite dangerous, considering that two semantically different languages co-exist. In this section, we will first outline the basic idea of the OXL language implementation and its interaction with C++; secondly, we will introduce the original technique for integration of C++ and OXL code, the Foreign-Language Interface (FLI), and pinpoint its deficiency; and finally we will introduce the idea of a threaded process as a way to deal with the FLI deficiencies.
2.1.1 OXL Embedded Language

The notion of class, as a new type semantically equivalent to the predefined ones, is one of the central ideas of C++. It allows C++ to be used as a meta-language, where classes represent types of the new language and the class code defines the semantics of the new language. Operator overloading adds a bit of syntactic sugar, making such a language closer to C syntax and easier to read or write. This embedded-language approach is used in OXL to implement the unified HW/SW language in the way sketched in Figure 6-2. The so-called operator classes like addop represent the constructs of the OXL-EL, and their member functions (like sim()) implement their semantic meaning. These classes do not directly execute their code, but rather create a runtime structure called the heap, a concept similar to byte-code, which is later interpreted during simulation or code-generation.

    /* … Int is OXL type … */
    Int a, b, c;
    /* … OXL code … */
    label L;
    a = b + c;
    jump_if(a, L);
    /* … simulation invocation … */
    run(100);   // -- just invokes the run()
                // -- functions for operators on heap

Figure 6-2. C++ representation of the OXL language (OXL code, shown above, and its C++ representation as heap and operators: heap creation followed by heap interpretation).

Thus, C++ and OXL code cannot be combined without any restriction. The original method of combining C++ and OXL code was the FLI mechanism.

2.1.2 Foreign-Language Interface Mechanism

The FLI mechanism is the original interface for integration of C++ code into OXL-EL. Conceptually, an FLI is just another operator class of the OXL-EL, but without its functionality fixed. Rather, it can be specified by the designer via overriding of the virtual run() function of the fli class. Consequently, an FLI object is seen from the OXL simulation engine as a single, atomic instruction into which inputs are passed, and from which outputs are
read (Figure 6-3). There is no limitation on the amount of code within the run() function. One can read a file or a socket, write to a terminal, communicate with another system process, etc. The possibilities are almost without limits [4]. The FLI mechanism has an important limitation: for OXL, the code within the fli::run() function must be a single, atomic instruction. This limitation prohibits usage of other OXL instructions (e.g., message read or semaphore post) in the code. We found it acceptable for data-dominant applications, where large parts of code behave like an atomic instruction so that they can easily be split into a few FLI classes. These FLI objects can then be manipulated by the designer at will, e.g., grouped into processes, annotated with timing info, mixed with refined code, etc.

    class MyFli : public fli
    {
        void run() { b = a + 1; }
    };

    /* … OXL code … */
    MyFli mf;
    Int a, b;
    call(mf, fliIn(a), fliOut(b));

Figure 6-3. OXL representation and simulation of an FLI object (OXL code with an FLI call, shown above, and its C++ representation: the fliop imports a, invokes mf.run(), and exports b).

In the case of control-oriented applications, where the potential atomically executable parts of code are much smaller, the FLI technique becomes quite controversial, since it typically results in either a few big FLI objects, or many small ones. In the first case, no OXL primitives can be used inside the run() method, so all OXL-EL features are unavailable to the designer. In the second case, the code atomisation requires a lot of tedious, error-prone work leading to refined code that is hard to read and maintain. To deal with such cases, an alternative mechanism for the combination of unrefined C++ and OXL-EL was needed.

2.1.3 Threaded Processes Extension: Idea

OXL-EL has its own way to deal with concurrency via the heap structure and operator classes. Of course, this is not sufficient to implement processes with a native C/C++ thread of control. In order to introduce such processes into OXL, we must devise a way of extending the OXL kernel with some thread-library primitives, since only such primitives can provide the support for handling arbitrary C/C++ code concurrently. Additionally, the extension must preserve the OXL kernel’s complete control of the simulation to ensure its
correctness, regardless of whether the thread library is pre-emptive or not. The ultimate goal of such an extension is to provide a conceptual way for the combination of plain C/C++ code and OXL concurrency primitives without the need to refine every line of code into OXL-EL. It must allow one to use the full set of OXL communication and synchronization primitives inside the threaded process so that the high-level modelling features of OXL can be used. Its basic idea is demonstrated in Figure 6-4, where the original C++ code is split into two OXL threaded processes, communicating via OXL communication primitives wrapped in C functions. The splitting of C++ code in the case of threaded processes needs only be driven by parallelization requirements, and not by the code-atomicity demand of the FLIs.

Figure 6-4. Threaded processes from user’s point of view.

2.2 Thread Process Extension: Implementation Requirements

There are, in principle, three essential requirements for the implementation of the threaded-process extension:
1. Ease of use: ideally, an FLI-like API (i.e. requiring to derive from a class and override a virtual function) should be provided.
2. Backward compatibility: the extension should not influence any existing OXL code. Ideally the extension should be implemented as a plug-in.
3. No restriction on the C/C++ code within the threaded processes (unless imposed directly by the underlying thread library).

The next sections describe how the SystemC library can be used for implementation of the threaded process extension compliant with the above-mentioned requirements.
3. SYSTEMC IMPLEMENTATION OF OCAPI-XL THREADED PROCESS EXTENSION

SystemC (SC), in principle, provides an implementation of a thread library bundled together with an event-driven simulation engine with a notion of virtual time. Thus SC can be used for implementation of the threaded-process extension with the additional bonus of automatically having an OXL/SC co-simulation environment. Such an environment brings together the advantages of OXL (architecture-modelling features, or OXL-EL), and SC (de-facto-standard modelling environment with significant tool support). There is one additional requirement for the implementation (on top of the three presented in the previous section): the underlying code should only use the standardized API of SC and avoid any implementation-specific features (like the scheduler call-back function present in the OSCI reference implementation) to make the resulting OXL/SC environment portable. The basic structure of the OXL/SC co-simulation environment is shown in Figure 6-5.

Figure 6-5. OXL/SC co-simulation environment structure.

It consists of three domains within the SC environment:
1. The native SC processes controlled only by the SC simulation kernel.
2. The OXL domain running in a single SC thread and controlled by the OXL kernel acting as a slave to the SC simulation kernel.
3. XL/SC processes (i.e. equivalents of the previously described threaded processes) running C++/SC code with access to OXL’s communication and synchronization primitives. These processes contain internally two parts: the SC part running the C++ or SC code, and a small OXL proxy process, which is active when the process is executing an OXL synchronization/communication primitive.
3.1 OXL Environment within SC

The essential idea of the common OXL/SC environment is to let the whole OXL part run in a single SC thread and to use SC synchronization mechanisms inside a modified OXL simulation kernel to synchronize with the rest of the (SC) world. Since the OXL kernel is single-threaded (the concurrency behaviour is implemented via the heap and operator classes), this setup creates no thread compatibility problems.

3.1.1 XL/SC Process Implementation

The implementation of the threaded SC/XL process is the crucial part of the SC/XL link. Internally, it consists of two co-operating entities: an OXL proxy process and an SC thread. The underlying idea is rather straightforward: the process is executing either C++/SC code in the SC domain, or an OXL concurrency primitive in the OXL domain. In the first case, it is controlled by the SC simulation kernel and is of no interest to the OXL simulation kernel except for its local time (to avoid the possibility of an anti-causal simulation when the OXL scheduler advances its time). In the second case, it is inside the OXL environment, executing a communication/synchronization statement requested from the C++/SC code, and for that period of time the corresponding SC thread is simply blocked inside the SC kernel. The complete scheme is shown in Figure 6-6.

[Figure 6-6: [a] XL/SC proxy process: 1. resume SC, 2. wait for SC, 3. XL dyn-op, then again/done; [b] SC thread: C++/SC code; on an XL synchronization or communication primitive: set requested XL dyn-op, restart XL, suspend SC]
Figure 6-6. XL/SC process implementation via [a] OXL proxy process and [b] SC thread.

The interaction between the OXL proxy and SC-thread parts deserves a closer look. The proxy process uses three new operator classes introduced in this co-simulation library:
1. The first one is capable of restarting the corresponding SC thread (Figure 6-6 [a1]).
2. The second one waits for the response from the SC-thread, keeping the proxy process blocked inside the OXL domain (Figure 6-6 [a2]).
3. The third one, whose contents can be dynamically changed according to the operation requested from the SC-thread (Figure 6-6 [a3]), performs the currently set operation inside the OXL domain.
With these new operators, it is possible to define the OXL proxy process interacting with the SC thread as shown in Figure 6-6 [a].
Seen from the SC side, the OXL proxy process/SC-thread interaction goes through the following sequence of steps:
1. The SC-thread starts in a suspended state and waits until it is resumed from the OXL proxy process.
2. After resumption, the SC-thread runs until it ends, or until a function invoking an OXL communication/synchronization primitive is called. In that case, it updates the dynamic operator in the proxy process with information about the requested XL primitive operation, restarts the proxy process and suspends itself until awakened again from the OXL proxy process.
As a side note: blocking and resumption of an SC thread can be achieved via the simple dynamic event waiting/notification scheme available in SC 2.0 [5].
Finally, the presented scheme also requires some changes to the OXL scheduler, since it has to take XL/SC processes into account.

3.1.2 OXL Slave Scheduler

The OXL scheduler must be slightly modified so that it takes into account the fact that the XL/SC processes can be outside of its control at certain moments during the simulation. To account for that, the event-dispatch loop of the OXL scheduler is only allowed to advance in time (i.e., to dispatch an event with a higher time-stamp than that of the last dispatched event) if either no processes are in the SC domain, or their local time is at least equal to the time of the to-be-dispatched OXL event. Otherwise an anti-causal simulation may happen: some of the XL/SC processes would return into the OXL domain with an older time-stamp than that of the event dispatcher. Pseudo-code of the modified event-dispatch loop is shown in Figure 6-7. The new scheduler must also hold some additional bookkeeping data about the currently running XL/SC processes.
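The causality condition stated above can be sketched in isolation. The following is a minimal C++ illustration and not OCAPI-XL code: the names `ProcInSC` and `may_dispatch`, and the use of integer time-stamps, are assumptions made for this example only.

```cpp
#include <cassert>
#include <vector>

// Hypothetical time-stamp type for this sketch (not an OCAPI-XL type).
using SimTime = unsigned long long;

// Hypothetical bookkeeping record for one XL/SC process that is
// currently executing in the SC domain.
struct ProcInSC {
    SimTime local_time;  // local time reached by the process in the SC domain
};

// Returns true if the OXL scheduler may dispatch an event stamped
// 'event_time', given the time-stamp of the last dispatched event and
// the processes currently outside OXL control.
bool may_dispatch(SimTime event_time,
                  SimTime last_dispatched_time,
                  const std::vector<ProcInSC>& procs_in_sc) {
    // No advance in time: dispatching cannot become anti-causal.
    if (event_time <= last_dispatched_time) return true;
    // Advancing in time: every process in the SC domain must already
    // have reached at least the time of the event to be dispatched.
    for (const ProcInSC& p : procs_in_sc) {
        if (p.local_time < event_time) return false;
    }
    return true;  // also covers the case of no processes in the SC domain
}
```

With two processes in the SC domain at local times 5 and 12, an OXL event stamped 10 has to wait: the first process could still return to the OXL domain with a time-stamp older than 10.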
    void XLScheduler::ev_loop()
    {
      while (true) {
        if (event_queue_XL.empty()) {
          if (procs_SC.empty())
            break;              //-- event queue empty and no procs in SC
          else
            wait(wake_up);      //-- else wait till an SC process returns to XL
        } else {
          if (next_time_XL == time_SC) {  //-- next XL event has current time-stamp?
            event_queue_XL.dispatch();    //-- we can dispatch it, since no event
                                          //-- with a smaller time can be unprocessed
          } else {
            wait(wake_up, next_time_XL - time_SC);
          }
        }
      }
    }

[Figure annotation: original OXL event-dispatch loop code]
Figure 6-7. Modified OXL event-dispatch loop.

3.2 Alternative Threaded Process Implementations

SC is only one of the possible thread environments suitable for implementing the threaded process extension. Essentially any thread library can be used with a similar implementation strategy, i.e. C++ code running inside threads and controlled from proxy processes within the OXL environment. The modified OXL scheduler can also be simpler in such a case, since it does not have to run as a slave to another simulation kernel. We have successfully implemented the threaded process extension with the pthread and GNU pth libraries on various operating systems.

4. SOFTWARE PROCESSES SCHEDULING EXTENSION

The performance of real-life software is highly dependent on the operating system it is running on. Especially for multi-threaded or multi-process software, the operating system's scheduler has a strong influence on the overall performance. Since in OXL the system is described as a model of parallel communicating processes, modelling software scheduling improves the accuracy of the software performance model.
In the high-level software model of computation (procHLSW), concurrency is considered at the processor level. This means that a separate processor is assumed for every process (see Figure 6-8).
[Figure 6-8: procHLSW processes P1, P2 and P3 running in parallel over time]
Figure 6-8. High-level software behavior over time.

Naturally, in real life this will typically not be the case. In a realistic software implementation there will be an operating system that allows all the processes to be assigned to the same software processing resource. So from the performance point of view the processes are not running concurrently at all; they are sequentialized by the operating system scheduler onto the processing unit. To model such behaviour in the OXL performance model, a separate process type with this behaviour has been introduced: procManagedSW (see Figure 6-9).

[Figure 6-9: procManagedSW processes P1, P2 and P3 executed one after another over time]
Figure 6-9. Sequentializing computation over time.

To be able to create a process of the type procManagedSW, the designer must first create a scheduling object. This scheduler performs the actual sequentialisation of all the processes attached to this object. The way this is done is defined in one of the member methods of the scheduling object. Currently a simple round-robin scheduler, a scheduler with scheduling priorities and a priority scheduler with aging effect are provided. Additionally, the user can define their own scheduling objects to model the behaviour of the scheduler present in the target operating system.
OXL assumes a non-pre-emptive scheduler, so it is up to the processes to hand over control to the operating system. This can be done either by blocking on a communication primitive or by allowing a context switch (by calling the sync_() call).
It is important to realize that switching between the different SW tasks is not penalty-free. It always takes a certain amount of time (especially for reconfigurable architectures) to change from one task to another. In order to
obtain the most accurate performance results, context-switching overhead has to be considered in the performance model. This feature has been added to the OXL environment. For every process created, the user can define an extra context-switching time (as an argument of the scheduler object, or using the extra setcsoverhead() method), which is then applied to that process during the OXL simulation.

5. BUS MODELING EXTENSION

The OXL high-level model consists of concurrent tasks communicating via semaphores and/or message queues. Both have a non-blocking write access (semaphore unlock, message send) and a blocking read access (semaphore lock, message receive). In high-level system modelling, the focus lies more on functional correctness than on correct behaviour in the time domain. At this level, all communication channels are usually considered to operate in parallel and without delay. Depending on the targeted architecture, some of these channels can be mapped onto a shared communication resource. As a consequence, transfers on these channels can no longer occur at the same time. This section explains a bus model extension based upon distinct properties of process types and communication primitives, going from high-level communication features via bus sharing and access protocols to a complete C++ model of a shared communication resource.

5.1 Modelling Bus Sharing

Connecting the software processor and the hardware by means of a certain bus structure always has an implication on the performance of the system. A bus can usually not be shared by different tasks at the same time, so it is necessary to adapt the model in such a way that bus-sharing properties come into play.
Bus sharing is very similar to task scheduling. In both cases one resource (either the processor or the bus) needs to be shared. This means that accesses (processing or data transfers) have to occur sequentially in time. We could construct a first model for the bus based upon the task scheduling properties already present. By introducing scheduled dummy processes onto the bus channels we obtain the required behaviour. Our initial bus model consisted of the bus writer and bus arbiter parts (see Figure 6-10).
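The sequentialisation that bus sharing and task scheduling have in common can be sketched as follows. This is a minimal C++ illustration under stated assumptions, not the OCAPI-XL bus model: `Transfer`, `Slot` and `serialize_on_bus` are hypothetical names, each transfer has a fixed duration, and the arbitration order is simply the order in which requests are listed.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical request record: when the transfer becomes ready and how
// long it occupies the shared resource (in abstract time units).
struct Transfer {
    unsigned ready;
    unsigned duration;
};

// Resulting occupation of the shared resource by one transfer.
struct Slot {
    unsigned start;
    unsigned finish;
};

// Serializes transfers on one shared resource: a transfer starts when it
// is ready AND the resource is free, so accesses never overlap in time.
// Assumption: 'requests' is already in the order chosen by the arbiter.
std::vector<Slot> serialize_on_bus(const std::vector<Transfer>& requests) {
    std::vector<Slot> slots;
    unsigned bus_free_at = 0;  // time at which the resource becomes free
    for (const Transfer& t : requests) {
        unsigned start = std::max(t.ready, bus_free_at);
        unsigned finish = start + t.duration;
        slots.push_back(Slot{start, finish});
        bus_free_at = finish;
    }
    return slots;
}
```

A transfer that becomes ready while the resource is busy is delayed until the previous transfer finishes, which is exactly the delay a high-level model with parallel, delay-free channels does not show. A different arbitration policy would only change the order in which requests are presented to the function.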
[Figure 6-10: processes P1 and P2 connected by a high-level channel, mapped onto a Bus-Writer and a Bus-Arbiter]
Figure 6-10. Initial Bus Model.

5.1.1 Bus Writer Tasks

A process manager schedules the tasks to ensure that at each moment at most one writer has access to the bus. In order to reuse the scheduling properties already present, these tasks are software-like in behaviour. If no operations within these tasks are annotated, they execute in virtual zero time. By annotating a single bus access with a fixed time duration, bus transfer delays can be taken into account.

5.1.2 Bus Arbiter

A bus arbiter is part of the communication resource itself. It is responsible for deciding which task is allowed to communicate via the bus. How this decision is made differs from one bus architecture to another. Also, depending on the type of bus, a transfer could be a single value, or a whole burst of values. It may very well be that the type of request issued to the arbiter influences the final decision.
The bus arbiter has a lot in common with a task scheduler, and we actually use such a (native) construct to build our bus model. Since our methodology allows the user to define their own process manager, or scheduler, a new one can be created that acts like the real target bus arbiter.

5.2 Modelling Bus Access Protocol

To make our model even more accurate, we could replace the fixed time annotation of one transfer with a more exact value. In an actual transfer it is the bus access protocol that causes a transfer to take some time. But this access protocol may again be different depending on the type of transfer
requested. In a burst transfer it will most probably not be necessary to include the overhead of the bus request/acknowledge, nor the time needed to reverse the direction of the bus (come out of tri-state). These particularities will most probably not have a great impact on the overall performance if the fixed timing annotation is properly chosen, but for consistency we could include the protocol in our model as well.
To include the bus access protocol, we add an additional task to the writer side, the reader side, or both sides. These tasks model the protocol for each transfer. Since these protocols are usually specified as hardware access schemes, the protocol task(s) will be of the hardware type. The final bus model including protocol modelling is represented graphically in Figure 6-11.

[Figure 6-11: processes P1 and P2 connected by a high-level channel, mapped onto Write-protocol, Bus-Writer, Read-protocol and Bus-Arbiter blocks]
Figure 6-11. Final bus model block diagram.

6. HIGH-LEVEL MODELLING OF CONTEXT SWITCHING

When designing in OCAPI-XL, application code can be assigned to the following process types:
• A high-level abstraction for (scheduled) software targets (procHLSW, procManagedSW).
• Two abstractions for creation of ANSI-C software (procANSIC, procMTHRC).
• A high-level abstraction for hardware targets (procHLHW).
• An abstraction for FSMD hardware targets based on OCAPI 1.0 (procOCAPI1).
• A high-level abstraction for integrating with SystemC (procSC).
Assigning code to a process affects its simulation, inter-process communication and also code generation, which is the final step when
heading for an implementation of the design. At the early stages of the project, the user usually works on simulations to obtain correct OCAPI-XL simulation results of the system. At this stage, the code is assigned to high-level process types (procHLSW, procHLHW, procManagedSW). Later, the code is refined towards implementation targets, being either SW or HW. During the refinement step, high-level SW processes have to be rewritten to the procANSIC or procMTHRC types. This allows single-threaded ANSI-C code generation for procANSIC processes and multithreaded ANSI-C code generation for procMTHRC processes. Processes that target HW implementation must first be refined to the procOCAPI1 process type. Subsequently, HDL code (VHDL, Verilog) is generated for each procOCAPI1 process.

6.1 Reconfigurable Context Switching Process

For reconfigurable processes, we consider relocating tasks from the reconfigurable logic to the ISP and vice versa. Therefore, the reconfigurable processes should be specified both as SW and as HW processes, as they can potentially be relocated to a different resource at run-time. The simulation model for task relocation described in the next subsection supports high-level simulation of HW and SW processes, with the opportunity of performance estimation that takes the reconfiguration time into account. After refinement, HW and SW code generation can be done for each reconfigurable process, so that the process can be started either as HW or as SW. The code to transfer state information is not generated automatically, and still has to be inserted explicitly by the designer when dynamic reconfiguration with state information (context) memory is needed.
In OCAPI-XL, no parsing of code is done. Source code (ANSI-C for SW targets, HDL for HW targets) is created for the parts of the system model where the OCAPI-XL objects are used. Generation of SW and HW implementations of communication primitives is also supported.

6.2 Simulation Model for Task Relocation

The ability to (re)schedule a task either in hardware or in software is an important asset in a reconfigurable system-on-chip. To support this feature, a possible (high-level) implementation and management of hardware/software relocatable tasks in OXL has been investigated.
The proposed solution uses a high-level abstraction of the task state information. The entire relocation process is illustrated in Figure 6-12. In order to relocate a task, the OS can send a switch message to that task at any time (1). Whenever the signalled task reaches a switch point, it goes into an interrupted state (2). In this state, all the relevant state information of that switch point is transferred to the OS (3). Consequently, the OS relocates the task to another processor. The task is able to re-initialise itself using the state information it receives from the operating system (4). The task resumes by continuing execution at the corresponding switch point (5). It should be noted that a task can contain multiple switch points and that the state information can be different for every switch point. Furthermore, it is up to the application designer to implement the (application-dependent) switch point(s) in such a way that the state information that needs to be transferred is minimal.

[Figure 6-12: the OS sends a switch signal (1) to the SW task, which is interrupted (2) and transfers its state to the OS (3); after relocation the HW task re-initialises (4) and resumes (5)]
Figure 6-12. Illustration of task relocation.

By modelling task re-scheduling, the application designer can verify beforehand what the impact is for their particular system, for example whether the system performance improvement is not affected too much by the relocation overhead.
However, support for task relocation is not as straightforward as one might expect. Several problems arise, especially if the task comes from a shared software processing resource (or task scheduler object). In this case the relocation of the task affects not only the behaviour of the task itself, but also the behaviour of the task scheduler it was running on, and thus all the tasks being scheduled by that scheduling object.
The OCAPI-XL methodology allows a number of statistics to be tracked during simulation. By relocating a task, these statistics suddenly get a
different meaning. Therefore these statistics must be corrected so that they keep the same meaning.
A first simulation model for task relocation was developed in OXL. The OXL code below illustrates how context switching is coded for a task P1, switching between the different contexts (procHLHW, procManagedSW), and how its behaviour is simulated.

    procDRCF P1("P1");
    //-- initial context: High-Level HardWare (default period of 10)
    P1.context(HLHW);
    //-- second context: software under Round-Robin scheduler (RR)
    P1.context(ManagedSW, &RR);
    //-- next context: High-Level HardWare with period of 2
    P1.context(HLHW, 2);
    {
      //-- here goes "normal" OCAPI-XL task code
      //-- upon this operator the task will switch itself
      //-- to the next context
      switchpoint();
      //-- here goes some more task code
    }
    //-- and run the simulation for 2000 cycles
    run(2000);

Within this model, it is in principle also possible to model resource sharing at the hardware level, by replacing one task with another on the same physical reconfigurable space and adding the appropriate contexts.
Prior to starting the simulation, all the context information must be known for each relocatable task, meaning the processing resources the task will run on and their sequence. If this sequence is known, a new operator, called switchpoint, forces the task to be relocated from the current processing resource to the next processing resource. Moving the task also implies that all the statistics of the current context are finalized, and the statistics of the new context are initialised.

7. CONCLUSIONS

The extensions to the SystemC for supporting the design of SoCs incorporating reconfigurable parts are described in this chapter.
The extension of the OCAPI-XL methodology towards a thread-level library, as well as the SystemC implementation of the OCAPI-XL threaded process, are primarily related to the system specification and system-level simulation steps. For the system specification step, the extension provides the opportunity to consider not only C/C++ but also SystemC code, which is becoming the standard specification language used in industry. Mixtures of C/C++/SystemC specifications are also possible. For the system-level simulation step, the extension broadens the application scope to control-dominated applications, which could not be simulated with the existing OCAPI-XL library. Secondarily, this extension also affects the mapping step by providing novel communication and synchronization primitives used within the threaded processes. Although the thread-level library and the SystemC implementation of the OCAPI-XL threaded process extension have been developed in the context of a methodology for reconfigurable SoCs, they are generally applicable to all kinds of designs.
The software process scheduling extension enhances the system-level simulation step by providing a sequentialisation of software processes. As recent reconfigurable architectures offer one or more processors implemented as soft cores or embedded processors inside the reconfigurable fabric, it is necessary to extend the modelling of software process execution on these processors in OCAPI-XL. With this extension, OCAPI-XL provides the means of modelling both sequential and parallel execution of the processes.
The extension of OCAPI-XL by a bus model is related to the system-level simulation and mapping steps of the design methodology. As reconfigurable architectures often use dedicated bus communication schemes, modelling of the bus behaviour provides the means for early performance estimation during system-level simulation. The set of bus-modelling primitives introduced to OCAPI-XL provides sufficient means for expressing the system-level bus behaviour.
In order to cover one of the distinguishing features of reconfigurable architectures, modelling of dynamic reconfiguration is implemented in the OCAPI-XL library. This includes a new process type characterized by the ability to represent different contexts (different process types) and alternate between them during high-level simulation. The alternation of the processes during simulation is achieved by inserting switch points, specified by the designer, in the high-level specification. In this way the system-level simulation models the dynamic reconfiguration of selected processes at an early stage of the design and provides feedback about the influence of different dynamic reconfiguration schemes on the performance of the system.

REFERENCES

1. OCAPI-XL manual. IMEC's internal document
2. Kienhuis B et al (1997) An approach for quantitative analysis of application-specific dataflow architectures. Proc IEEE Int Conf on Application-Specific Syst Arch and Proc: 338-349
3. Vanmeerbeeck G et al (2001) Hardware/software partitioning of embedded system in OCAPI-XL. Proceedings of the 9th International Symposium on Hardware/Software Codesign - CODES: 30-35
4. Pasko R et al (2000) Functional verification of an embedded network component by co-simulation with a real network. IEEE International High Level Design Validation and Test Workshop - HLDVT: 64-67
5. SystemC v2.0 manual. http://www.systemc.org

PART C
DESIGN CASES

Chapter 7
MPEG-4 VIDEO DECODER

Miroslav Čupák and Luc Rijnders
IMEC, Kapeldreef 75, B-3001 Leuven, Belgium

In: N.S. Voros and K. Masselos (eds.), System Level Design of Reconfigurable Systems-on-Chips, 155-177. © 2005 Springer. Printed in the Netherlands.

Abstract: The OCAPI-XL based approach was applied in the MPEG-4 video decoder design case with the aim of validating the system-level reconfigurability extensions on a typical multimedia application. The MPEG-4 case represents a scenario where tasks are relocated between software and reconfigurable hardware depending on the level of quality of service requested by the user. The MPEG-4 video decoder has been implemented on the Xilinx Virtex-II Pro multimedia demonstration platform.

Keywords: Design space exploration; static reconfiguration; estimation; mapping; partitioning; reconfigurable; reconfigurability; OCAPI-XL; system-on-chip.

1. MPEG-4 VIDEO DECODER IN A NUTSHELL

The next generation of mobile multimedia devices will provide a rich array of digital video and multimedia applications to enhance the end-user experience. MPEG-1 and MPEG-2, the first two video standards from the Moving Pictures Experts Group (MPEG), were fundamental in creating widespread acceptance of digital video formats. Their successor, MPEG-4, can be considered the first true multimedia standard, taking an object-based approach to the coding and representation of natural or synthetic audio-visual content [1,2,3]. It offers a flexible toolset, adaptable to a large variety of requirements, while interoperability among different terminals is guaranteed. The bit rates range from a few hundred bits per second for synthetic audio up to hundreds of Mbps for the modelling and description of complex multimedia scenes.
The MPEG-4 natural visual decoder (video decoder) is a block-based algorithm exploiting temporal and spatial redundancy in subsequent frames. It takes as input a bitstream, a sequence of bits representing the coded video sequences, compliant with the ISO/IEC 14496-2 standard [4]. The bitstream starts by identifying the visual object as a video object (other kinds, such as still textures, exist). This video object can be coded in multiple layers (scalability). One layer consists of Visual Object Planes (VOPs), time instances of a visual object (i.e. frames).
A decompressed VOP is represented as a group of MacroBlocks (MBs). Each MB contains six blocks of 8x8 pixels: 4 luminance (Y), 1 chrominance red (Cr) and 1 chrominance blue (Cb) blocks. Figure 7-1 defines the macroblock structure in 4:2:0 format (the chrominance components are downsampled in the horizontal and vertical directions) [5].

[Figure 7-1: blocks 0 to 3: Y; block 4: Cb (U); block 5: Cr (V)]
Figure 7-1. 4:2:0 Macroblock structure.

Two compression techniques are distinguished. In the intra case, the MB or VOP is coded on its own using an algorithm that reduces the spatial redundancy. Inter coding relates a macroblock of the current VOP to MBs of previously reconstructed VOPs and in this way reduces the temporal redundancy.
Figure 7-2 presents the structure of a simple profile video decoder, supporting rectangular I and P VOPs. An I VOP, or intra coded VOP, contains only independent texture information (only intra MBs). A P VOP, or predictive coded VOP, is coded using motion compensated prediction from the previous P or I VOP; it can contain intra or inter MBs. Reconstructing a P VOP implies adding a motion compensated VOP and a texture decoded error VOP. Note that all macroblocks must be intra refreshed periodically to avoid the accumulation of numerical errors. This intra refresh can be implemented asynchronously among macroblocks.

[Figure 7-2: bitstream; Decode Header; Decode VLC; inverse Scan; inverse AC/DC Pred; inverse Q; IDCT; Decode Motion; Motion Compensate; Reconstruct; Pad; with intra/inter selection, a loop on MBs and a loop on VOPs]
Figure 7-2. The data flow of the simple profile decoder.

1.1 Motion Compensation

A video sequence typically has a high temporal correlation between similar locations in neighbouring images (VOPs). Inter coding (or predictive coding) tracks the position of a macroblock from VOP to VOP to reduce the temporal redundancy. Figure 7-3 follows the movement of a hand and the feet of a dancer in successive frames. The motion estimation process tries to locate the corresponding macroblocks among VOPs. MPEG-4 only supports the translatory motion model.

Figure 7-3. Temporal correlation in a video sequence.

The top left corner pixel coordinates (x,y) specify the location of a macroblock. The search is restricted to a region around the original location of the MB in the current picture; this search area consists of at most 9 MBs (illustrated in Figure 7-4). With (x+u,y+v) the location of the best matching block in the reference, the motion vector equals (u,v). In backward motion estimation, the reference VOP is situated in time before the current VOP, as opposed to forward motion estimation, where the reference VOP comes later in time.
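The search for the best matching block can be sketched as an exhaustive block-matching loop. The following C++ fragment is only an illustration on the estimation (encoder) side: the sum of absolute differences (SAD) as matching criterion, full-pixel accuracy, and the names `Frame`, `MotionVector`, `block_sad` and `find_motion_vector` are all assumptions of this example and are not taken from the text or from the MPEG-4 reference software. A search range of 16 pixels around a 16x16 macroblock corresponds to a search area of at most 9 MBs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

// Hypothetical luminance frame, stored row by row.
struct Frame {
    int width, height;
    std::vector<std::uint8_t> pix;
    Frame(int w, int h)
        : width(w), height(h), pix(static_cast<std::size_t>(w) * h, 0) {}
    std::uint8_t& at(int x, int y) {
        return pix[static_cast<std::size_t>(y) * width + x];
    }
    std::uint8_t at(int x, int y) const {
        return pix[static_cast<std::size_t>(y) * width + x];
    }
};

struct MotionVector { int u, v; };

// Sum of absolute differences between the n x n block at (x,y) in 'cur'
// and the block at (x+u,y+v) in 'ref' (assumed matching criterion).
long block_sad(const Frame& cur, const Frame& ref,
               int x, int y, int u, int v, int n) {
    long sum = 0;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            sum += std::abs(static_cast<int>(cur.at(x + i, y + j)) -
                            static_cast<int>(ref.at(x + u + i, y + v + j)));
    return sum;
}

// Full search over all displacements within +/-range that keep the
// candidate block inside the reference frame.
MotionVector find_motion_vector(const Frame& cur, const Frame& ref,
                                int x, int y, int n = 16, int range = 16) {
    MotionVector best{0, 0};
    long best_sad = std::numeric_limits<long>::max();
    for (int v = -range; v <= range; ++v) {
        for (int u = -range; u <= range; ++u) {
            if (x + u < 0 || y + v < 0 ||
                x + u + n > ref.width || y + v + n > ref.height)
                continue;  // candidate block would leave the frame
            long s = block_sad(cur, ref, x, y, u, v, n);
            if (s < best_sad) { best_sad = s; best = {u, v}; }
        }
    }
    return best;
}
```

The decoder itself never runs this search; it only receives the resulting (u,v) in the bitstream and uses it for motion compensation.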
Figure 7-4. Motion estimation process.

As the true VOP-to-VOP displacements are unrelated to the sampling grid, a prediction at a finer resolution can improve the compression. MPEG-4 allows motion vectors with half-pixel accuracy, estimated through interpolation of the reference VOP. Such vectors are called half-pel motion vectors.
A macroblock of a P VOP is only inter coded if an acceptable match in the reference VOP was found by the motion estimation (otherwise it is intra coded). Motion compensation uses the motion vector to locate the related macroblock in the previously reconstructed VOP. The prediction error e(x,y,t), the difference between the current macroblock MB(x,y,t) and the related macroblock MB(x+u,y+v,t-1), is coded using the texture algorithm.

e(x,y,t) = MB(x,y,t) - MB(x+u,y+v,t-1)

Reconstructing an inter MB implies decoding the motion vector, motion compensation, decoding the error and finally adding the motion compensated MB and the error MB to obtain the reconstructed macroblock.

1.2 Texture Decoding Process

The texture decoding process is block-based and comprises five steps: Variable Length Decoding (VLD), inverse scan, inverse DC & AC prediction, inverse quantization and an Inverse Discrete Cosine Transform (IDCT). Except for the IDCT, all blocks have to produce results numerically identical to ISO/IEC 14496-2 and ISO/IEC 14496-5.
The VLD algorithm extracts code words from Huffman tables, resulting in an 8x8 array of quantized DCT coefficients. Then, the inverse scan reorganizes the positions of those coefficients in the block. In the case of an intra macroblock, inverse DC & AC prediction adds the prediction value of the surrounding blocks to the obtained value. This is followed by saturation to the range [-2048, 2047]. Note that this saturation is unnecessary for an inter MB: because no DC & AC prediction is used, the inter MB DCT coefficients are immediately in the correct range.
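The last reconstruction step of an inter block and the coefficient saturation can be sketched as follows. This is a minimal C++ illustration, not the reference decoder: `Block`, `saturate`, `saturate_coeff` and `reconstruct_inter` are hypothetical names, and clamping the reconstructed samples to [0, 255] assumes 8-bit video, which the text does not state at this point.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// One 8x8 block stored as 64 integer values.
using Block = std::array<int, 64>;

// Generic saturation of a value to the closed range [lo, hi].
inline int saturate(int v, int lo, int hi) {
    return std::min(std::max(v, lo), hi);
}

// Saturation applied after inverse DC & AC prediction (intra blocks).
inline int saturate_coeff(int c) {
    return saturate(c, -2048, 2047);
}

// Reconstructs an inter block by adding the decoded prediction error to
// the motion compensated prediction, i.e. the inverse of
// e(x,y,t) = MB(x,y,t) - MB(x+u,y+v,t-1).
// Assumption: 8-bit samples, hence clamping to [0, 255].
Block reconstruct_inter(const Block& prediction, const Block& error) {
    Block rec{};
    for (std::size_t i = 0; i < rec.size(); ++i)
        rec[i] = saturate(prediction[i] + error[i], 0, 255);
    return rec;
}
```

For a prediction sample of 100, decoded errors of 50, -120 and 200 yield reconstructed samples of 150, 0 and 255 respectively.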
Inverse quantization, basically a scalar multiplication by the quantizer step size, yields the reconstructed DCT coefficients. These coefficients are saturated to the range $[-2^{\mathrm{bitsPerPixel}+3},\ 2^{\mathrm{bitsPerPixel}+3}-1]$. In the final step, the IDCT transforms the coefficients to the spatial domain and outputs the reconstructed block. These values are saturated to the range $[-2^{\mathrm{bitsPerPixel}},\ 2^{\mathrm{bitsPerPixel}}-1]$.

1.3 Error Resilience

The use of variable length coding makes (video) bitstreams particularly sensitive to channel errors. A loss of bits typically leads to an incorrect number of bits being VLC decoded and causes loss of synchronization. Moreover, the location where the error is detected is not the same as where the error occurs. Once an error occurs, all data until the next resynchronisation point has to be discarded. The amount of lost data can be minimized through the use of error resilience tools: resynchronisation markers, data partitioning, header extension and reversible variable length codes [4].

2. IMPLEMENTATION PLATFORM

The following sections describe the families of Virtex-II boards, clarify the concept of embedded soft cores that can be built inside the Virtex-II FPGAs, and explain the process of selecting a suitable platform for the MPEG-4 video decoder demonstrator.

2.1 Virtex II Boards

Xilinx offers a line of prototype boards for the Virtex-II series of Platform FPGAs. These boards are intended for testing and modeling the design functionality. They come with documentation, cables, connectors, and a set of reference designs intended to convey knowledge about efficient use of the boards.
An example from the family of Virtex-II boards is the Xilinx Multimedia Development Board (XMDB). The board is designed to be used as a platform for developing multimedia applications. It supports PAL and NTSC television input and output, true color SVGA output, and an audio codec with power amplifier, as well as Ethernet and RS-232 interfaces. Several push buttons and DIP switches are available for user interaction with the system. The embedded System ACE™ controller allows
for high-speed FPGA configuration from Compact Flash™ storage devices. Figure 7-5 shows the Xilinx Multimedia Development Board components.

[Figure 7-5: Virtex-II FPGA surrounded by an NTSC/PAL video decoder (CCIR 601/656, YCrCb 4:2:2 format), an NTSC/PAL video encoder with TV out, an SVGA video DAC (4:4:4 R'G'B' format), RAM frame buffers, a JTAG and Compact Flash card interface, buttons for video source and effect selection, 10BASE-T/100BASE-TX Ethernet, an audio codec and amplifier with audio in/out, and RS232]
Figure 7-5. Xilinx Multimedia Demonstration Board.

As the Virtex-II family does not contain any embedded SW core on the chip, the usual solution for designs that require a processor core is a virtual processor created out of blocks of code in the reconfigurable fabric.

2.2 MicroBlaze soft processor

The MicroBlaze [6] is a virtual microprocessor that is built by combining blocks of code called cores. It uses a Harvard-style RISC architecture with separate 32-bit instruction and data buses running at full speed to execute programs and access data out of on-chip or external memory (see Figure 7-6). The core contains 800 lookup tables and 32 general purpose registers, with a three-operand instruction format. Its standard peripheral set is designed to work with IBM's CoreConnect on-chip bus to
simplify core integration. The MicroBlaze pipeline is a parallel pipeline, divided into three stages: Fetch, Decode, and Execute. In general, each stage takes one clock cycle to complete. Consequently, it takes three clock cycles (ignoring delays or stalls) for an instruction to complete. Each stage is active on each clock cycle, so three instructions can be in execution simultaneously, one in each of the three pipeline stages. MicroBlaze theoretically runs at 150 MHz and delivers 123 D-MIPS.

Figure 7-6. A view on a MicroBlaze processor.

2.3 Platform Selection

On any SoC project, the overall goal of the platform selection process should be to find the platform that reduces risk as much as possible. The risk considered involves both technical and non-technical aspects.
From the point of view of the proposed methodology, it should be possible to implement its design features (described in Chapter 4) at every level of the design. In the system-level design phase, it has to be determined which design components can be implemented or reused on the platform, and how they interact. In the detailed design phase, when the design is refined, it must be guaranteed that all the refinements are supported by the platform. At the same time, a proper platform verification strategy has to be known, to prove that solutions are not based upon incorrect assumptions. The implementation phase then involves building the system on the specified platform. Here the important factors in fulfilling the design goals are: possible IP reuse of the components, experience of the design team with the board environment, vendor support, etc.
Taking all these aspects into account, the Xilinx Virtex-II Multimedia Demonstration Board with an embedded MicroBlaze soft processor core has been selected as the most appropriate demonstration platform for the MPEG-4 video decoder. The XMDB board provides sufficient support for implementation of the MPEG-4 demonstrator case.

3. MPEG-4 VIDEO DECODER DESIGN FLOW

Figure 7-7 illustrates the proposed methodology (see Chapter 4) and positions the associated tools used to design the MPEG-4 video decoder.
The first design step involved the use of the ATOMIUM [7] methodology for initial data transfer and storage exploration. A number of optimizations were then applied to the reference MPEG-4 simple profile video decoder, starting from the basic description of the video decoder. The main output of the optimization was a platform-independent, memory-optimized video decoder described in C. At the same time, an initial architecture of the decoder was proposed, based on the feedback from the optimization results.
In the next step, the OCAPI-XL methodology (see Chapter 6) was exploited to cover the System-Level Design and Detailed Design phases of the methodology described in Chapter 4. A system-level model, in which functionality and architecture can be described separately, allowed performance modelling at a high level of abstraction. A refinement strategy and executable specifications at all levels enabled a structured path towards implementation. At the end of the design flow, OCAPI-XL generated VHDL code for the reconfigurable HW parts of the design.
The implementation phase of the decoder design has been fully covered by commercial tools dedicated to optimal implementation of the SW and HW parts on the selected platform. The Synplify-Pro FPGA and CPLD synthesis tool has been used for implementing the FPGA part of the design. SW implementation, co-verification and board integration have been supported by Xilinx's ISE and EDK toolsets.

4. SOFTWARE VERSION OF THE MPEG-4 VIDEO DECODER

This section covers the initial analysis of the MPEG-4 video decoder and the decoder optimizations, and describes the pure SW version of the decoder.
4.1 Definition of the Functionality Testbench and Decoder Pruning

The Verification Model (VM) software used as input specification was the FDIS (Final Draft International Standard) natural visual part [3]. Having working code at the start of the design process avoids the tedious task of implementing a system from scratch. Unfortunately, the software specification was very large and contained many different coding styles of varying quality.

[Figure 7-7: ANSI C; ATOMIUM + manual transformations; optimised C; system partitioning (System-Level Design); OCAPI-xl C++; OCAPI-xl VHDL; reconfigurable hardware design (Detailed Design); VHDL FPGA/ASIC implementation design with Synplify (EDIF) and software implementation design with EDK-ISE (Implementation Design); FPGA downloading; product qualification]
Figure 7-7. MPEG-4 Video Decoder Design and Tool Flow.

Moreover, the VM contained all the possible MPEG-4 decoding functionality (i.e. of all possible profiles), resulting in oversized C code distributed over many files. The video decoder on its own has a code size of 93 files (.h and .c source code files) containing 52928 lines (without counting the comment lines).
A necessary first step in the design was extracting the part of the reference code corresponding to the desired MPEG-4 functionality of the given profile and level. ATOMIUM pruning [7] was used to automate this error-prone and tedious task. It removed the unused functions and their calls based on the instrumentation data of a testbench representative of the desired functionality. This implied careful selection of the set of input stimuli, which has to exercise all the required functionality.
Applying automatic pruning with this functionality testbench reduced the code to 40% of its original size. From this point, further manual code reorganization and rewriting became feasible. Throughout the complete analysis and optimizations, the Foreman CIF 3 test case was used as an example for the detailed study of the effects and bottlenecks. The Foreman CIF 3 test case uses no rate control, and hence the decoder has to activate the decompression functionality for every frame of the sequence (a skipped frame just requires displaying but no decompression).

4.2 Initial Decoder Analysis

An analysis of the data transfer and storage characteristics and the computational load initially allowed early detection of the possible implementation bottlenecks, and subsequently provided a reference to measure the effects of the optimizations. The memory analysis was based on the feedback of ATOMIUM. The computational load was assessed by counting the number of cycles with Quantify.
Table 7-1 lists the most memory-intensive functions together with the relative execution time spent in each function for the Foreman CIF 3 test case. The timing results were obtained with Quantify on an HP9000/K460, 180 MHz RISC platform. As expected, memory bottlenecks appearing at this platform-independent level also turn out to consume much time on the RISC platform. The time spent in WriteOutputImage is due to network overhead and disk accesses. Its time contribution, although very large, was neglected during the optimizations (in the real design, no writing to disk occurs). The last column of the table is produced with WriteOutputImage disabled. The following list explains the behavior of the functions in Table 7-1:
• VopMotionCompensate: Picks the MB positioned by the motion vectors from the previous reconstructed VOP. In the case of half-pel motion vectors, interpolation is required.
• BlockIDCT: Inverse Discrete Cosine Transform of an 8x8 block.
• VopTextureUpdate: Adds the motion compensated and texture VOP.
• BlockDequantization: Inverse quantization of the DCT coefficients.
• CloneVop: Copies data of the current to the previous reconstructed VOP by duplicating it.
• VopPadding: Adds a border to the previous reconstructed VOP to allow motion vectors to point out of the VOP.
• WriteOutputImage: Writes the previous reconstructed VOP (without border) to the output files.
Only the IDCT is a computationally intensive function; all the others mainly involve data transfer and storage. The motion compensation and block IDCT together cause more than 40% of the total number of memory accesses, making them the main implementation bottlenecks. Hence, the memory optimizations focused on these functions (i.e. on reducing the number of accesses).

Table 7-1. Motion compensation and the IDCT are the memory bottlenecks of the decoder (Foreman CIF 3 test case)
Function name | # accesses/frame (10^6 accesses/frame) | Relative # accesses (%) | Relative time (%), to disk | Relative time (%), not to disk
)VopMotionCompensate3.925.416.938.34BlockIDCT2.818.09.421.25VopTextureUpdate1.710.73.16.8BlockDequantization0.53.02.04.5CloneVop1.27.51.53.46VopPadding1.17.01.43.08WriteOutputImage1.06.254.9-Subtotal11.674.789.177.43Total15.5100.0100.0100.0BothforHWandforSW,thesizeoftheaccessedmemoryplaysanimportantrole.AccessestosmallermemorieshaveabetterlocalityandhencetypicallyresultinahighercachehitchanceforSWandinlowerpowerconsumptionforHW.Figure7-8groupstheaccessesto4memorysizes:framememorywithasminimalsizetheheightwidthoftheVOP,largebuffercontainingmorethan64elements,bufferwith9to63elementsandregisterswithmaximally8elements.Inthisinitialanalysisstage,thewordlengthoftheelementsisnotconsidered.50%ofthetotalnumberofaccessesistoframememory,13%toalargebuffer,23%toabufferand13 166Chapter7%toregisters.Asaccessestolargememoriesaremostinefficient,theoptimizationsfocusedonreducingtheaccessestothosememories.Spreadoftheaccessesoverdifferentmemorysizes(ForemanCIF3)160001400012000Framememory10000LargeBuffer8000Bufferaccesses6000#Register400020000PrunedFigure7-8.Mostaccessesofthereferencedecoderareto(large)framememories.Fromtheinitialanalysisofthe(pruned)FDIScode,ahigh-leveldataflowmodelhasbeenderived.ForeveryVOP,thealgorithmloopsovertheMBs.First,themotioninformationisreconstructed.IncaseofaninterMB,themotionvectorisdecodedandthemotioncompensatedMBisstoredatthecurrentpositioninthecompensatedVOP.IncaseofanintraMB,thecompensatedMBisstoredasallzeros.Secondly,thetextureinformationisdecoded.InverseVLCandinversescanyieldtheDCTcoefficients.IncaseofanintraMB,alsoinverseDC&AC(ifenabled)predictionhastobeperformed.InversequantizationandIDCTproducethetextureMBthatisstoredatthecurrentpositioninthetextureVOP.WhenallMBsoftheVOPareprocessed,thereconstructedVOPiscomposedbyaddingthecompensatedandtextureVOP.ThiscompleteVOPiscopiedasitisneededatthenexttimeinstanceforthemotioncompensationasreference.Finally,aborderisaddedtothisreferenceVOPtoallowthemotionvectortopointoutoftheimage.TheresultingVOPiscalledthepadd
edVOP.Thisillustratesthatthedataexchangedbetweenthemainpartsofthedecoderisofframesize.HencethedataflowofthereferencedecoderisVOPbased. 7.MPEG-4VideoDecoder1674.3DecoderOptimizationsDecoderoptimizationswereperformedintwophases.Duringthefirstphase,thedataflowwastransformedfromframe-basedtomacroblock-based.Inthesecondphase,ablock-baseddataflowwasintroduced.Theseoptimizationsaimedatthereductionofthenumberofaccessesandtheimprovementofthelocalityofdata.TheeffectoftheplatformindependentoptimisationshasbeenassessedbyATOMIUMandhasbeenvalidatedtowardssoftwareandhardwareimplementation.Theglobalnumberofaccesseswasreducedwithafactor5.4to18.6,dependingonthecomplexityofthesequence.Thepeakmemoryusagedroppedfromsomemegabytestoafewkilobytes.Theperformancemeasureshowedaconsistentspeedup.ThehighestspeedupwasmeasuredonaPCplatform,wherethespeedupfactorvariesbetween5.9and23.5.Theproposedarchitecturecontainsasingleprocessorandathreelevelmemoryorganization.Theobtainedresultsaregenericandallowarapidevaluationofalternativememoryhierarchies.4.4EvaluationAfteranalysisandoptimizationsofthecode,theSWonlyversionofSWMPEG-4videodecoderhasbeenimplementedonXilinxVirtex-IIMultimediaDemonstrationBoardrunningfullyonMicroBlazesoftprocessor.Boardmeasurementshaveshownthatdecoderrunsat0.5framespersecondforatypicalCIFvideosequence.EveniftheSWrelatedaccelerationtechniqueswouldbeused,whichwouldbringayieldofmagnitudespeedup,noreal-timebehaviorwouldbeachieved.HWaccelerationwastheonlysolutiontosolvetheproblem.ThismovesthecriticalfunctionalitytotheFPGAfabric.TheHW/SWpartitioningofvideodecoder,asdescribedinthefollowingsection,allowsforparallelprocessinginSWandHW,assumingthatthetimepreviouslyconsumedbycriticalblocksisminimalwhenmovedtoHW.Thiswayalsoacommunicationoverheadisreduced.5.HARDWAREACCELERATEDVERSIONOFMPEG-4VIDEODECODERStraightforwardimplementationofpureSWversionofMPEG-4videodecoderresultedininsufficientperformanceofthesystem.Thenextstepin 
the design was proposing the acceleration steps to improve the processing speed towards real-time behavior.

5.1 HW/SW Partitioning

The primary candidates for implementation in hardware were the blocks that dominate computation and data transfer: VopMotionCompensate, BlockIDCT, VopTextureUpdate and BlockDequantization (see Table 7-1). Secondarily, moving those blocks to hardware also influenced the partitioning of a number of sub-blocks that were involved in transferring the data between the memories and HW/SW. In particular, the way data in the memories is accessed had an impact on the HW/SW partitioning of sub-blocks. These were put in HW if an efficient cycle count saving could be obtained. The final HW/SW partitioning of the MPEG-4 video decoder is shown in Figure 7-9.

[Figure: block diagram. Labels: Memory (Frame Buffer 1, Frame Buffer 2, Current Vop, Buffer YUV); FPGA fabric (Color Transform, Motion Comp., Text. Block + Comp. Block, Q⁻¹ + IDCT); MicroBlaze soft processor on FPGA fabric (Decode Header, Decode Motion, VLC⁻¹, Scan⁻¹, AC/DC Pred⁻¹, Inter/Intra selection); loop on blocks, loop on MBs, loop on VOPs.]
Figure 7-9. HW/SW partitioning for hardware accelerated version of the decoder.
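As a rough cross-check of this partitioning choice, the share of execution time spent in the four candidate blocks bounds the gain that offloading them can bring (Amdahl's law). The sketch below uses the "not to disk" column of Table 7-1; the helper names are illustrative and not part of the original design flow. Since that profile was measured on the RISC workstation with the reference decoder, the result is only indicative for the MicroBlaze platform, and it ignores communication overhead and HW/SW parallelism.

```python
# Relative execution time (%) per function, taken from the
# "not to disk" column of Table 7-1 (HP RISC platform profile).
RELATIVE_TIME_NOT_TO_DISK = {
    "VopMotionCompensate": 38.34,
    "BlockIDCT": 21.25,
    "VopTextureUpdate": 6.8,
    "BlockDequantization": 4.5,
    "CloneVop": 3.46,
    "VopPadding": 3.08,
}

# Blocks selected as primary hardware candidates in Section 5.1.
HW_CANDIDATES = (
    "VopMotionCompensate",
    "BlockIDCT",
    "VopTextureUpdate",
    "BlockDequantization",
)


def accelerated_fraction(profile, candidates):
    """Fraction of the total run time spent in the candidate blocks."""
    return sum(profile[name] for name in candidates) / 100.0


def amdahl_speedup(fraction, block_speedup=float("inf")):
    """Overall speedup when `fraction` of the time is sped up by `block_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / block_speedup)


if __name__ == "__main__":
    fraction = accelerated_fraction(RELATIVE_TIME_NOT_TO_DISK, HW_CANDIDATES)
    print(f"time in HW candidates: {fraction:.1%}")
    print(f"upper bound on speedup: {amdahl_speedup(fraction):.2f}x")
    # Hypothetical case: the hardware blocks run 10x faster than software.
    print(f"with 10x faster blocks: {amdahl_speedup(fraction, 10.0):.2f}x")
```

The bound is reached only if the hardware blocks take negligible time, which is the same assumption the text makes when it argues for parallel processing in SW and HW.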
5.2 HW/SW Co-Design in OCAPI-XL

After HW/SW partitioning, the OCAPI-XL model of the accelerated MPEG-4 video decoder was built (see Figure 7-10). It consists of four main OCAPI-XL blocks: motion compensation (MC), inverse DCT and dequantization (IDCT-Q⁻¹), a process responsible for writing the block to the YUV buffer (Block2BufferYUV) and a process which stores the buffered data to the current VOP (BufferYUV2CurrentVOP). These blocks further contain smaller OCAPI-XL processes, each executing a specific function within the block. After introducing proper communication mechanisms between the OCAPI-XL processes, the communication between the MicroBlaze and the HW accelerator was defined. A memory-mapped interface serves the purpose of communicating between the MicroBlaze and the different HW blocks, both for data and control signals. For memory accesses to the YUV buffer and VOP blocks, a C++ parametrisable buffer library has been used. To visualize the decoding process, a dual video memory system was proposed for display rendering.

[Figure: block diagram. Labels: MicroBlaze; Mux/Demux; MC in/out; IDCT-Q⁻¹ in/out; MC; IDCT-Q⁻¹; MC out; IDCT-Q⁻¹ out; Block2BufferYUV; BufferYUV; BufferYUV2CurrentVOP; Current VOP.]
Figure 7-10. HW Accelerator Data and Control Flow.
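To make the idea of a memory-mapped interface carrying both data and control concrete, here is a minimal behavioural sketch. It is not the interface used in the design: the register offsets, the start/done bits and the stand-in "dequantization" function are all hypothetical and chosen only for illustration.

```python
class MemoryMappedBlock:
    """Toy model of one HW block seen by the processor as a few registers.

    All offsets and bit meanings are hypothetical; the chapter does not
    describe the actual register map.
    """

    CTRL, STATUS, DATA_IN, DATA_OUT = 0x00, 0x04, 0x08, 0x0C
    START_BIT = 0x1
    DONE_BIT = 0x1

    def __init__(self, function):
        self._function = function   # behaviour of the modelled HW block
        self._inputs = []
        self._outputs = []

    def write(self, offset, value):
        if offset == self.DATA_IN:
            self._inputs.append(value)          # data path
        elif offset == self.CTRL:
            if value & self.START_BIT:          # control path
                self._outputs = list(self._function(self._inputs))
                self._inputs = []
        else:
            raise ValueError(f"write to unknown offset {offset:#x}")

    def read(self, offset):
        if offset == self.STATUS:
            return self.DONE_BIT if self._outputs else 0
        if offset == self.DATA_OUT:
            return self._outputs.pop(0)
        raise ValueError(f"read from unknown offset {offset:#x}")


def run_block(block, words):
    """Software-side driver: push data, start the block, drain the results."""
    for word in words:
        block.write(block.DATA_IN, word)
    block.write(block.CTRL, block.START_BIT)
    results = []
    while block.read(block.STATUS) & block.DONE_BIT:
        results.append(block.read(block.DATA_OUT))
    return results


if __name__ == "__main__":
    # Stand-in for a dequantizer: scale every coefficient by a fixed step.
    dequantizer = MemoryMappedBlock(lambda coeffs: [c * 2 for c in coeffs])
    print(run_block(dequantizer, [1, -3, 5]))
```

On real hardware the reads and writes would be bus transactions to fixed addresses; the polling loop corresponds to the processor checking a status register before fetching results.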
5.3 Performance Estimation

The next step in the MPEG-4 video decoder design was the estimation of the gain that can be obtained by HW acceleration. OCAPI-XL estimation techniques were used for this task. Prior to performance estimation, the C code for the VopMotionCompensate, BlockIDCT, VopTextureUpdate and BlockDequantization blocks was rewritten as OCAPI-XL processes and refined. We used the OCAPI-XL operation set simulation approach for performance estimation, which is analogous to the Instruction Set Simulator approach. The performance models of processors were characterized in a table of operations with associated execution cycle counts. The execution cycle counts were obtained from board measurements by execution of small programs. The operation set approach covers performance estimation of the OCAPI-XL processes. However, as the SW part of the decoder was running as a separate thread, it was still necessary to annotate the SW tasks with proper timing information. We have exploited the simulation time results of the pure SW version of the decoder running on the XMDB platform (see Table 7-2) to obtain the approximate times for those processes.

Table 7-2. Time spent in the main functional blocks during the decoding of Foreman CIF 450 kbps, 12 seconds of video, for the pure SW version on XMDB
Functional Block | Function time (s) | Relative time (%)
Motion Compensation | 164.0 | 28.8
BufferToVop | 75.8 | 13.3
VOP Texture Update | 105.7 | 18.6
Q⁻¹/IDCT | 105.9 | 18.6
VLC Decoding | 43.7 | 7.7
Init Block | 20.2 | 3.5
DC AC Reconstruction | 14.7 | 2.6
Read Bitstream | 3.0 | 0.5
Other | 36.4 | 6.4
Total | 569.5 | 100.0

The OCAPI-XL unified system description, the operation set simulator and the extension that enabled context switching between HW and SW together made it possible to estimate the performance of the reconfigurable behavior of the MPEG-4 decoder. The design was modelled in two versions:
• configured as the pure SW version of the MPEG-4 decoder, and
• configured as the HW accelerated version.
With the SW processes annotated, the performance estimation started with simulation of the pure SW version of the decoder. The frame times were obtained and the average frame time was calculated. For the pure SW version, the OCAPI-XL processes have been defined as managed SW
type, and a round-robin scheduler has been used. This corresponds to serial execution of processes on a single MicroBlaze processor. For the HW accelerated version of the decoder, the OCAPI-XL processes have been redefined to the high-level hardware (HLHW) type and the code has been recompiled. HLHW is characterised by a detailed model relating function to cycle time. This cycle time information was specified during the refinement step in the OCAPI-XL design. Comparing the average frame time with that of the pure SW version, a speed-up factor of 4.2 was estimated. These two reconfigurable scenarios are switched during the OCAPI-XL simulation at specific switch points inside the OCAPI-XL tasks. It should be noted that switching between the different contexts is fully supported during high-level simulations. However, it is the responsibility of the designer to solve the HW/SW task relocation at the implementation level.

5.4 Further optimizations

Estimation of the performance of the HW accelerated decoder indicated an improvement by a factor of 4.2 (frame rate 2.1 fps) compared to the pure software version. Further steps therefore concentrated on improving the memory access times to obtain real-time performance. Basic experiments quantified the data transfer cost on the multimedia board to allow for assessing the impact of an improved platform. A small C program counted the number of cycles required for a data transfer to the different kinds of available memory on the multimedia board: local memory, block RAM and ZBT RAM (off chip). Table 7-3 lists the different read/write times when the function and/or stack resides in local memory or off chip (ZBT RAM). Table 7-4 lists the number of cycles spent during a data copy from one kind of memory (row) to another (column).

Table 7-3. Memory access cycles (read / write)
Accessed memory | Local stack, local function | Local stack, non-local function | Non-local stack, local function | Non-local stack, non-local function
Local Memory | 6 / 6 | 15 / 15 | 11 / 14 | 28 / 28
Block RAM | 8 / 7 | 22 / 19 | 13 / 15 | 30 / 32
ZBT RAM | 10 / 7 | 24 / 19 | 15 / 15 | 32 / 32
Stack | 4 / 4 | 10 / 10 | 9 / 9 | 20 / 20

The results in the tables above indicated that cycle savings could be gained by putting the selected object files of the decoder in local memory to make their functions local. Making use of local memories for object code, in
connection with pixel packing, a speedup factor of 2.3 was obtained. Other optimisations included separating the MicroBlaze instruction and data busses, Direct Memory Access and exploiting the compiler optimizations. By cumulatively applying these performance optimization steps, a frame rate of 25 fps has been obtained. The real-time performance of 30 fps could be easily obtained if the XMDB board maximum clock frequency were not limited to 81 MHz.

Table 7-4. Data copy cycles (row: source memory, column: destination memory)

Local stack
Source | Local function: Local Memory | Local function: Block RAM | Local function: ZBT RAM | Non-local function: Local Memory | Non-local function: Block RAM | Non-local function: ZBT RAM
Local Memory | 8 | 9 | 9 | 20 | 24 | 24
Block RAM | 10 | 11 | 11 | 27 | 28 | 28
ZBT RAM | 12 | 13 | 13 | 29 | 30 | 30

Non-local stack
Source | Local function: Local Memory | Local function: Block RAM | Local function: ZBT RAM | Non-local function: Local Memory | Non-local function: Block RAM | Non-local function: ZBT RAM
Local Memory | 16 | 17 | 17 | 33 | 37 | 37
Block RAM | 18 | 19 | 19 | 40 | 41 | 41
ZBT RAM | 20 | 21 | 21 | 42 | 43 | 43

5.5 Implementation Details

The video decoder demonstrator was realized on the Xilinx MicroBlaze Development Board, which incorporates a Virtex-II xc2v2000 FPGA and an embedded MicroBlaze soft processor core. The board is designed to be used as a platform for developing multimedia applications. The board supports five independent banks of 512K x 36 bit 130 MHz ZBT RAM with byte write capability. This memory is used as the video frame buffer store. The embedded System ACE environment, consisting of a CompactFlash storage device and a controller, is used for storing the encoded data. The ethernet connection is used to trigger the decoding process from the browser running on a neighbouring PC. The decoded sequence is displayed on the monitor connected to the SVGA output of the board.
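The frame rates quoted in this chapter (0.5 fps for the pure SW version, a factor of 4.2 from HW acceleration, a factor of 2.3 from local memories and pixel packing, and 25 fps at 81 MHz after all optimizations) can be cross-checked with a few lines of arithmetic. The sketch below rests on two assumptions that the text does not state explicitly: the 2.3 factor applies on top of the 2.1 fps HW accelerated version, and the frame rate scales linearly with the clock frequency. The function names are illustrative.

```python
SW_ONLY_FPS = 0.5            # pure SW decoder on the MicroBlaze (Section 4.4)
HW_SPEEDUP = 4.2             # estimated gain of the HW accelerated version
LOCAL_MEM_SPEEDUP = 2.3      # local memories for object code + pixel packing
FINAL_FPS = 25.0             # after all optimization steps
BOARD_CLOCK_MHZ = 81.0       # maximum clock available on the board
TARGET_FPS = 30.0            # real-time target


def hw_accelerated_fps():
    """Frame rate after HW acceleration only."""
    return SW_ONLY_FPS * HW_SPEEDUP


def remaining_speedup():
    """Gain attributable to the other optimizations (separate busses, DMA,
    compiler options), assuming the individual factors multiply."""
    return FINAL_FPS / (hw_accelerated_fps() * LOCAL_MEM_SPEEDUP)


def required_clock_mhz(target_fps):
    """Clock needed for `target_fps`, assuming fps scales linearly with clock."""
    return BOARD_CLOCK_MHZ * target_fps / FINAL_FPS


if __name__ == "__main__":
    print(f"HW accelerated: {hw_accelerated_fps():.1f} fps")
    print(f"overall gain vs. pure SW: {FINAL_FPS / SW_ONLY_FPS:.0f}x")
    print(f"gain from remaining optimizations: {remaining_speedup():.2f}x")
    print(f"clock for {TARGET_FPS:.0f} fps: {required_clock_mhz(TARGET_FPS):.1f} MHz")
```

Under these assumptions, a clock of roughly 97 MHz would be needed for 30 fps, which is consistent with the remark that the 81 MHz board limit is what prevents full real-time operation.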
The MPEG-4 video decoder reads control and configuration settings from file, initializes and starts MPEG-4 decoding. The decoding itself is triggered with a URL request in a browser on a PC connected to the same LAN as the multimedia board. This request starts up the MPEG-4 Video Decoder on the MicroBlaze, which opens and reads the control file located in System ACE flash RAM on the board and then reads the stream of encoded data also stored in flash RAM. The decoder generates YUV frame data that are sent to the rendering block and then displayed on the monitor connected to the board. Resource utilization for the whole MPEG-4 decoding system on the xc2v2000 FPGA on the board was 7703 slices, i.e. 71% of the resources. The HW accelerator blocks themselves consumed 5000 slices, which represents 46% of the resources. Allocation data for built-in multipliers, block RAMs and LUTs are shown in Table 7-5. The clock rate for the decoder was set to the maximum available on the board, 81 MHz.

Table 7-5. Xilinx Virtex-II xc2v2000 FPGA resource utilization
Resource | Used | Utilization
MULT18X18s | 19 | 33%
RAMB16s | 43 | 76%
LUTs | 11237 | 52%

6. RESULTS ANALYSIS

6.1 Analysis of Design Methodology Results

6.1.1 Benefits

The key benefit shown on the MPEG-4 video decoder is the demonstration of high-level simulation-based performance estimation and of the evaluation of context switching between the different computation resources. Based on the performance estimation results, it is possible to construct a trade-off curve for the considered HW/SW partitions. This gives the designer the opportunity to evaluate, at an early stage of the design process, which components are beneficial to implement in reconfigurable HW and which ones will run in SW.

6.1.2 Disadvantages

OCAPI-XL performance estimation is based on the operation set simulation approach. This means that for every operation a true execution cycle count has to be provided to get the estimation results as close as possible
to a real board execution time. These cycle counts have to be obtained from board measurements, meaning that either the board has to be available or estimates of the cycle counts have to be used. However, ad-hoc estimates introduce a considerable risk that the expected performance results will miss the actual board performance. Therefore, board measurements that are as exact as possible are critical to match the high-level simulation with real-life execution on the board.

6.1.3 Summary

Recent platform FPGAs integrate high-performance CPUs within reconfigurable fabric. This combination provides a flexible and high-performance system design environment suitable for deployment of a wide variety of applications. The ability to estimate performance and the introduction of context switch modelling at the high level of design bring the advantage of early evaluation of possible reconfigurable HW/SW partitioning decisions. Demonstrated on the MPEG-4 video decoder application, the proposed OCAPI-XL based approach has proven to be a methodology that successfully copes with reconfigurable SoC design.

6.2 Analysis of Implementation Results

6.2.1 Benefits

The design of the MPEG-4 Video Decoder demonstrated that the OCAPI-XL based methodology is a highly suitable approach for designing RSoCs at the system level. It has been shown that the ability to generate an HDL description from refined OCAPI-XL models plays an important role with respect to design time. Although the designer is required to put in effort to refine the high-level OCAPI-XL process types to low-level processes, the benefit of fast HDL code generation during the possible design iterations provides an overall gain by reducing the HDL re-design and re-simulation time. Moreover, the OCAPI-XL HDL code generator provides the HDL description of communication primitives, interconnection of HDL blocks, HDL testbench generation and easy integration of IP blocks. With respect to reconfigurability, the automatic HDL code generation is beneficial for flexible generation of different reconfiguration scenarios. The OCAPI-XL dynamically reconfigurable process type (procDRCF) implements context switching between the high-level HW/SW process types. This enables fast exploration of a variety of different reconfigurable schemes at the high-level design step. The type of the process can be alternated during simulation at an arbitrary time, taking into account the reconfiguration time overhead. In this sense, the modelling of context switching opens the possibility of modelling dynamic reconfiguration at the high level of the design. Having the opportunity to model run-time context switching, the designer must also be able to estimate the performance of different reconfigurations. This allows fast evaluation of high-level decisions and focusing on those parts of the design flow where the gain in performance and efficiency is greatest. Exploiting OCAPI-XL's Operation Set Simulator approach has been shown to be beneficial for annotation of SW processes implemented on the MicroBlaze soft core embedded on the FPGA device.

6.2.2 Disadvantages

The experiences from the MPEG-4 design indicate that detailed knowledge of the implementation platform is crucial for efficient implementation of the design. Such knowledge can only be obtained from board experiments making use of small dedicated examples. The examples have to be built to gain information about the different aspects of the board. Building the testbench examples and obtaining and evaluating results require allocation of extra design time. For the purpose of the MPEG-4 Video Decoder implemented on Xilinx's Multimedia Development Board, the following aspects have been investigated:
• MicroBlaze soft core performance measurements; to investigate the abilities of the SW processor.
• Memory explorations; to find out the data transfer times between the different types of memories (local memory, block RAM, off-chip ZBT RAM).
• Vendor support maturity; to investigate the support for newly introduced devices and implementation boards.
As mentioned in the section above, the OCAPI-XL based methodology provides full support for high-level modelling of dynamic reconfiguration by introducing context switching. At the implementation level, the situation is much more complicated. Context switching between HW and SW requires the transitions from one function to another to be as smooth as possible. This responsibility falls to the real-time operating system (OS), which manages all these complex transitions. Among other tasks, the OS is responsible for managing the switching between reconfigurable HW and SW on the FPGA, i.e. it must suspend certain tasks that are running so that other tasks can take a turn. To do so, it must remember the state of each task before it stopped execution so that each task can restart from the same state. The implementation of such mechanisms on recent implementation boards is possible but not straightforward. The true exploitation of dynamic reconfiguration is expected in future platforms.

OCAPI-XL has been extended by a new bus model extension based upon properties of process types and communication primitives. By annotating a bus model with timing information, the bus timing behaviour can be modelled. Although the model could be exploited for high-level bus performance estimation of the MPEG-4 Video Decoder, the decision has been made not to use this approach. Instead, each type of transfer on the bus was annotated with specific timing information obtained from small experiments. The main reason for using this approach was the insufficient amount of information found in the documentation about the Xilinx Virtex-II bus architecture, which uses a combination of busses (PLB, OPB) and bridges (PLB2OPB) to communicate between the reconfigurable HW and the embedded processor.

6.2.3 Summary

From the descriptions above, the following conclusions can be drawn:
• The system-level OCAPI-XL approach, extended with reconfigurability features, is a valid approach for designing RSoCs.
• Dynamic reconfiguration represents an implementation obstacle in recent reconfigurable architectures.
• From a designer's point of view, deep knowledge of the reconfigurable architectures and platform(s) is still required for efficient mapping of the algorithm.
• There is a lack of implementation information (especially for newly introduced platforms) which can be fed back to the high-level design phase for accurate high-level modelling.

7. CONCLUSIONS

The design methodology and flow described in Chapter 4, instantiated for OCAPI-XL, have been used for the realization of the MPEG-4 Video Decoder demonstrator targeting high-level performance estimation of reconfigurable systems. The OCAPI-XL performance estimation techniques, enhanced by new reconfigurable features, demonstrate the ability to obtain high-level statistics, which enable construction of a graph with the possible
shape of the curve for the best possible partitioning of the application. By modelling the dynamic reconfiguration of selected processes at an early stage of the design, feedback about the influence of different dynamic reconfiguration schemes on the performance of the system is provided. Based on that, the optimal run-time operation of the video decoder application can be selected. The accuracy of the estimations has been measured by comparing the estimated performance with the board performance. The difference is a very acceptable 8% for the most relevant test sequence. The MPEG-4 Video Decoder system has been implemented on a Virtex-II FPGA with an embedded MicroBlaze soft processor core on the Xilinx Multimedia Development Board.

REFERENCES

1. MPEG Requirements Subgroup (2001) Overview of the MPEG Standard, ISO/IEC JTC1/SC29 WG11 N3931
2. JPEG (2004) Available at: http://www.jpeg.org
3. ISO/IEC JTC1/SC29 WG11 14496-5 (2000) Information technology - Generic coding of audio-visual objects - Part 5 Amd 1: Simulation software, N3508
4. ISO/IEC JTC1/SC29 WG11 14496-2 (1999) Information technology - Generic coding of audio-visual objects - Part 2: Visual, Amendment 1: Visual Extensions, N3056
5. Bhaskaran V, Konstantinides K (1997) Image and Video Compression Standards. Algorithms and Architectures, Kluwer Academic Publishers
6. Xilinx (2004) Available at: http://www.xilinx.com/products/design_resources/proc_central/index.htm
7. ATOMIUM (2004) Available at: http://www.imec.be/design/atomium
Chapter 8

PROTOTYPING OF A HIPERLAN/2 RECONFIGURABLE SYSTEM-ON-CHIP

Konstantinos Masselos¹,² and Nikolaos S. Voros¹
¹ INTRACOM S.A., Hellenic Telecommunications and Electronics Industry, Greece
² Currently with Imperial College of Science Technology and Medicine, United Kingdom

N.S. Voros and K. Masselos (eds.), System Level Design of Reconfigurable Systems-on-Chips, 179-207. © 2005 Springer. Printed in the Netherlands.

Abstract: In this chapter the prototyping of a reconfigurable System-on-Chip realizing the HIPERLAN/2 WLAN system is discussed. In this case reconfigurable hardware is exploited to introduce post-fabrication functional upgrades. For the prototyping, a commercial platform using components-off-the-shelf has been used. The design flow and system level design methods described in the previous chapters were used for the system development. An evaluation of the design flow and methods in the context of this specific design is also presented.

Keywords: Reconfigurable System-on-Chip, prototyping, HIPERLAN/2, functionality upgrading.

1. INTRODUCTION

In this chapter, the prototyping of a HIPERLAN/2 reconfigurable System-on-Chip on a platform incorporating components-off-the-shelf (COTS) is described. The targeted system has been developed to form the basis for the development of a family of fixed wireless access systems based on HIPERLAN/2 that can be upgraded to support outdoor communications as well. The integration of additional functionality that may be used in a future product improvement (even after product shipment) could rely on the use of software upgrades (this is a commonly used practice in software products). However, due to the expected complexity of the system parts that will support the extra functionality (mainly related to complex physical layer DSP tasks), hardware acceleration will be required. The design flow
presented in Chapter 4 has been adopted for the design of the targeted system in order to provide (a) efficient architecture exploration early enough in the design cycle and (b) a seamless path from specification to implementation.

2. HIPERLAN/2 SYSTEM DESCRIPTION

The HIPERLAN/2 system [1, 2, 3] includes two types of devices: the mobile terminals (MT) and the access points (AP). A typical HIPERLAN/2 architecture is depicted in Figure 8-1. The architectures of the Access Point and the Mobile Terminal are presented in Figure 8-2.

[Figure: an AP attached to an ethernet backbone, with several MTs.]
Figure 8-1. Typical HIPERLAN/2 architecture

[Figure: block diagrams of the ACCESS POINT (AP) and the MOBILE TERMINAL (MT). Block labels: RF front-end, IF part, Rx IF, Tx IF, baseband/DLC modem, processor, I/O bus, RF control, Ethernet controller, Ethernet transceiver, PCI controller & bridge, PCI bus.]
Figure 8-2. Architectures of AP - MT

The HIPERLAN/2 basic protocol stack and its functions are shown in Figure 8-3. The convergence layer (CL) offers a service to the higher layers. The DLC layer consists of the Error Control function (EC), the Medium Access Control function (MAC) and the Radio Link Control function (RLC). It is divided into the data transport functions, located mainly on the right hand side (user plane), and the control functions on the left hand side (control plane). The user data transport function on the right hand side is fed with user data packets from the higher layers via the User Service Access Point (U-SAP). This part contains the Error Control (EC), which performs an ARQ (Automatic Repeat Request) protocol. The DLC protocol is connection oriented, which is shown by multiple connection end points in the U-SAP. One EC instance is created for each DLC connection. In the case where the higher layer is connection oriented, DLC connections can be created and released dynamically. In the case where the higher layer is connectionless, at least one DLC connection must be set up which handles all user data, since HIPERLAN/2 is purely connection-oriented. The left part contains the Radio Link Control Sublayer (RLC), which delivers a transport service to the DLC Connection Control (DCC), the Radio Resource Control (RRC) and the Association Control Function (ACF). Only the RLC is
standardized, which implicitly defines the behavior of the DCC, ACF and RRC. One RLC instance needs to be created per MT. The CL on top is also separated into a data transport and a control part. The data transport part provides the adaptation of the user data format to the message format of the DLC layer (DLC SDU). In case of higher layer networks other than ATM, it contains a segmentation and reassembly function. The control part can make use of the control functions in the DLC, e.g. when negotiating CL parameters at association time.

Figure 8-3. HIPERLAN/2 protocol stack and functions

The DLC functions include the following operations:
• (Dis)association
• DLC User (dis)connection
• Encryption, decryption
• (De)framing
• Contention management mechanism
• Broadcast Control Channel (BCCH) and Frame Control Channel (FCCH) analysis and synthesis
• DLC-CL buffering
• Automatic Repeat Request (ARQ) mechanism for asynchronous transactions
• Power Saving
• Dynamic Frequency Selection
• Transmission Power Control

The medium access control (MAC) is a centrally scheduled TDMA/TDD scheme. Centrally scheduled means that the AP/CC controls all transmissions over the air. This holds for the uplink as well as for the downlink and direct mode phases. The basic structure of the air interface generated by the MAC is shown in Figure 8-4. It consists of a sequence of MAC frames of equal length with 2 ms duration. Each MAC frame consists of several phases: Broadcast (BC) phase, Downlink (DL) phase, Uplink (UL) phase, Direct Link (DiL) phase and Random access (RA) phase.

[Figure: a sequence of MAC frames; one frame is expanded into BC phase, DL phase, DiL phase, UL phase and RA phase, with flexible boundaries between the phases.]
Figure 8-4. Basic MAC frame format

The DL, DiL and UL phases consist of two types of PDUs. The long PDUs have a size of 54 bytes and contain control or user data (see Figure 8-5). The DLC SDU, which is passed from or to the DLC layer via the U-SAP, has a length of 49.5 bytes. The remaining 4.5 bytes are used by
the DLC for a PDU type field, a sequence number (SN) and a cyclic redundancy check (CRC). The purpose of the CRC is to detect transmission errors; it is used, together with the SN, by the EC. The short PDUs with a size of 9 bytes contain only control data and are always generated by the DLC. They may contain resource requests in the uplink, ARQ messages like acknowledgements and discard messages, or RLC information. The same size of 9 bytes is also used in the RCH. The RCH can only carry RLC messages and resource requests. The access method to the RCH is a slotted aloha scheme. This is the only contention-based medium access phase in HIPERLAN/2. The collision resolution is based on a binary backoff procedure, which is controlled by the MTs. The AP/CC can decide dynamically how many RCH slots it provides per MAC frame.

[Figure: DLC PDU (54 octets) with the fields PDU Type, SN, Payload and CRC.]
Figure 8-5. Format of the long PDUs

In the physical layer, orthogonal frequency division multiplexing (OFDM) has been selected as the modulation scheme for HIPERLAN/2 due to its good performance on highly dispersive channels. The channel raster is equal to 20 MHz to provide a reasonable number of channels. In order to avoid unwanted frequency products in implementations, the sampling frequency is also chosen equal to 20 MHz at the output of a typically used 64-point IFFT. The obtained subcarrier spacing is 312.5 kHz. In order to facilitate implementation of filters and to achieve sufficient adjacent channel suppression, 52 subcarriers are used per channel: 48 subcarriers carry actual data and 4 subcarriers are pilots, which facilitate phase tracking for coherent demodulation. The duration of the cyclic prefix is equal to 800 ns, which is sufficient to enable good performance on channels with (rms) delay spread up to 250 ns (at least). To correct for subcarriers in deep fades, forward-error correction across the subcarriers is used with variable coding rates, giving coded data rates from 6 up to 54 Mbps. A key feature of the physical layer is to provide several physical layer modes with different coding and modulation schemes, which are selected by link adaptation. BPSK, QPSK and 16QAM are the supported subcarrier modulation schemes. Furthermore, 64QAM can be used in an optional mode. Forward error control is performed by a convolutional
code of rate 1/2 and constraint length seven. The further code rates 9/16 and 3/4 are obtained by puncturing. The modes are chosen such that the number of encoder output bits fits an integer number of OFDM symbols. To additionally accommodate tail bits, appropriate dedicated puncturing is applied before the actual code puncturing. In Table 8-1 the seven physical layer modes are specified, of which the first six are mandatory and the last one, based on 64QAM, is optional.

Table 8-1. Modes and modulation schemes of HIPERLAN/2
Mode | Modulation | Code rate | Bit rate (Mbps)
1 | BPSK | 1/2 | 6
2 | BPSK | 3/4 | 9
3 | QPSK | 1/2 | 12
4 | QPSK | 3/4 | 18
5 | 16QAM | 9/16 | 27
6 | 16QAM | 3/4 | 36
7 | 64QAM | 3/4 | 54

[Figure: transmitter block diagram from "PDU train from DLC" to "To the IF/RF units". Block labels: data scrambler, FEC, interleaver, constellation encoder, pilot insertion, IFFT, cyclic prefix, preambles.]
Figure 8-6. HIPERLAN/2 transmitter chain

The transmitter chain of the HIPERLAN/2 physical layer is illustrated in Figure 8-6. In the transmitter path, binary input data are encoded by a standard rate 1/2 convolutional encoder. The rate may be increased by puncturing the coded output bits. After interleaving, the binary values are modulated using PSK or QAM. The input bits are divided into groups of 1, 2, 4 or 6 bits and converted into complex numbers representing BPSK, QPSK, 16QAM or 64QAM values. To facilitate coherent reception, four pilot values are added to each 48 data values, so a total of 52 values is reached per OFDM symbol, which are modulated onto 52 subcarriers by applying the IFFT. To make the system robust to multipath propagation, a cyclic prefix is added. After this step, the digital output signals can be converted to analog signals, which are then up-converted to the 5 GHz band, amplified and transmitted through an antenna.
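The numbers given above fit together: with 48 data subcarriers, a 64-point IFFT at 20 MHz and a 16-sample (800 ns) cyclic prefix, each OFDM symbol lasts 4 µs, and the bit rates of Table 8-1 follow from the bits per subcarrier and the code rate. The sketch below recomputes the table and assembles one OFDM symbol from QPSK values. It is illustrative only: the subcarrier index layout, the pilot positions and values, and the QPSK bit labelling are assumptions that are not given in the text, and scrambling, coding, interleaving and preambles are omitted.

```python
import cmath
import math
from fractions import Fraction

SAMPLE_RATE_MHZ = 20        # sampling frequency stated in the text
FFT_SIZE = 64               # 64-point IFFT
CP_SAMPLES = 16             # 800 ns cyclic prefix = 16 samples at 20 MHz
DATA_SUBCARRIERS = 48       # data values per OFDM symbol

# Assumption: indices -26..26 without DC, pilots at +/-7 and +/-21,
# all pilot values equal to 1. None of this is specified in the chapter.
USED_INDICES = [k for k in range(-26, 27) if k != 0]
PILOT_INDICES = (-21, -7, 7, 21)
DATA_INDICES = [k for k in USED_INDICES if k not in PILOT_INDICES]


def bit_rate_mbps(bits_per_subcarrier, code_rate):
    """Information bits per OFDM symbol divided by the symbol interval (in us)."""
    info_bits = DATA_SUBCARRIERS * bits_per_subcarrier * Fraction(code_rate)
    symbol_interval_us = Fraction(FFT_SIZE + CP_SAMPLES, SAMPLE_RATE_MHZ)
    return info_bits / symbol_interval_us


def qpsk_map(bits):
    """Map bit pairs to unit-energy QPSK points (illustrative bit labelling)."""
    scale = 1 / math.sqrt(2)
    return [complex(2 * b0 - 1, 2 * b1 - 1) * scale
            for b0, b1 in zip(bits[0::2], bits[1::2])]


def ofdm_symbol(data_values):
    """Insert pilots, apply a 64-point inverse DFT, prepend the cyclic prefix."""
    if len(data_values) != DATA_SUBCARRIERS:
        raise ValueError("expected 48 data values")
    bins = [0j] * FFT_SIZE
    for k, value in zip(DATA_INDICES, data_values):
        bins[k % FFT_SIZE] = value          # negative indices wrap around
    for k in PILOT_INDICES:
        bins[k % FFT_SIZE] = 1 + 0j
    body = [
        sum(bins[k] * cmath.exp(2j * cmath.pi * k * n / FFT_SIZE)
            for k in range(FFT_SIZE)) / FFT_SIZE
        for n in range(FFT_SIZE)
    ]
    return body[-CP_SAMPLES:] + body        # cyclic prefix = copy of the tail


if __name__ == "__main__":
    modes = [(1, "1/2"), (1, "3/4"), (2, "1/2"), (2, "3/4"),
             (4, "9/16"), (4, "3/4"), (6, "3/4")]
    print([int(bit_rate_mbps(bits, rate)) for bits, rate in modes])
    symbol = ofdm_symbol(qpsk_map([0, 1] * DATA_SUBCARRIERS))
    print(len(symbol), "samples per OFDM symbol")
```

The direct inverse DFT is used instead of a fast transform only to keep the sketch dependency-free; a hardware or DSP implementation would use an IFFT, as the text says.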
The structure and the specifications of the physical layer receiver are not available from the HIPERLAN/2 standard. A generic HIPERLAN/2 receiver is illustrated in Figure 8-7.

[Figure: receiver block diagram from "From RF/IF units" to "PDU train". Block labels: synchronizer, cyclic prefix extractor, FFT, channel estimator, FEQ, pilot equalizer, constellation decoder, deinterleaver, FEC, descrambler.]
Figure 8-7. HIPERLAN/2 receiver chain

In Table 8-2 the physical layer timing parameters of the HIPERLAN/2 system are presented.

Table 8-2. HIPERLAN/2 physical layer timing parameters
PARAMETER | VALUE
Sampling frequency (f) | 20 MHz (T = 50 ns)
Useful symbol part duration (IFFT symbol) | 64 x T = 3.2 µs
Cyclic prefix duration | 16 x T = 0.8 µs
OFDM symbol interval | 80 x T = 4 µs
Subcarrier spacing | 0.3125 MHz (1/3.2 µs)
Spacing between the two outermost subcarriers | 16.25 MHz (52 x 0.3125 MHz)
Broadcast burst preamble duration | 16 µs
Downlink burst preamble duration | 8 µs
Uplink burst short preamble duration | 12 µs

3. IMPLEMENTATION PLATFORM DESCRIPTION

The ARM Integrator/AP AHB ASIC Development Platform has been selected for the prototyping of the HIPERLAN/2 system. The platform is designed for hardware and software development of devices and systems based on ARM cores and the AMBA bus specification.
The ARM Integrator supports up to four processors (core modules) stacked on the connectors HDRA and HDRB and up to four logic modules stacked on the connectors EXPA and EXPB (a total of five modules is supported, i.e. 2 core modules and 2-3 logic modules). The ARM Integrator provides:
• clocks and three counter/timers
• bus arbitration
• interrupt handling for the processors
• 32 MB of 32-bit wide flash memory
• 512 KB of 32-bit wide SSRAM
• 256 KB boot ROM (8 bits wide)
• PCI bus interface, supporting expansion on-board (3 PCI slots) or in a CompactPCI card rack
• External Bus Interface (EBI), supporting memory expansion.

The Integrator/AP also provides operating system support with flash memory, boot ROM, and input and output resources. Reads from the flash memory, boot ROM, SSRAM, and external bus interface are controlled by the Static Memory Interface (SMI).

3.1 Motherboard architecture

The motherboard hosts the connectors for the core and logic modules, which are connected in parallel to the system bus. The block diagram of the motherboard is shown in Figure 8-8. The system controller FPGA provides control functions for the platform (including bus arbitration, with up to six masters supported) and interfaces the core and logic modules (through the system bus) with the rest of the resources on the motherboard: the flash, SSRAM, ROM, PCI bridge, various peripherals (counters, clocks, GPIO, UARTs, keyboard and mouse, LEDs) and the interrupt controller.

The system bus is routed between FPGAs on core and logic modules and the AP. This enables the Integrator to support both the AHB and ASB bus standards. At reset, the FPGAs are programmed with a configuration image stored in a flash memory device. On the AP, the flash contains one image that configures the AP for operation with either an AHB or ASB system bus. On core and logic modules, the flash can contain multiple images so that the module can be configured to support either AHB or ASB.
Figure 8-8. ARM Integrator motherboard block diagram

3.2 Core modules

The core module board includes:
• ARM microprocessor chip (ARM7TDMI has been selected)
• 256 KB Synchronous SRAM (and relevant controller)
• SDRAM DIMM socket (256 MB)
• AMBA system bus interface to platform board
• Clock generators
• Reset controller
• JTAG interface to Multi-ICE™
• Core module FPGA providing system control functions for the core module, enabling it to operate as a standalone development system or attached to a motherboard. The FPGA implements:
1. SDRAM controller
2. System Bus Bridge
3. Reset controller
4. Interrupt controller
5. Status, configuration, and interrupt registers
• Multi-ICE, logic analyzer, and optional Trace connectors

The architecture of the core module is shown in Figure 8-9. The volatile memory (SSRAM and SDRAM) is located on the core module close to the CPU, so that it can be optimized for speed. This means that the memory bandwidth is significantly improved over previous development boards. Considerable effort has gone into ensuring optimal memory and AMBA bus performance. Actual figures depend on the speed of the microprocessor chip used, but typically they are in the region of 50 MHz for the SDRAM and 25 MHz for the AMBA system bus.

Figure 8-9. Core module architecture

3.3 Logic modules

The logic module comprises the following:
• Altera or Xilinx FPGA
• Configuration PLD and flash memory for storing FPGA configurations
• 1 MB ZBT SSRAM
• Clock generators and reset sources
• Switches
• LEDs
• Prototyping grid
• JTAG, Trace, and logic analyzer connectors
• System bus connectors to a motherboard or other modules

Up to 4 logic modules can be stacked on top of each other, and an Interface Module or an Analyzer Module may be fitted on top of the stack. Core and logic modules handle the interrupt signals differently. Core modules must receive interrupts, whereas logic modules, which implement peripherals, generate interrupts. The architecture of the logic module is shown in Figure 8-10.

Figure 8-10. Logic module architecture

When used with an Integrator motherboard, the logic modules require a system bus interface. The system bus interface connects the logic module with other Integrator modules. It must be implemented according to the AHB or ASB specifications. The logic module provides the general-purpose interface module connector EXPIM to enable you to add an interface module to the system. The connector provides access to two banks of input/output pins on the FPGA plus a number of control signals.

The logic module provides 1 MB of ZBT SSRAM and 4 MB of flash memory. A 256K x 32-bit ZBT SSRAM (Micron part number MT55LC256K32F) is provided with address, data, and control signals routed to the FPGA. The address and data lines to the SSRAM are completely separate from the AMBA buses. The flash memory is used for FPGA configuration, and must not be used for any other purpose. Configuration is managed by the configuration PLD.

4. SYSTEM LEVEL DESIGN

The system level design part of the methodology described in Chapter 4 (and presented in Figure 8-11) has been adopted for the development of the HIPERLAN/2 system. For the system level exploration, the OCAPI-XL environment has been employed (additional details can be found in Chapter 6).

Figure 8-11. System level design part of the proposed methodology (stages: Requirements/Specification Capture, Architecture Definition, System Partitioning, Mapping and System-Level Simulation, grouped as System-Level Design)

As part of the design process, system requirements have been documented, while the targeted functionality has been specified through the development of an executable model. Specifically, an ANSI C model has been developed for the MAC and physical layers' functionality of the HIPERLAN/2 system. The basic structure of the ANSI C model of the targeted functionality is shown in Figure 8-12.
Figure 8-12. Structure of the ANSI C model of the targeted functionality (MAC layer exchanging commands and feedback with the PHY layer; PHY layer blocks: PHY controller, feedback controller, binary transmit and receive algorithms, complex-based transmit and receive algorithms, RSS measurement control, RSS analysis, and a radio interface carrying Tx power, carrier frequency, Rx gain and RSS; legend: control, data, data + control)

The physical layer model is divided into two parts: algorithms based on complex numbers (mapping, OFDM, PHY bursts) and binary algorithms (scrambling, FEC, interleaving). A block diagram of the physical layer ANSI C model is shown in Figure 8-13. Physical layer submodules are designed as pipelined procedures with unit data processing. A number of configuration parameters are supported for the physical layer modules:
• width and position of point in fixed point numbers (separate for frequency domain, time domain, FFT calculations, FFT twiddle factors, channel correction and CFO cancellation multipliers)
• number of soft bits in Viterbi algorithm soft value representation
• time synchronization threshold, duration and time-outs
• the highest confidence level threshold of the de-mapper
• sizes of internal buffers (FFT buffers, receiver command buffer, receiver data buffer)
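The first configuration parameter above, the width and the position of the point of a fixed point number, can be illustrated with a small C sketch. This is not the chapter's ANSI C model: the type fxp_format, the struct phy_fxp_config and the helper functions are hypothetical names introduced here, and rounding to nearest with saturation is an assumed quantization policy.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one fixed point format. */
typedef struct {
    int width;      /* total word width in bits, sign bit included (2..32) */
    int frac_bits;  /* position of the point: number of fractional bits    */
} fxp_format;

/* Hypothetical per-domain configuration, mirroring the parameter list. */
typedef struct {
    fxp_format freq_domain;
    fxp_format time_domain;
    fxp_format fft_calc;
    fxp_format fft_twiddle;
    fxp_format channel_correction_mul;
    fxp_format cfo_cancellation_mul;
} phy_fxp_config;

fxp_format fxp_make(int width, int frac_bits)
{
    fxp_format f;
    assert(width >= 2 && width <= 32);
    assert(frac_bits >= 0 && frac_bits < width);
    f.width = width;
    f.frac_bits = frac_bits;
    return f;
}

/* Quantize x: round half away from zero, saturate on overflow (assumed policy). */
int32_t fxp_from_double(fxp_format f, double x)
{
    double max_v = (double)((((int64_t)1) << (f.width - 1)) - 1);
    double min_v = -(double)(((int64_t)1) << (f.width - 1));
    double scaled = x * (double)(((int64_t)1) << f.frac_bits);

    scaled = (scaled >= 0.0) ? scaled + 0.5 : scaled - 0.5;
    if (scaled >= max_v) return (int32_t)max_v;
    if (scaled <= min_v) return (int32_t)min_v;
    return (int32_t)scaled;  /* truncation toward zero completes the rounding */
}

/* Convert a stored fixed point value back to a real number. */
double fxp_to_double(fxp_format f, int32_t v)
{
    return (double)v / (double)(((int64_t)1) << f.frac_bits);
}
```

With an 8-bit word and 4 fractional bits, fxp_from_double(fxp_make(8, 4), 1.5) stores the integer 24, and values beyond the representable range saturate at 127 or -128. Keeping one format per processing domain, as in phy_fxp_config, allows word lengths to be explored independently for the FFT, the channel correction and the CFO cancellation.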
Figure 8-13. Physical layer ANSI C model - major functions and data structures. The diagram links the MAC sublayer model to the feedback controller (phy_fb_controller()), the PHY controller (phy_controller()) and the Rx command buffer (phy_rx_cmd_buffer_put(), phy_rx_cmd_buffer_get()). The transmit path comprises the scrambler (phy_tx_scrambler()), code terminator (phy_tx_code_terminator()), convolutional encoder (phy_tx_conv_encoder()), puncturing P1 and P2 (phy_tx_puncturing_P1(), phy_tx_puncturing_P2()), interleaver (phy_tx_interleaver()), mapper (phy_tx_mapper()), pilots and zeros insertion (phy_tx_pilot_ins()), phy_tx_fft_if(), phy_tx_ps_converter(), phy_tx_controller() and phy_tx_output(). The shared FFT comprises phy_fft_manager(), phy_fft_pipeline_stage_1() to phy_fft_pipeline_stage_6(), phy_fft_pipeline_multiplier_1() and phy_fft_pipeline_multiplier_2(). The receive path comprises phy_rx_time_sync(), phy_rx_coarse_cfo(), CFO cancellation (phy_rx_cfo_cancel()), phy_rx_fft_if(), phy_rx_ps_converter(), fine CFO estimation (phy_rx_fine_cfo()), flow switch (phy_rx_flow_switch()), channel estimation (phy_rx_channel_estimation()), channel correction (phy_rx_channel_correction()), de-mapper (phy_rx_demapper()), de-interleaver (phy_rx_deinterleaver()), de-puncturing P2 and P1 (phy_rx_depuncturing_P2(), phy_rx_depuncturing_P1()), Viterbi decoder (phy_rx_viterbi()), termination code cleaner (phy_rx_code_determinator()), de-scrambler (phy_rx_descrambler()), reception controller (phy_rx_controller()) and receiver buffer (phy_rx_data_buffer_put(), phy_rx_data_buffer_get()).

Physical layer submodules are implemented as procedures, which receive as standard parameters the request type, command, command parameters and data. Shared data are represented as global variables. Each submodule has a global variable whose value defines the procedure to which its output is directed. By default this variable is assigned the procedure corresponding to the next module in the physical layer hierarchy. Each physical layer module calls the next one when the data portion requested by the next module's interface is ready. Control information (commands) is forwarded synchronously with data, except for the shared FFT modules and the Viterbi algorithm internals.

A significant part of the high level design of the MAC layer is common to Access Point and Mobile Terminal devices. The MAC layer high-level design is focused on the external interfaces of the sub-layer and on its decomposition in the Access Point and Mobile Terminal cases. The block diagrams of the ANSI C models for the Access Point and the Mobile Terminal MAC layers are shown in Figure 8-14 and in Figure 8-15 respectively. In contrast to the physical layer, intercommunication between MAC modules is activated when a logically finished data structure is completely ready. Information is transferred in the form of memory pointers, or copied to a buffer.

Integration testing (simulation) of the ANSI C model has been performed. An environment with one Access Point (AP) and two Mobile Terminals (MTs) has also been emulated. During the emulation, different kinds of traffic have been passed between the AP and the MTs through an emulated channel in terms of fixed point complex numbers.

Analysis of the HIPERLAN/2 computational complexity and performance constraints led to the allocation of two core modules and two logic modules for the realization of the HIPERLAN/2 system on the ARM Integrator platform. Each core module includes an ARM7TDMI processor and each logic module includes a Xilinx Virtex E 2000 FPGA (0.18 µm, 6 metal layers, with 500K usable gates, 832 Kb of additional RAM (Block RAM) and built-in clock management circuitry (8 DLLs)). Logic and core modules communicate using the bus of the platform (AMBA).
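The calling convention of the physical layer submodules described earlier in this section (standard parameters, a global variable that selects the procedure receiving the output, and a call to the next module once the data portion it requests is ready) can be sketched in C as follows. The sketch is not taken from the chapter's model: every name in it is hypothetical, a bit inverter stands in for a real processing stage, and a block of four values stands in for the data portion requested by the next module.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical standard parameter set of a submodule procedure. */
typedef enum { REQ_DATA, REQ_COMMAND } req_type;
typedef void (*phy_proc)(req_type req, int cmd, const int *cmd_params,
                         const unsigned char *data, size_t len);

enum { BLOCK_LEN = 4 };  /* data portion requested by the next module */

/* Shared data represented as global variables. */
unsigned char sink_buf[64];
size_t sink_len = 0;
int sink_last_cmd = -1;

/* Last module of the demo chain: stores whatever it receives. */
void demo_sink(req_type req, int cmd, const int *cmd_params,
               const unsigned char *data, size_t len)
{
    size_t i;
    (void)req;
    (void)cmd_params;
    sink_last_cmd = cmd;
    for (i = 0; i < len && sink_len < sizeof sink_buf; i++)
        sink_buf[sink_len++] = data[i];
}

/* Global variable naming the procedure that receives the inverter's
   output; by default it is the next module in the chain. */
phy_proc demo_inverter_out = demo_sink;

/* Stand-in processing stage: inverts bits and forwards them, together
   with the command, only when a full block is ready. */
void demo_inverter(req_type req, int cmd, const int *cmd_params,
                   const unsigned char *data, size_t len)
{
    static unsigned char block[BLOCK_LEN];
    static size_t fill = 0;
    size_t i;
    for (i = 0; i < len; i++) {
        block[fill++] = (unsigned char)(data[i] ^ 1u);
        if (fill == BLOCK_LEN) {
            demo_inverter_out(req, cmd, cmd_params, block, BLOCK_LEN);
            fill = 0;
        }
    }
}

/* Feed n zero bits, one at a time; returns the number of values
   that have reached the sink so far. */
size_t demo_feed_zeros(int cmd, size_t n)
{
    static const unsigned char zero = 0;
    size_t i;
    assert(demo_inverter_out != NULL);
    for (i = 0; i < n; i++)
        demo_inverter(REQ_DATA, cmd, NULL, &zero, 1);
    return sink_len;
}
```

Feeding six bits delivers only one complete block of four to the sink; the remaining two stay buffered until two more arrive. Because the output is selected through a global variable, a module can be redirected (for example to a test probe) by assigning a different procedure to demo_inverter_out, without changing the module itself.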
Figure 8-14. Access Point MAC layer ANSI C model - major functions and data structures

Figure 8-15. Mobile Terminal MAC layer ANSI C model - major functions and data structures. Blocks shown: MAC interface, scheduler, DUC table, current frame, Rx switch, CRC and RCH handler, with the data structures scheduler, duc_table, cur_frame, tx_fifo, rx_switch and rch_handler. Functions shown include mac_get(), mac_ind(), mac_cmd(), mt_scheduler_cmd_handler(), mt_go(), mt_scheduler_prepare(), mt_scheduler_reset(), mt_scheduler_start(), duc_table_get(), duc_table_add(), duc_table_del(), duc_table_get_by_index(), list_reset(), list_del(), list_add_last(), mt_tx_fifo_put(), mt_tx_fifo_get(), fc_reset(), fc_bch_data_add(), fc_bch_get(), fc_fch_ie_blk_data_add(), fc_fch_ie_get(), mt_phy_manager_cmd_handler(), mt_phy_manager_rch_req(), mt_rx_switch_reset(), mac_src_calc(), mac_src_check(), mt_rch_handler() and p2m_fb_handler(). The model receives commands from EC/RLC and connects to phy_fb_controller() and phy_controller().

The architecture of the ARM Integrator instance that has been selected for the realization of the HIPERLAN/2 system is shown in Figure 8-16. The first core module (ARM7TDMI processor) acts as protocol processor, realizing the major part of the HIPERLAN/2 DLC functionality. The second core module (ARM7TDMI processor) realizes the lower part of the HIPERLAN/2 MAC functionality and also controls the operation of the baseband block. The first logic module (Xilinx FPGA) realizes the frequency and data domain parts of the receiver. The second logic module (Xilinx FPGA) realizes the transmitter, the time domain blocks of the receiver, the interface to MAC and a slave interface to an AMBA bus.

Figure 8-16. Architecture of selected ARM Integrator platform instance (core module #1: protocol processor; core module #2: lower MAC and modem control processor; each core module with SRAM, SRAM controller and AHB bus interface; AMBA AHB; system control FPGA with AMBA arbiter, Ethernet controller, PCI controller and external bus interface; top logic module: Tx path and Rx time domain, MAC/PHY interface; bottom logic module: Rx data and frequency domain related blocks; analog RF/IF)

It must be taken into consideration that the ARM Integrator platform is used to emulate a targeted reconfigurable System-on-Chip. Both custom and reconfigurable hardware components of the targeted reconfigurable System-on-Chip are emulated by the FPGAs of the ARM Integrator logic modules. System level partitioning and task assignment exploration has been performed using the OCAPI-XL C++ library. Using the ANSI C model as input, OCAPI-XL models of the HIPERLAN/2 MAC and physical layers have been developed. The block diagram of the physical layer OCAPI-XL model is shown in Figure 8-17. The block diagrams of the Access Point and Mobile Terminal MAC layer OCAPI-XL models are shown in Figure 8-18 and Figure 8-19 respectively.

Figure 8-17. Block diagram of the physical layer OCAPI-XL model. Processes shown include fb_ctrl, phy_ctrl, tx_scrambler, tx_fec (code_terminator, conv_encoder, puncturing_p1, puncturing_p2), interleaver, mapper, pilot_ins, tx_fft_if, tx_ps_converter, tx_ctrl, time_sync, coarse_cfo, cfo_cancel, rx_fft_if, rx_ps_converter, fine_cfo, rx_flow_switch, channel_estimation, channel_correction, demapper, deinterleaver, rx_fec (depuncturing_p2, depuncturing_p1, viterbi, code_determinator), rx_scrambler and rx_controller, together with the shared FFT processes (manager, pipeline, sp_converter) and shared data structures such as cc_factors, data_buffer and rx_data_buffer. Legend: external function or event, process, input data, process group, shared data structure, OCAPI-XL message, data read or write, semaphore, semaphore post(), semaphore wait().
Figure 8-18. Block diagram of the Access Point MAC layer OCAPI-XL model (connections from and to the EC/RLC sub-layer and the PHY layer model; elements shown: miu, databuf, ap_cmd_hdl, ap_scheduler, ap_duc_table, ap_cur_frame, ap_mac_cfg, ap_cur_rrs, ap_phy_mgr, ap_rx_sw_info, ap_rx_sw)

Figure 8-19. Block diagram of the Mobile Terminal MAC layer OCAPI-XL model (connections from and to the EC/RLC sub-layer and the PHY layer model; elements shown: mt_ram, miu, databuf, mac_upp_interface, mt_cmd_hdl, mt_scheduler, mt_duc_table, mt_frame_control, mt_mac_cfg, mt_phy_mgr, mt_rch_handler, mt_rx_sw, mac_phy_interface)

For the high level exploration, high-level OCAPI-XL processes (procHLHW, procHLSW and procManagedSW) have been used to model the timing behavior of the HIPERLAN/2 tasks under different implementation scenarios. Using the performance estimation capabilities of OCAPI-XL (in terms of execution cycles), different mappings of HIPERLAN/2 tasks on hardware and software have been evaluated and the most promising solution has been identified.

Under the post-shipment functionality upgrading scenario, the HIPERLAN/2 system must be able to support (after upgrade) outdoor fixed wireless access operation. In that context, the tasks that are more complex, and consequently difficult to upgrade, are identified and assigned to reconfigurable hardware. Tasks of this kind are the receiver's channel estimation and correction block and the receiver's decoding block (Turbo and/or Reed-Solomon decoders are required in outdoor environments, while a Viterbi decoder is included in the HIPERLAN/2 standard).

5. IMPLEMENTATION

The implementation phase of the HIPERLAN/2 system corresponds to the detailed design and implementation design stages of the design flow described in Chapter 4 (they are also shown in Figure 8-20). As a first step, the high level OCAPI-XL model developed during high level design has been refined. The refinement included changing the process types from high level to low level (procOCAPI1 and procANSIC). This allowed a cycle accurate simulation of the complete system functionality and confirmation that timing constraints are met.

For the tasks assigned to instruction set processors, C code has been developed and mapped on the ARM7TDMI processors of the core modules. The tools used for the software development process include:
• Code generation tool: the ARM, THUMB C and Embedded C++ compilers.
• Integrated Development Environment: CodeWarrior IDE.
• ARM Extended Debugger: debugging environment for processor cores. It provides an interface to the ARMulator and can be used to debug code on an ARM Evaluation Board.
• Instruction Set Simulator (ARMulator): simulates a target system in software, allowing software development when a hardware target is not available.

Figure 8-20. Detailed and implementation design parts of the proposed methodology (detailed design: Specification Refinement, Hardware Design, Software Design, Reconfigurable Hardware Design, External IP, Integration, Co-Verification; implementation design: FPGA/ASIC Implementation Design, Software Implementation Design, Verification, FPGA Downloading/Silicon Manufacturing, Product Qualification)

Execution times for basic tasks of the HIPERLAN/2 DLC/MAC are presented in Table 8-3. The results have been obtained with an operation frequency of 50 MHz (cycle 20 ns). The code and the data for the tasks are stored in SDRAM memory.

The detailed architecture of the functionality realized by the logic modules of the platform is shown in Figure 8-21. A typical FPGA flow has been adopted for the realization of the tasks assigned to the platform's logic modules (mainly the baseband part of HIPERLAN/2). The tools used include:
• Modelsim for simulation
• Leonardo Spectrum for synthesis
• Xilinx ISE tools for backend design
Table 8-3. Execution times for basic tasks of HIPERLAN/2 DLC/MAC layer (where AP: Access Point, MT: Mobile Terminal, CL: Convergence Layer, Tx: Transmitter, Rx: Receiver)

MT-BCH/FCH Decoder, Modem Ctrl tasks                                         Execution time
Initialization Phase (Reset & Config@slot commands)                          1.20 µs
Synchronisation Phase (BCH_SRCH, Rx_FCH with rpt=1, Rx_ACH)                  2.65 µs
BCH decoding and BCH CRC checking                                            5.25 µs
Decoding of a single IE (UL)                                                 3.23 µs
Decoding of 3 IEs (2 ULs, 1 DL) including CRC checking & Puncturing          15 µs

MAC Layer DLC tasks                                                          Execution time
AP-Scheduler                                                                 0.2 ms
AP/MT-Tx CL                                                                  0.6 ms
AP/MT-Tx Builder (full frame)                                                0.7 ms
AP/MT-Tx Builder, Copy using DMA (580 bytes, word transfer)                  15 µs
AP/MT-Rx Decoder                                                             0.4 ms
AP/MT-Rx CL                                                                  0.7 ms

The total utilization of the bottom logic module (FPGA) is 85%. The total utilization of the top logic module is 89%. The utilization per resource type for the bottom and the top logic modules is presented in Table 8-4.

In order to fully realize the 5 GHz wireless LAN access point and mobile terminal components, the baseband modem's functionality is followed by an IF (20 MHz to 880 MHz) and an RF (880 MHz to 5 GHz) stage. The analog-to-digital and digital-to-analog conversion (National Semiconductors LMX5301 and LMX5306), for communicating with the IF analog front ends of the receiver and the transmitter respectively, is implemented on a separate board which sits on a dedicated connector for external communications on the "top" of the stack of logic modules. The communication with the PCI or Ethernet interface is also done through that port.

[Figure: block diagram showing the motherboard, the top logic module (EXPB), ADC and DAC, Tx and Rx memory interfaces, the AMBA slave, the analog IF/RF and the clock signals; truncated in the source]
