BROOM: An Open-Source Out-of-Order Processor With Resilient Low-Voltage Operation in 28-nm CMOS (Part 2)
SECTION 5
MICROARCHITECTURAL ASSIST TECHNIQUES
Low-voltage operation improves energy efficiency. Unfortunately, SRAM-based memories tend to fail first as voltage is lowered, suffering as much as an order-of-magnitude (10×) increase in bit errors for every 50 mV reduction in Vdd. To enable low-voltage operation, we implemented a number of features that allow the processor to tolerate higher error rates. All of these techniques were implemented at the RTL level in Chisel:
line disabling (LD);
line recycling (LR);
dynamic column redundancy (DCR);
bit bypass with SRAM (BB-S) for tag protection.
A built-in self-test checks for erroneous bits at boot-time after the voltage has been set. SRAM-based cache lines with bad bits can be disabled (LD). LD is a common technique, but it reduces capacity. Some of the capacity can be recovered by using line recycling—three disabled lines can be aggregated via majority-vote to regain 33% of the disabled line capacity, so long as each line’s failing bit is in a different location. LR was only used to protect the L2 cache data arrays.
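The majority-vote idea behind line recycling can be illustrated with a minimal Python sketch. This is a bit-level software model, not the Chisel RTL used on the chip; the function names, the 8-bit line width, and the stuck-at fault model are assumptions made for illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three equal-width words."""
    return (a & b) | (a & c) | (b & c)


def can_recycle(fault_masks: list[int]) -> bool:
    """Three disabled lines can be combined only if no bit position is
    faulty in more than one of them (a mask has a 1 at each faulty bit)."""
    m0, m1, m2 = fault_masks
    return (m0 & m1) == 0 and (m0 & m2) == 0 and (m1 & m2) == 0


def read_faulty(stored: int, fault_mask: int, stuck_value: int) -> int:
    """Model a line whose faulty bits read back a stuck value (assumed fault model)."""
    return (stored & ~fault_mask) | (stuck_value & fault_mask)


data = 0b1011_0010
masks = [0b0000_0001, 0b0001_0000, 0b1000_0000]  # one bad bit per line
recyclable = can_recycle(masks)

# Write the same data to all three lines; every faulty bit reads back inverted.
copies = [read_faulty(data, m, ~data) for m in masks]
recovered = majority_vote(*copies)
```

Because each bit position is wrong in at most one of the three copies, the vote returns the original data. Three physical lines then hold one logical line, which is where the 33% recovery figure comes from.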
Dynamic column redundancy (DCR) adds an extra bit to each cache set, and uses a multiplexer shift driven by a Redundancy Address to dynamically avoid the erroneous bit. Finally, the Bit Bypass with SRAM (BB-S) technique focuses on protecting erroneous bits in the tag arrays. Bit bypass uses flip-flops to store the necessary repair bits to fix a limited number of bad entries. Our BB-S scheme stores the repair error location information in SRAM for every tag entry, saving on area and reducing the BB tag search to find potential matching entries.
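The shift multiplexing used by DCR can be sketched in a few lines of Python. This models a single word with one spare column and is only an illustration under assumed names (`dcr_write`, `dcr_read`, `ra`); in the design described above, the Redundancy Address is kept per cache set and the shifting is done by hardware multiplexers.

```python
def dcr_write(data: int, width: int, ra: int) -> int:
    """Spread `width` data bits over `width + 1` columns, skipping column `ra`."""
    data &= (1 << width) - 1
    low = data & ((1 << ra) - 1)      # bits below the bad column stay in place
    high = data >> ra                 # bits at or above it shift up by one column
    return low | (high << (ra + 1))   # column `ra` is left unused


def dcr_read(columns: int, width: int, ra: int) -> int:
    """Inverse shift: drop column `ra` and close the gap."""
    low = columns & ((1 << ra) - 1)
    high = columns >> (ra + 1)
    return (low | (high << ra)) & ((1 << width) - 1)


WIDTH = 4
word = 0b1011
stored = dcr_write(word, WIDTH, ra=2)
corrupted = stored | (1 << 2)         # the bad column reads back as 1
restored = dcr_read(corrupted, WIDTH, ra=2)
```

Setting `ra` equal to `width` points at the spare column itself, which covers the case where the word has no failing data column.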
The resilient cache design, including the measurement results, is discussed in more detail in [6].
SECTION 6
AGILE DESIGN APPROACH
Figure 4 shows all chip builds, and their critical path lengths, performed over a four-month period as part of the BROOM tapeout effort. This effort involved the microarchitectural transformation of BOOMv1 to BOOMv2 and the physical design of the chip. Data from postsynthesis (“syn”) and post-place-and-route (“par”) are shown and include builds performed at both the slow-slow (SS) and typical-typical (TT) corners. For our flow, the SS corner was 0.81 V at 125 °C and the TT corner was 0.9 V at 25 °C. Early builds were only of a BOOM core plus an L2 cache, while later builds add in the resiliency (“res”) hardware. One should be careful about drawing conclusions from this figure; most builds resulted in LVS and DRC violations, and many changes were made between each build. For example, early builds explored shrinking structure sizes to find the most fundamental critical paths, while later builds sought to find the upper limits of structure sizing before the post-place-and-route critical path noticeably worsened.

Figure 4. All VLSI builds are shown by date. Both slow-slow (SS) and typical-typical (TT) corners are shown. RVT cells were used initially, but were replaced with LVT cells starting in July. In the last month of the implementation effort, we added in the resiliency hardware (“res”) central to the research thesis of the chip, which added to the critical path. While our design efforts slowly improved the postsynthesis critical paths, post-place-and-route reports showed the clock frequency was less amenable to our efforts. Not shown is the impact of our design efforts on removing any LVS and DRC errors. Thus, many of the builds do not represent a manufacturable design.
The BROOM tapeout effort started with a preliminary analysis of BOOM’s quality of results (QoR). This analysis was performed using RVT-based cells and targeting the TT corner. By changing BOOM’s configurations, we could build an intuition of which critical paths were truly critical and arrive at a plan of action for addressing these paths with a mixture of microarchitectural changes and physical design effort. For example, by removing an execution unit or shrinking the issue window size, we could better understand the benefits of design changes that would provide fewer issue ports per issue window. At this stage, we had concluded that four critical paths needed to be managed. As previously mentioned in Section 3, these critical paths were as follows:
issue window select;
register rename busy-table read;
conditional BPD redirect;
register file read.
The microarchitectural changes to address the first two items together took one month. We also quickly prototyped a new frontend design that approximated a critical path fix for item three but was otherwise functionally incorrect. This frontend prototype helped justify the necessary design work before we committed to a full redesign of the frontend. We began testing these new changes in mid-May and labeled the new design BOOMv2. Figure 4 shows the cluster of activity that corresponds to the BOOMv1 and early BOOMv2 analysis. After the initial BOOMv2 analysis was performed, another month of design effort went into BOOM to finish implementing the new frontend design and to apply changes based on the initial performance feedback.
Half-way through the design cycle (two months into the effort), as the BOOMv2 RTL effort was wrapping up, the implementation focus switched to physical design. Parameters in BOOM, for example, the ROB size or the BPD sizing, were reduced to get a better feel for the fundamental critical paths that still required work and to find which modules had the greatest effect on DRC and LVS errors. At this stage, the clock frequency improved as the BOOM parameters were changed to instantiate a smaller BOOM core.
Once the BOOM microarchitecture was settled, we added the resiliency hardware to the design. Some of these resiliency structures are on the critical paths of SRAM accesses. Thus, any chip builds with resiliency hardware enabled may generate analysis reports that hide critical paths that still need attention in the BOOM RTL. To allow improvements to both the resiliency structures and to the BOOM core to occur in parallel, we continued to perform chip builds with and without the resiliency hardware enabled.
As our attention shifted to physical design issues, the major issue was the design of a 6-read, 3-write register file. Semicustom design was chosen over placement hints to the tools, for better QoR and faster design convergence.
For the final stage of the implementation effort, we focused on fixing LVS and DRC errors while continuing to make small improvements to the critical paths that showed up in the place-and-route reports. We also began to increase the sizes of structures in BOOM that were no longer on the critical path in the post-place-and-route reports. For example, we quadrupled the size of the BPD.
Over the course of our tape-out effort, the syn results slowly improved. This was aided by our RTL productivity and our 2–3-h synthesis runs. However, par results proved more stubborn and stayed mostly flat. Congested designs took 16 h to route, giving us less time to iterate, and changes in placements resulted in unintuitive changes in the resulting critical paths. Alas, most par effort was focused on fixing DRC and LVS issues and not on fixing timing.
The design was taped out after the four-month design cycle. As the goal of this design project was to explore superscalar processor design in an ASIC flow, we continued making changes to the RTL to improve the QoR. Each additional build continued to provide us with new critical paths to address. The final critical path of the placed-and-routed design was through the resiliency error logging code. Our final sign-off at the SS corner was 1.17 ns after synthesis and 1.68 ns after place-and-route. The resulting chip was demonstrated to run up to 1.0 GHz at the nominal 0.9 V and down to 0.47 V at 70 MHz with assist techniques. Without assistance, BROOM was able to operate down to 0.6 V.
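For readers who think in clock frequency and not path delay, the sign-off numbers above can be converted with a one-line calculation. The sketch below simply inverts the critical path length and ignores any clock uncertainty or margin the sign-off flow may have applied; the function name is illustrative.

```python
def max_frequency_mhz(critical_path_ns: float) -> float:
    """Upper bound on clock frequency if the clock period equals the critical path."""
    return 1e3 / critical_path_ns


syn_mhz = max_frequency_mhz(1.17)  # postsynthesis, SS corner: about 855 MHz
par_mhz = max_frequency_mhz(1.68)  # post-place-and-route, SS corner: about 595 MHz
```

The measured 1.0 GHz at the nominal 0.9 V is higher than either figure, which is plausible given that the SS corner (0.81 V at 125 °C) is a pessimistic operating point.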
The design that has been fabricated is not the ultimate BOOM design, as both its clock frequency and its IPC can be improved. The measured CoreMark performance was 3.77 CM/MHz with 1.11 IPC, limited by the branch prediction accuracy and the long load-to-use delay introduced while fixing timing paths. While some issues have since been addressed, such as the addition of load-cache-hit speculation bypassing to improve load-to-use latency, other improvements are ongoing. Future VLSI implementation efforts can continue from a known, good design point and can build on the early exploratory builds that were needed for the BROOM tapeout.
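The two reported metrics can be related with some simple arithmetic. The sketch below assumes that the CM/MHz and IPC figures come from the same run, and that the per-MHz score stays constant when the clock frequency changes; neither assumption is stated in the text, and the function names are illustrative.

```python
CM_PER_MHZ = 3.77   # CoreMark iterations per second, per MHz of clock
IPC = 1.11          # instructions retired per cycle


def coremark_score(freq_mhz: float) -> float:
    """CoreMark iterations per second at a given clock frequency."""
    return CM_PER_MHZ * freq_mhz


def instructions_per_iteration() -> float:
    """One MHz supplies 1e6 cycles per second, so cycles per iteration
    is 1e6 / CM_PER_MHZ; multiplying by IPC gives instructions."""
    return IPC * 1e6 / CM_PER_MHZ


score_at_1ghz = coremark_score(1000.0)        # about 3770 iterations/s
insts_per_iter = instructions_per_iteration() # roughly 294,000 instructions
```

Under these assumptions, raising either the clock frequency or the IPC raises the absolute CoreMark score, which matches the two improvement directions named above.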
SECTION 7
CONCLUSION
BOOM is an open-source OoO superscalar RISC-V processor that can be used for architecture exploration and education, as well as in practical industrial designs. Modern OoO processors rely on a number of memory macros and arrays of different shapes and sizes, and many of them appear in the critical path when designed in a standard ASIC flow. The impact on the actual critical path is hard to assess by using flip-flop-based arrays and academic/educational modeling tools, because they may either yield physically unimplementable designs or generate designs with poor performance and power characteristics. Rearchitecting the design by relying on a handcrafted, yet synthesizable, register file array and leveraging hardware generators written in Chisel helped us isolate real critical paths from false ones. This methodology narrows down the range of arrays that would eventually have to be handcrafted for a serious production-quality implementation. Describing hardware using generators also helped us explore multiple design points, with the final design choices being committed to later in the design cycle.
Chisel is a highly expressive language. With proper software engineering of the code base, radical changes to the datapaths can be made very quickly. However, physical design is often a stumbling block to agile hardware development. Small changes could be reasoned about and executed swiftly, but larger changes could change the physical layout of the chip and dramatically affect critical paths and the associated costs of the new design point.
The BOOM core is still being developed and we can expect further refinements.
[1] K. Yeager, "The MIPS R10000 superscalar microprocessor", IEEE Micro, vol. 16, no. 2, pp. 28-41, Apr. 1996.
[2] R. Kessler, "The Alpha 21264 microprocessor", IEEE Micro, vol. 19, no. 2, pp. 24-36, Mar./Apr. 1999.
[3] D. G. Chinnery et al., "Closing the power gap between ASIC and custom: An ASIC perspective", Proc. 42nd Annu. Des. Autom. Conf., pp. 275-280, 2005.
[4] M. Anderson, "A more cerebral cortex", IEEE Spectrum, vol. 47, no. 1, pp. 58-63, Jan. 2010.
[5] S. J. Wilton et al., "CACTI: An enhanced cache access and cycle time model", IEEE J. Solid-State Circuits, vol. 31, no. 5, pp. 677-688, May 1996.
[6] P.-F. Chiu et al., "Cache resiliency techniques for low-voltage RISC-V out-of-order processor in 28 nm CMOS", IEEE Solid-State Circuits Letters, 2019.