1 Overview
This 3D RISC-V CPU (called V-Rio) is a template for a series of 3D CPUs. A chip may contain from one to several banks of CPU cores, scaling from a small silicon area to a large one. Each bank contains 16 RISC-V cores and 4 V-Cache dies.
Core: 16NB cores (NB = number of banks) with 64-bit data width and a 12-stage, 3-issue / 8-execution-unit pipeline, supporting the RV64GC instruction set.
NoC: supports the AMBA CHI protocol, version 0050E.b, and implements HN-F, HN-I, RN-I, SN-F and XP.
Cache: 64 KB instruction and data caches with cache coherency support. Hardware cache coherency efficiently keeps all caches consistent. The cluster-shared L2 cache is 1 MB. Data consistency among the TLB, I-Cache and D-Cache is optimized through software and hardware collaboration.
1.1 Processor Features
The main features of the 3D RISC-V CPU are:
Level | Item | Description | Comment |
---|---|---|---|
Chip | Core | 16NB | 2M, NB = No. of banks |
Chip | Cluster | 8NB | 1 cluster has 2 cores (or 4 cores in another version) |
Chip | Bit width | 64 | |
Chip | ISA | RV64GC | |
Chip | NoC | AMBA CHI | Version 0050E.b; HN-F, HN-I, RN-I, SN-F and XP |
Chip | No. of V-Cache | 4NB (same as DDR) | Shared cluster cache (LLC); memory die bonded to the logic die with µbumps |
Chip | DDR version | 4 | |
Chip | DDR bit width | 128 | |
Chip | No. of DDR interfaces | 4NB | |
Chip | No. of PCIe interfaces | 2NB | |
V-Cache | V-Cache density | 8MB? | |
V-Cache | Bit width | 128 | |
V-Cache | No. of banks | | |
V-Cache | Sub-array configuration | | |
V-Cache | Coherence protocol | CHI | |
Cluster | No. of cores | 2 | (~8 MPW seats) |
Cluster | Coherence protocol | CHI | |
Cluster | L2 cache | 1MB | Per cluster |
Cluster | L2 cache line size | 64B | |
Cluster | L2 cache ECC error protection | Supported | |
Cluster | Bus interface | CHI | |
Core | Core area estimation @12nm | 7.6 mm² | Core area estimated from the L3 cache area (~2 MPW seats) |
Core | Pipeline stages | 12 | |
Core | Issue width | 3 | |
Core | Execution units | 8 | |
Core | I-Cache per core | 64KB | With cache coherency support; can be configured as 32KB |
Core | D-Cache per core | 64KB | With cache coherency support; can be configured as 32KB |
Core | Branch target buffer | | |
Core | ISA | RV64GC? | |
Core | Memory management unit | Sv39 memory management? | |
Core | Bus interface | AXI4-128 master interface? | |
Core | Interrupt controller | Configurable Platform-Level Interrupt Controller (PLIC) | |
Core | Floating-point unit | Supports the RISC-V F and D instruction extensions; supports the IEEE 754-2008 standard? | |
NoC reference: https://github.com/RV-BOSC/OpenNoC/tree/master
1.2 Block Diagram
1.3 Major Components
The following list describes the major components and abbreviations used in the 3D RISC-V CPU design.
Abbreviation | Description |
---|---|
ALU | Arithmetic Logic Unit |
NoC | Network On Chip |
ACPU | Application Central Process Unit |
SCU | Snoop Control Unit |
SF | Snoop Filter |
PLIC | Platform Level Interrupt Controller |
HN-F, HN-I | Fully Coherent Home Node / Non-coherent Home Node |
RN-I | I/O Coherent Request Node |
SN-F, SN-I | Subordinate Node used for normal memory / Subordinate Node used for peripherals or normal memory |
2 Manycore Top Level
2.1 Manycore Diagram
The top level of the 3D RISC-V CPU design consists of 8 CPU clusters. Each CPU cluster instantiates
2 CPU core tops.
2.2 NoC and Interface
2.2.1 NoC Overview
The manycore CPU system leverages the OpenNoC interconnect, an AMBA CHI protocol-compliant (version 0050E.b) Network-on-Chip (NoC) designed to connect multiple cores, memory controllers, and peripherals. The OpenNoC implementation includes the key components HN-F (fully coherent Home Node), HN-I (non-coherent Home Node), RN-I (I/O coherent Request Node), SN-F (Subordinate Node for normal memory), and the XP routers.
The system supports scalable topologies (e.g., mesh, ring) and provides configurable parameters for optimizing performance, coherence, and resource utilization.
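As an illustration of how such a scalable topology could be described at a high level, the following Python sketch builds a small mesh of XP routers, attaches CHI node types to their ports, and routes between them with dimension-ordered routing. All names, dimensions, and the routing policy are assumptions chosen for illustration; they do not reflect the OpenNoC configuration interface.

```python
# Illustrative only: a toy mesh of XP routers with CHI node types attached.
# Names, dimensions and the routing policy are assumptions, not the OpenNoC
# configuration interface.
from dataclasses import dataclass, field

NODE_TYPES = {"RN-F", "RN-I", "HN-F", "HN-I", "SN-F"}

@dataclass
class XP:
    """One router in an x-y mesh."""
    x: int
    y: int
    attached: list = field(default_factory=list)   # CHI nodes on the local ports

    def attach(self, node_type: str, node_id: int):
        assert node_type in NODE_TYPES
        self.attached.append((node_type, node_id))

def build_mesh(cols: int, rows: int):
    """Create a cols x rows mesh keyed by (x, y); neighbours differ by one hop."""
    return {(x, y): XP(x, y) for x in range(cols) for y in range(rows)}

def route_xy(src, dst):
    """Dimension-ordered (X-then-Y) routing, a common deadlock-free choice."""
    x, y = src
    hops = []
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        hops.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        hops.append((x, y))
    return hops

# Example: one bank modeled as a 4x2 mesh, two clusters (RN-F) on one XP and
# a memory controller (SN-F) on another.
mesh = build_mesh(4, 2)
mesh[(0, 0)].attach("RN-F", 0)
mesh[(0, 0)].attach("RN-F", 1)
mesh[(3, 1)].attach("SN-F", 0)
print(route_xy((0, 0), (3, 1)))   # [(1, 0), (2, 0), (3, 0), (3, 1)]
```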
2.2.2 HN-F Function
The HN-F included in this project implements the PoC (Point of Coherence) and PoS (Point of Serialization) functions specified by the CHI protocol (see Section 1.6 of the CHI E.b specification for details). It contains a configurable LLC cache and a Snoop Filter (SF) that reduces the number of Snp messages sent. The other functions supported by HN-F are as follows:
1. The CHI transactions in the table below
2. The UC, SC, UD and I cache line states of the CHI protocol
3. Exclusive access
4. QoS
HN-F receives Req messages from the RXREQ channel. Each received Req message corresponds to a Transaction, and a Transaction may include multiple Rsp, Snp and Dat messages. HN-F cooperates with RN and SN-F to complete the entire Transaction.
HN-F can be viewed as a state machine that behaves as follows:
1. Read and update the internal cache data and state (including LRU)
2. Read and update the SF
3. Read and update the exclusive monitor
4. Read and update the QoS-related registers (including the internal arbitration registers)
5. Send Req, Rsp, Snp and Dat messages
HN-F chooses which behavior to perform at a given time based on its internal state and the received messages. The mapping between state and behavior is only partially described here (a simplified transaction flow is sketched after the list):
1. The cache block state determines which type of message to send and the new state of the cache block. The cache block data determines the content of the Data field in Dat messages.
2. The SF content determines the fields and the number of Snp messages to be sent.
3. The exclusive monitor determines the content of the RespErr field, etc.
4. The QoS-related registers determine when a message is sent, etc.
5. The received Req, Rsp and Dat messages determine the transaction flow (this is realized by modifying the MSHR status bits).
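As an illustration of this state-to-behavior mapping, the following Python sketch walks a single read request through a highly simplified HN-F: the LLC state decides whether data can be returned directly, the SF decides whether and where to snoop, and a memory read is issued otherwise. The function and structure names are assumptions for illustration only; this is not the full RTL behavior.

```python
# Highly simplified, illustrative HN-F handling of one read request.
# llc, sf and send are assumed stand-ins for the LLC/Tag state, the Snoop
# Filter contents and the LI transmit path; they are not real module names.
def handle_read(addr, llc, sf, send):
    """llc: dict addr -> (state, data); sf: dict addr -> set of RN-F IDs."""
    state, data = llc.get(addr, ("I", None))

    if state in ("UC", "SC", "UD"):      # LLC hit: return data directly on TXDAT
        send("TXDAT", {"op": "CompData", "addr": addr, "data": data})
        return "done"

    sharers = sf.get(addr, set())
    if sharers:                          # SF hit: some RN-F may hold the line -> snoop it
        for rn in sharers:
            send("TXSNP", {"op": "SnpShared", "addr": addr, "tgt": rn})
        return "wait_snp"                # MSHR entry waits for Snp responses on RXRSP/RXDAT

    send("TXREQ", {"op": "ReadNoSnp", "addr": addr, "tgt": "SN-F"})
    return "wait_mem"                    # MSHR entry waits for CompData from SN-F on RXDAT
```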
HN-F consists of the following five components. The following figure shows the HN-F data path:
(1) Link Interface
LI (Link Interface) contains the interfaces through which HN-F interacts with the outside world and implements the sending and receiving of flits (CHI protocol messages) according to the CHI channel specification. The external channels of LI are RXREQ, RXRSP, RXDAT, TXREQ, TXRSP, TXSNP and TXDAT. The main functions of LI are the packet assembly function of the Network Layer specified by the protocol and the flow-control function of the Link Layer (Section 14.2). In addition, it arbitrates the message-sending requests coming from different sources inside HN-F and unpacks the RX channels.
(2) MSHR
MSHR is the control center of HN-F. It receives messages from LI, requests from the BIQ (Back Invalidation Queue), and results returned by the Cache Pipeline, and modifies its internal control bits accordingly. In each cycle, MSHR determines its behavior based on the state of those control bits: it issues read or update requests for the LLC and SF to the Cache Pipeline, and asks LI to send outbound messages.
The internal storage structure of MSHR is a register stack with multiple MSHR entries. Each transaction (and the replacement transaction it brings) corresponds to one of them. Each entry includes the control bits required by all transactions, as well as some information of the transaction itself, such as SrcID and address. This allows MSHR to process multiple transactions at the same time. When Req arrives, it is determined whether it can enter MSHR based on QoS.
Because transactions are processed in parallel, MSHR also includes a conflict-handling mechanism in order to realize the PoC and PoS functions: later transactions to the same address hibernate until the earlier transaction completes.
Since the LI external channels and the Cache Pipeline can each serve only one transaction's operation per cycle, MSHR also arbitrates among transactions for these resources.
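A minimal sketch of the conflict rule described above, under assumed entry and field names: a request whose cache-line address matches a live MSHR entry is put to sleep behind it and only re-arbitrates once that entry retires.

```python
# Minimal sketch (assumed names) of MSHR allocation with same-address conflict
# handling: a later transaction on the same 64 B line hibernates behind the
# older entry, preserving the PoC/PoS ordering described above.
LINE = 64

class MSHR:
    def __init__(self, num_entries=32):
        self.entries = {}                       # idx -> {"addr": line_addr, "sleeping": [...]}
        self.free = list(range(num_entries))

    def allocate(self, req):
        line = (req["addr"] // LINE) * LINE
        for entry in self.entries.values():
            if entry["addr"] == line:           # same-address conflict: hibernate
                entry["sleeping"].append(req)
                return None
        if not self.free:
            return None                         # no free entry: QoS/retry path (not modeled)
        idx = self.free.pop()
        self.entries[idx] = {"addr": line, "sleeping": []}
        return idx

    def retire(self, idx):
        entry = self.entries.pop(idx)
        self.free.append(idx)
        return entry["sleeping"]                # woken requests re-arbitrate for allocation
```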
(3) Data Buffer
DBF (Data Buffer) is a register stack with the same number of entries as MSHR, in one-to-one correspondence. Each entry holds data and Byte Enable bits (with the same meaning as BE in the protocol). The function of DBF is to temporarily store the data involved in a transaction: when data is read from the LLC or arrives on RXDAT, it is written into DBF; when data needs to be written to the LLC or sent on TXDAT, DBF is read. For details of the data exchange, see the Cache Pipeline document.
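As a small illustration of the Byte Enable handling mentioned above, the sketch below merges an incoming data beat into a DBF entry under a byte-enable mask; the class and field names are assumptions.

```python
# Illustrative only: one DBF entry, paired one-to-one with an MSHR entry.
# Incoming beats from RXDAT or the LLC are merged under the BE mask.
class DBFEntry:
    def __init__(self, line_bytes=64):
        self.data = bytearray(line_bytes)
        self.be = [False] * line_bytes          # which bytes currently hold valid data

    def merge(self, offset: int, payload: bytes, be_mask):
        """Write only the bytes whose byte-enable bit is set."""
        for i, (byte, enabled) in enumerate(zip(payload, be_mask)):
            if enabled:
                self.data[offset + i] = byte
                self.be[offset + i] = True
```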
(4) SRAM
The HN-F's SRAM stores the L3 data and state as well as the SF content, supporting two major functions of CPL:
1. A coherent L3 cache
2. Reducing the number of Snp messages sent
The first function relies on the Data SRAM to store data, on the Tag SRAM to maintain the mapping between a cache line and a physical address together with the line's state in L3, and on the LRU SRAM to record the access history of the different ways in order to choose the replacement way. The second function relies on the SF SRAM: each of its entries describes the set of possible states, across the RN-Fs, of the data at the corresponding physical address. If an address hits neither the SF SRAM nor the BIQ, the data at that address can be assumed not to be cached in any RN-F.
The SRAM used in the HN-F design is single-ported, and a read or write completes in a single access: a read request enters the SRAM at the start of T0 and the read data is available at the start of T1; a write request and its data enter the SRAM at the start of T0 and the written data can be read back from the start of T1.
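The following sketch illustrates how the Tag/LRU SRAM contents could be used in a set-associative lookup. The geometry (64 B lines, 16 ways, 8192 sets, i.e. 8 MB) and all names are assumptions chosen only to make the example concrete; they are not the implemented SRAM organization.

```python
# Illustrative set-associative lookup over the Tag/LRU SRAM contents.
# Geometry (64 B lines, 16 ways, 8192 sets = 8 MB) and all names are assumed.
LINE_BYTES, WAYS, SETS = 64, 16, 8192

def split(addr: int):
    offset = addr % LINE_BYTES
    set_idx = (addr // LINE_BYTES) % SETS
    tag = addr // (LINE_BYTES * SETS)
    return tag, set_idx, offset

def llc_lookup(addr, tag_sram, lru_sram):
    """tag_sram[set][way] -> (valid, tag, state); lru_sram[set] -> ways ordered
    from least to most recently used."""
    tag, set_idx, _ = split(addr)
    for way in range(WAYS):
        valid, stored_tag, state = tag_sram[set_idx][way]
        if valid and stored_tag == tag:
            return ("hit", way, state)          # Data SRAM would be read at (set_idx, way)
    victim = lru_sram[set_idx][0]               # LRU way becomes the replacement candidate
    return ("miss", victim, "I")
```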
(5) Cache Pipeline
CPL (Cache Pipeline) is responsible for updating the L3 data and state and the SF. CPL is a multi-stage, non-blocking pipeline that accepts one request from MSHR every cycle and returns the result of an earlier request to MSHR. CPL accepts from MSHR only the small number of message fields and control bits that it needs, and returns to MSHR the L3 state MSHR requires together with the targets and number of Snp messages to be sent.
CPL connects all SRAM ports except the Data SRAM data port and is responsible for reading and writing these SRAMs, as well as calculating new states.
CPL uses a hazard mechanism to ensure that updates of the cache block state and of the SF are atomic. Because there is an interval between the read and the write of the SRAM, another request may arrive after a cache line has been read but before it is written back, and that request may want to modify the same set and way as the previous one. CPL's hazard mechanism prevents the later request from reading outdated information by marking it for retry.
Because of silent eviction, CPL also includes a BIQ (Back Invalidation Queue). When the SF is full and a miss forces a replacement, the BIQ records the address of the evicted SF entry, since the SF has been updated but the corresponding RN-F copies have not actually been invalidated.
In this way, when a request that requires snooping targets an address evicted from the SF and hits the BIQ, the snoop is multicast, which avoids missing an RN-F that still caches the address.
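The two mechanisms above can be summarized with the following illustrative sketch: a hazard check that bounces a request back to MSHR for retry when it touches a set/way still between its SRAM read and write, and a snoop-target decision that falls back to multicast on a BIQ hit. Names and data structures are assumptions.

```python
# Illustrative sketches of the two mechanisms above (names are assumptions).
def cpl_accept(req, in_flight):
    """in_flight: set of (set_idx, way) pairs that have been read from SRAM
    but not yet written back; touching one of them would observe stale state."""
    key = (req["set"], req["way"])
    if key in in_flight:
        return "retry"                          # bounce the request back to MSHR
    in_flight.add(key)
    return "accept"                             # removed again once the write-back completes

def snoop_targets(addr, sf_lookup, biq_lookup, all_rnf):
    sharers = sf_lookup(addr)                   # RN-Fs recorded in the SF, or None on miss
    if sharers is not None:
        return sharers                          # snoop only the recorded sharers
    if biq_lookup(addr):
        return set(all_rnf)                     # SF entry evicted but RN-Fs not yet
                                                # invalidated: multicast to be safe
    return set()                                # miss in SF and BIQ: no RN-F holds the line
```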
3 CPU Core
3.1 Core Pipeline Stages and Functions
The CPU core implements a 12-stage, 3-issue, 64-bit high-performance processor architecture. The following figure shows the pipeline stages of the processor.
3.2 Floorplan of Core
Die Floorplan (This figure contains 2 banks)
VC means V-Cache in the figure.
Red represents a cluster (approximately 4 mm² @12nm). Orange represents NoC nodes.
For each NoC node, choose one of the following three attachment options:
- 2 clusters (4 cores in total)
- 1 DDR + 1 PCIe
- 1 DDR only (PCIe vacancy on the other side)
Each node is connected only to its nearest IP cores. The total 32-core CPU area is about 100-130 mm² @12nm.
3.3 Memory Management Unit (MMU)
• Sv39 virtual memory system supported (address split sketched after this list).
• 32/17-entry fully associative I-uTLB/D-uTLB.
• 2048-entry 4-way set-associative shared TLB.
• Hardware page table walker.
• Virtual memory support for the full address space, with simple hardware for fast address translation.
• Code/data sharing.
• Support for full-featured OS such as Linux.
• XMAE (XuanTie Memory Attributes Extension) technology extends page table entries for additional attributes.
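As a concrete illustration of the Sv39 bullet above, the sketch below splits a virtual address into its three 9-bit VPN fields and 12-bit page offset, as defined by the RISC-V privileged specification; PTE decoding and the hardware walker details are omitted.

```python
# Sv39 split per the RISC-V privileged specification: 39-bit VA =
# {VPN[2](9b), VPN[1](9b), VPN[0](9b), page offset(12b)}.
def sv39_split(va: int):
    va &= (1 << 39) - 1                          # keep the low 39 bits
    offset = va & 0xFFF
    vpn = [(va >> (12 + 9 * level)) & 0x1FF for level in range(3)]   # VPN[0..2]
    return vpn, offset

# The hardware page-table walker reads one PTE per level, indexed by VPN[2],
# then VPN[1], then VPN[0]; a uTLB or shared-TLB hit skips the walk entirely.
vpn, offset = sv39_split(0x12_3456_7ABC)
```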
3.4 Platform-Level Interrupt Controller (PLIC)
• Support for multi-core interrupt control (claim arbitration sketched after this list).
• Up to 1023 PLIC interrupt sources.
• Up to 32 PLIC interrupt priority levels.
• Up to 8 PLIC interrupt targets.
• Selectable edge trigger or level trigger.
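The following sketch illustrates the standard RISC-V PLIC claim rule implied by the priority and target features above: the highest-priority pending and enabled source wins, ties go to the lowest source ID, priority 0 never interrupts, and ID 0 means "no interrupt". The data structures are assumptions; the register layout and threshold gating are not modeled.

```python
# Illustrative model of PLIC claim arbitration (data structures are assumed).
# Convention per the RISC-V PLIC spec: highest priority wins, ties go to the
# lowest source ID, priority 0 means "never interrupt", ID 0 means "none".
def plic_claim(pending, enabled, priority):
    """pending/enabled: source_id -> bool; priority: source_id -> 0..31."""
    candidates = [s for s in pending
                  if pending[s] and enabled.get(s, False) and priority.get(s, 0) > 0]
    if not candidates:
        return 0
    return min(candidates, key=lambda s: (-priority[s], s))
```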
3.5 Floating-Point Unit (FPU)
• RISC-V F and D extensions
• Support half/single/double precision
• Fully IEEE-754 compliant.
• Does not generate floating-point exceptions.
• User configurable rounding modes.
3.6 Interfaces
• Master AXI (M-AXI)
• DCP (S-AXI)
• Debug (JTAG)
• Interrupts
• Low power control
4 Memory Locality Hierarchy
Each CPU core has its own I-Cache and D-Cache. Two cores share one L2 cache. Data coherence among
multiple cores is maintained by hardware.
4.1 Memory Hierarchy
The L1 instruction memory system has the following key features:
• VIPT, two-way set-associative instruction cache.
• Fixed cache line length of 64 bytes.
• 128-bit read interface from the L2 memory system.
The L1 data memory system has the following features:
• PIPT, two-way set associative L1 data cache.
• Fixed cache line length of 64 bytes.
• 128-bit read interface from the L2 memory system.
• Up to 128-bit read data paths from the data L1 memory system to the data path.
• Up to 128-bit write data path from the data path to the L1 memory system.
The L2 Cache has the following features:
• Configurable size of 256KB, 512KB, 1MB, 2MB, 4MB, or 8MB.
• PIPT, 16-way set-associative structure.
• Fixed line length of 64 bytes.
• Optional ECC protection.
• Support data prefetch.
4.2 L1 I-Cache
The L1 I-Cache provides the following features:
• Cache size: 64 KB, with a cache line size of 64 bytes, 2-way set-associative;
• Virtually indexed, physically tagged (VIPT) (index/tag split sketched after this list);
• Data width for access: 128 bits;
• First-in, first-out (FIFO) replacement policy;
• Invalidation by I-Cache or cache line supported;
• Instruction prefetch supported;
• Way prediction supported;
• D-Cache snooping after a request misses the I-Cache (this feature can be enabled or disabled).
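As an illustration of the VIPT organization above, the sketch below derives the geometry of a 64 KB, 2-way, 64 B-line cache and the resulting index/tag split. Because each way (32 KB) is larger than a 4 KB page, the virtual index overlaps the virtual page number; the tag choice shown here is one reasonable way to handle that and is an assumption, not the implemented scheme.

```python
# Illustrative index/tag split for a 64 KB, 2-way, 64 B-line VIPT cache.
LINE_B, WAYS, SIZE_B, PAGE_B = 64, 2, 64 * 1024, 4096
SETS = SIZE_B // (WAYS * LINE_B)                 # 512 sets -> 9 index bits

def set_index(vaddr: int) -> int:
    """Virtual index, VA bits [14:6]. Each way is 32 KB > 4 KB page, so the
    index overlaps the virtual page number (the classic VIPT aliasing case,
    one reason way prediction and the D-Cache snoop path matter)."""
    return (vaddr >> 6) & (SETS - 1)

def tag(paddr: int) -> int:
    """Physical tag, stored from PA bit 12 upward so that lines differing only
    in PA bits [14:12] are still distinguished (one reasonable choice)."""
    return paddr >> 12
```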
4.3 L1 D-Cache
The L1 D-Cache provides the following features:
• Cache size: 64 KB, with a cache line size of 64 bytes, 2-way set-associative;
• Physically indexed, physically tagged (PIPT);
• Maximum data width per read access: 128 bits, supporting byte, halfword, word, doubleword, and quadword access;
• Maximum data width per write access: 256 bits, supporting writes to any combination of bytes;
• Write policies: write-back with write-allocate, and write-back with write-no-allocate;
• First-in, first-out (FIFO) replacement policy;
• Invalidation and clearing by D-Cache or cache line supported;
• Multi-channel data prefetch.
4.4 L2 Cache
4.5 L3 Cache (V-Cache)
4.6 Cache Coherence
For requests with shareable and cacheable page attributes, data coherence between L1 D-Caches of different cores is maintained by hardware. For requests with non-shareable and cacheable page attributes, the CPU does not maintain data coherence between L1 D-Caches. If non-shareable and cacheable pages need to be shared across cores, data coherence must be maintained by software.
The cluster maintains data coherence between the L1 D-Caches of different cores based on the MESI protocol. MESI denotes the four possible states of each cache line in a D-Cache (a simplified transition sketch follows the list):
• M: indicates that the cache line is available only in this D-Cache and has been modified (UniqueDirty).
• E: indicates that the cache line is available only in this D-Cache and has not been modified (UniqueClean).
• S: indicates that the cache line may be available in multiple D-Caches and has not been modified (SharedClean).
• I: indicates that the cache line is not available in this D-Cache (Invalid).
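A minimal textbook sketch of the MESI transitions behind the states listed above follows; the event names are assumptions, and the cluster's actual CHI-based transition behavior is richer than what is shown here.

```python
# Minimal textbook MESI sketch for one line in one D-Cache; event names are
# assumptions and the real CHI-based behavior has more cases than shown here.
def mesi_next(state: str, event: str) -> str:
    table = {
        ("I", "local_read"):        "S",   # fetched; other copies may exist
        ("I", "local_read_unique"): "E",   # fetched and no other copy reported
        ("I", "local_write"):       "M",   # fetch with ownership, then modify
        ("E", "local_write"):       "M",   # silent upgrade: already unique
        ("S", "local_write"):       "M",   # other sharers invalidated first
        ("M", "snoop_read"):        "S",   # supply dirty data, keep a clean copy
        ("E", "snoop_read"):        "S",
        ("M", "snoop_invalidate"):  "I",
        ("E", "snoop_invalidate"):  "I",
        ("S", "snoop_invalidate"):  "I",
    }
    return table.get((state, event), state)   # unlisted combinations keep their state
```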
5 PAD Estimation
6 3D Structure (Package)
6.1 Structure Diagram
6.2 Thermal Consideration
TBD
7.1 Architecture
SoC partitioning is an exploration that decomposes and reconstructs the original chip architecture, extending from the original x and y axes into the z direction. It opens up new design possibilities, improves system performance, expands the design into a larger space, and reduces the design cost of the SoC itself while improving yield.
First, the SoC design (usually a netlist) is partitioned into multiple small dies for modularization, laying the foundation for the subsequent design steps. Each die is designed as an independent chiplet, allowing flexible floorplanning and resource optimization. Subsequently, by adjusting the cost coefficients of the objective function (design overhead), a new round of iteration can be performed, and the optimized layout is gradually completed as the design overhead converges.
Chiplet modeling is a core step in system-level planning. The tool models each partitioned die as an independent chiplet module to ensure design repeatability and scalability. Each die can be physically planned and displayed as an IP in the stacked design.
After system planning, physical design and testing can be integrated for collaborative design, and signal, power, power consumption, and timing analysis can be performed across Die levels.
Another indispensable component of chiplet architecture design is the fabrication cost of the new system. This involves iterative convergence against the design metrics used in partitioning, floorplanning, routing, and optimization, and ultimately has to fit the manufacturing costs, including wafer cost, packaging cost, bonding cost, design-for-test cost, and so on.
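One way to make the iterative cost convergence concrete is a weighted objective over the cost terms named above; the sketch below is an illustrative assumption, not the actual cost model of any tool, and all weights and field names are made up.

```python
# Illustrative weighted cost objective over the cost terms named above; all
# weights and field names are made-up assumptions, not a real tool's model.
def system_cost(dies, cross_die_nets, w):
    wafer     = sum(d["area_mm2"] * w["cost_per_mm2"] for d in dies)
    packaging = w["package_base"] + w["per_die_assembly"] * len(dies)
    bonding   = w["per_bump"] * sum(d["bump_count"] for d in dies)
    test      = w["per_die_test"] * len(dies) + w["per_net_test"] * cross_die_nets
    routing   = w["per_net_route"] * cross_die_nets
    return wafer + packaging + bonding + test + routing

# A partitioning loop would re-cut the netlist, re-evaluate system_cost(), and
# stop once successive iterations change the cost by less than a tolerance.
```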
7.2 Front End and Verification
7.3 Raw Backend
Floorplanning is responsible for optimizing the layout of all chiplets in 2.5D/3D integrated circuits, ensuring reasonable resource allocation, and preparing for subsequent routing and simulation.
A multi-chip integrated system is a hybrid integration of multiple homogeneous or heterogeneous dies at the package level. Compared with traditional chip integration, its quality-assurance and test requirements differ greatly. Without testability and fault-tolerant design, design and manufacturing problems in the large number of bump interconnects and TSVs can become latent risks that undermine system stability and quality. Therefore, 3D DFT built around the interconnect is particularly critical.
In the early stage of system planning, DFT and FT (fault tolerance) design resources are planned: the hardware and interconnect resources required for test and fault tolerance are allocated during partitioning and system physical planning, completing the design preparation for 3D system stability and integrity as well as coordinated thermal and stress management.
After obtaining a three-dimensional stacked floorplan with test completeness, interconnect checking, routing, and optimization can be carried out to quickly complete a preliminary system structure. Designers can then further evaluate how to realize the desired SoC architecture based on the multiple generated structures.
For bump interconnect planning, the physical connection relationships are checked for consistency against the logical connection relationships, flagging problems such as bump misalignment or incorrect bump connections.
After the bump interconnect check, pre-routing and optimization can begin immediately. The tool performs global and detailed routing on the stacked structure to ensure that the signal connections between chiplets meet electrical requirements, and iteratively optimizes the routing results automatically.
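A minimal sketch of the physical-versus-logical consistency check described above: each bump's logical net is compared with the net and position of the facing bump, flagging net mismatches and misalignment. All input structures and the alignment tolerance are assumptions.

```python
# Illustrative physical-vs-logical bump consistency check; all inputs and the
# alignment tolerance are assumptions.
def check_bumps(logical_net, placed_xy, facing_net, facing_xy, tol_um=5.0):
    """logical_net/facing_net: bump -> net name on each die;
    placed_xy/facing_xy: bump -> (x, y) position in micrometres."""
    problems = []
    for bump, net in logical_net.items():
        if bump not in facing_net or bump not in facing_xy:
            problems.append((bump, "no facing bump"))
            continue
        if facing_net[bump] != net:
            problems.append((bump, f"net mismatch: {net} vs {facing_net[bump]}"))
        dx = placed_xy[bump][0] - facing_xy[bump][0]
        dy = placed_xy[bump][1] - facing_xy[bump][1]
        if max(abs(dx), abs(dy)) > tol_um:
            problems.append((bump, f"misaligned by ({dx:.1f}, {dy:.1f}) um"))
    return problems                              # empty list: physical and logical views agree
```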
7.4 Backend and Package Iteration
7.5 Verification
After completing the system-level planning, we enter an early analysis of system performance, which involves multi-level co-design and simulation.
In the preliminary planning of the multi-chip integrated system, the interconnect routing still needs to be checked for robustness against factors such as manufacturing process variation with respect to the final required performance, especially in high-bandwidth and high-power scenarios. In the early system analysis, the tool extracts parasitic parameters from the system model, especially for power and signal lines that cross dies, to check the overall routing constraints and ensure the integrity and reliability of the structure.
Note: some figures are copied from the web.