Giving Your FPGA a Big Boost
Tuesday September 28, 2021
By Mark Baumann
Director, Product Definition & Applications
It is obviously advantageous to use resources (in this case, memory) that are close and localized to the source of a data request. This has been a strong forcing function for memory resources to grow in both ASICs and FPGAs. Even with on-die real estate becoming increasingly valuable, chip manufacturers are dedicating significant area to memory. Xilinx utilizes BRAM, a set of smaller blocks of RAM (approximately 36Kb each) distributed across the die, with a total resource of approximately 95Mb in a larger device. In addition, Xilinx offers uRAM, larger blocks of RAM (approximately 288Kb each) arranged in columns on the die, with a total resource of approximately 360Mb in the larger FPGA devices. Each of these resources has some restrictions with regard to routing and speed when the intent is a unified large block of memory. Intel (Altera) takes a hybrid of the Xilinx model, utilizing M20K blocks which are, as the name implies, blocks of 20Kb of RAM, again distributed across the die, with a total resource of approximately 230Mb in a larger device; this likewise carries routing and timing restrictions when the desire is a large unified block of memory.
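To put the figures above in perspective, a quick calculation shows how many individual blocks a design would have to stitch together to use the full resource as one memory. The totals and block sizes are the approximate numbers quoted above, not exact vendor specifications.

```python
# Rough block-count arithmetic for the on-die memory figures quoted above.
# All values are approximate numbers from the text, not vendor datasheets.
Kb = 1024          # bits in a kilobit
Mb = 1024 * Kb     # bits in a megabit

resources = [
    # (name, total bits in a larger device, bits per block)
    ("BRAM", 95 * Mb, 36 * Kb),
    ("uRAM", 360 * Mb, 288 * Kb),
    ("M20K", 230 * Mb, 20 * Kb),
]

for name, total, block in resources:
    blocks = total // block
    print(f"{name}: ~{blocks} blocks of {block // Kb} Kb each")
```

Aggregating thousands of small blocks (roughly 2,700 BRAMs or 11,700 M20Ks) into one unified memory is exactly what creates the routing and timing restrictions mentioned above.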
As of this writing, FPGA vendors have spent the last few years incorporating HBM (High Bandwidth Memory), using some form of interposer to connect it to the FPGA die. The HBM, however, is not part of the FPGA die; it is connected to the FPGA die internally within the package. These are large DRAM chip stacks that provide a high-density, high-bandwidth memory resource to the FPGA (or ASIC) without consuming valuable FPGA die area. The potential drawbacks of this technology are:
- Thermal considerations
- Base technology is still DRAM-based so there exists longer latency and issues of refresh
- Marginal performance when accessing the memory in a truly random pattern
- The added steps of designing, assembling and testing (yielding) multiple complex die in a single package
Let's address each of these potential drawbacks in turn.
As with any new or extremely complex technology, there tends to be a cost associated with its use. The same is true of HBM, due to its complexity and its need for leading-edge technology. A product built on this technology is predicated on stacking multiple DRAM die to keep the overall footprint of the memory as small as possible, so it can fit in a package alongside other devices. To accomplish this, vendors such as Samsung and SK Hynix utilize TSVs (through-silicon vias) to stack die on top of one another. At the same time, high-density DRAM products push to use leading-edge process technologies to achieve the density customers are requesting. The combination of these two pressures can and does strain the development process, resulting in a product that carries additional cost.
In looking at thermal considerations, it is important to realize that DRAM stores data on a capacitive storage cell. This allows for the small cell size but creates the need to REFRESH the cell, because charge leaks off the capacitive storage element. A second concern is that heat makes the charge leak off even faster, so the cell must be refreshed more frequently as the junction temperature increases. Many DRAM devices can tolerate a junction temperature of approximately 85°C and will function at 95°C to as high as 100°C with additional refresh.
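As a concrete illustration, DDR-class DRAMs typically specify an average refresh interval (tREFI) of about 7.8 µs that is halved in the extended-temperature range. The figures below are typical JEDEC-style values used as an assumption; the actual numbers for any given HBM or DRAM device come from its datasheet.

```python
# Illustrative refresh-interval math for a DDR-class DRAM.
# tREFI value is a typical JEDEC-style figure, assumed for illustration.
TREFI_US_NORMAL = 7.8125  # average refresh interval (us) up to ~85 C

def refresh_interval_us(junction_temp_c):
    """Above ~85 C the cells leak faster, so many devices require
    2x refresh (half the interval) in the extended-temperature range."""
    if junction_temp_c <= 85:
        return TREFI_US_NORMAL
    return TREFI_US_NORMAL / 2

print(refresh_interval_us(70))  # 7.8125
print(refresh_interval_us(95))  # 3.90625
```

Doubling the refresh rate means the memory spends twice as many cycles unavailable for reads and writes, which is why running HBM hot costs both bandwidth and latency.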
When you place a device that requires this additional support at higher temperatures next to a device, such as a high-end FPGA, that can dissipate 100W or several hundred watts, additional measures may be required to ensure that the HBM stacks are properly cooled and the integrity of the device maintained. The result may be a larger heatsink, additional airflow, or in an extreme case some form of liquid cooling. Each of these measures adds cost and complexity to the system design.
When looking at the design criteria of an HBM stack, the goal is to provide superior memory density with significant access bandwidth for large bulk-storage applications. This allows for large density per mm² of die area. Since the connection from the HOST (FPGA) to the HBM die stack is made using die-level layout (line width and spacing) rules, it is possible to have multiple extremely wide buses, which provide the increased bandwidth. However, an overall limitation is that the base technology remains DRAM, with good burst performance but unimpressive random-access performance, at least compared to SRAM technologies. There is also the additional limitation that the DRAM cells must be refreshed to ensure data integrity. This impacts the ability to maintain low-latency access to randomly addressed locations that may sit in a ROW that is being refreshed.
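The "extremely wide buses" point can be made concrete with back-of-the-envelope math. An HBM2-generation stack exposes a 1024-bit interface (8 channels of 128 bits); the per-pin data rate below is an assumed speed grade, and real parts vary.

```python
# Back-of-the-envelope bandwidth for one HBM stack.
# HBM2-generation figures assumed for illustration; real parts vary by grade.
bus_width_bits = 1024   # 8 channels x 128 bits per stack
pin_rate_gbps = 2.0     # data rate per pin (assumed speed grade)

bandwidth_gbytes_per_s = bus_width_bits * pin_rate_gbps / 8
print(f"~{bandwidth_gbytes_per_s:.0f} GB/s per stack")
```

That headline number is achievable only with long sequential bursts; a truly random access pattern, as the text notes, lands far below it.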
Assembly of the newer MCMs (Multi-Chip Modules) is a technically challenging process. However, with drivers like IBM, Intel, Samsung, Amkor, TSMC, etc., the technology has been steadily improving, from straight interposers to the Embedded Multi-Die Interconnect Bridge (EMIB), and the feasibility hurdle has been quickly addressed. The remaining issue, at least as of this writing, is the real cost of assembling and yielding multiple very complex die. These are not insurmountable issues, but they are complex enough that they will always have an additive effect on cost: die yield, die process, die power, etc., each negatively impact the yield of any single die, and those effects multiply as many die are assembled into a single package.
The last issue we will discuss with regards to HBM is longevity of supply.
The base structure is DRAM, and the driving force behind DRAM is to continually move to the latest process technology in order to accommodate the ever-increasing demand for more and more storage. Historically this has involved changes in device pinout, timing, power requirements, and die size, each of which will have an impact on the usability of a present-day design a couple of years down the line. This can cause concern for systems that have historically required longevity over a 10-year expected lifetime.
The hope of the vendors is that the combination of on-die SRAM-like memory (uRAM, BRAM, M20K, and eSRAM) along with the addition of HBM will address the bulk of customer needs. Certainly, the benefits of localized memory resources can be significant, as long as the available resources meet the density, speed, and access requirements of the system you are designing. This assumption holds if the size of the memory does not interfere with other resource requirements of the FPGA, and the amount of memory needed allows the RAM resource to route cleanly.
The issue MoSys has occasionally encountered with FPGA memory resources is that they are spread across the die. This is ideal for letting multiple logic functions access a nearby resource, but aggregating them into one large block requires routing traces across the die, which can cause routing issues for other blocks.
What MoSys has experienced, both by developing its own code that utilizes the FPGA memory resource and from the feedback that we have received from customers that develop for FPGAs, is that there is a real measurable impact in performance when attempting to utilize the FPGA memory resources as a large unified block. It is still possible to develop larger blocks of memory, but there are tradeoffs in doing so.
In fact, the latest product offering from MoSys is an IP called the GME (Graph Memory Engine), which is structured to be extremely efficient at walking through graph structures that reside in memory. This is useful in applications such as LPM, regular expressions, DDoS mitigation, and other algorithmic structures such as TCAMs. In each of these applications, there is significant benefit to having a large memory structure that is accessible in a random pattern. That access pattern diminishes the benefit of HBM and can force both resource and performance tradeoffs within the FPGA, because blocks of logic that benefit from having as much SRAM (random-access) memory as possible must trade it away against other uses to increase performance and throughput.
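To see why graph walking is so sensitive to random-access latency, consider a toy longest-prefix-match (LPM) lookup over a binary trie stored as a flat node table. This is an illustration of the access pattern only, not MoSys's GME design: the node layout and prefix labels below are hypothetical. Each step of the walk is a dependent, effectively random read, so the memory's random-access latency, not its burst bandwidth, sets the lookup rate.

```python
# Toy sketch of "walking a graph in memory" for longest-prefix match (LPM).
# Illustrative only -- a hypothetical 4-node binary trie, not the GME design.
# Each node is a tuple: (next_node_if_bit_0, next_node_if_bit_1, prefix_label).
nodes = [
    (1, 2, None),          # node 0: root
    (None, None, "0/1"),   # node 1: matches prefix 0
    (None, 3, None),       # node 2: saw a leading 1, no match yet
    (None, None, "11/2"),  # node 3: matches prefix 11
]

def lpm(bits):
    """Walk the trie bit by bit, remembering the longest prefix matched."""
    node, best = 0, None
    for b in bits:
        nxt = nodes[node][b]   # one dependent (random) memory read per bit
        if nxt is None:
            break
        node = nxt
        if nodes[node][2] is not None:
            best = nodes[node][2]
    return best

print(lpm([1, 1, 0]))  # 11/2
print(lpm([0, 1]))     # 0/1
```

Because each read's address depends on the previous read's result, the walk cannot be hidden behind bursting, which is exactly where SRAM-class random-access memory outperforms DRAM-based HBM.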
Results of Collaboration
Over the last few years, MoSys, along with Intel and Xilinx, has been coordinating the use of MoSys Accelerator Engine devices as complements to the FPGA. It is certainly understandable that an FPGA can't provide all the resources that every application is looking for. In a spirit of cooperation, MoSys has been working with all the FPGA vendors to ensure interoperability. When we do this, we ensure that the two devices can properly communicate over the SerDes links using the GCI protocol. This ensures that if the resources in the FPGA are insufficient for an application, a customer can take advantage of a high-reliability, low-latency memory device that utilizes the low-latency GCI protocol. It also ensures that the pairing of the two devices can extend the FPGA's resources while utilizing a minimum of FPGA logic (LUTs or ALMs) and a small footprint of routing resources on the PCB.
The devices that MoSys provides are high-speed, low-latency, high-access-rate, high-random-access-rate memory-based accelerators. The result is a truly complementary device for FPGAs and ASICs. MoSys offers a family of supporting memory devices: standalone accelerators that use a minimum of FPGA resources (LUTs, ALMs, LEs) and a minimum of I/O resources, extending the available high-speed memory of the system by up to 1Gb per device (up to 8x the density of the most popular QDR devices).
If you need resources beyond what is available on the device to implement oversubscription buffers, statistics, fast lookup tables, flow caches, or any number of functions that require high speed and low latency, the MoSys accelerators are an easy option for extending and complementing an FPGA's or ASIC's resources. Built as they are, the MoSys devices directly address all of the issues discussed above that localized resources address:
- Low Latency – Initial latency as low as 18 FPGA clock cycles with follow-on results available for every clock cycle after that
- Utilizing SerDes allows the freedom to place the memory devices anywhere on the PCB allowing for an easier routing and thermal board profiling
- Assembly – The device uses only standard single-die processing and assembly, allowing for full functional testing of each device
- Longevity – MoSys has a commitment to providing a device for the lifetime of systems that require 10-year support
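The latency bullet above describes a pipelined access pattern: the first result arrives after roughly 18 FPGA clock cycles, and with a new request issued every cycle, one result returns per cycle thereafter. A minimal sketch of that timing model, using the 18-cycle figure quoted above:

```python
# Simple model of pipelined reads: ~18-cycle initial latency (figure quoted
# in the text), then one result per clock cycle for back-to-back requests.
INITIAL_LATENCY = 18  # FPGA clock cycles

def cycles_for_reads(n):
    """Cycles until the last of n back-to-back pipelined reads returns."""
    if n <= 0:
        return 0
    return INITIAL_LATENCY + (n - 1)

print(cycles_for_reads(1))     # 18
print(cycles_for_reads(1000))  # 1017
```

For long streams of requests the amortized cost approaches one result per cycle, so the initial latency matters mainly for short, dependent lookup chains.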
The MoSys Accelerator Engine family of products is in full production release today! These devices have been shipping for close to 10 years with excellent reliability. MoSys would like to explore with you how an Accelerator Engine product can support your development. Please feel free to contact MoSys sales or applications engineering at MoSys.com to build the system that needs a few more resources than those found on your present FPGA.
It may even be possible to use a slightly less expensive FPGA and still achieve your system goals.
If you are looking for more technical information, or need to discuss your technical challenges with an expert, we are happy to help. Email us and we will arrange to have one of our technical specialists speak with you. You can also sign up for updates. Finally, please follow us on social media so we can keep in touch.