The Curious Case of the Memory Bottleneck
Tuesday, October 1, 2019
By Michael Miller, CTO, MoSys, Inc.
I was recently reading a very interesting article about solving the memory bottleneck (link) and was glad to see this topic take center stage in the tech world. The coverage makes it clear that simply moving data between storage and a processor is reaching a breaking point as far as execution time is concerned.
With advances in technology, it is becoming more compelling to move processing elements closer to, or even into, the memory without paying a cost penalty for replicating those elements.
Performance scaling has also become key to remaining competitive while addressing the growing requirements being placed on the network. Software must now be transferable across multiple hardware environments in order to be cost-effective and to provide the flexibility needed to meet changing performance demands.
This creates a challenge for the industry: how far can the boundary be pushed in relocating compute resources so that they are tightly coupled to memory resources? Since the silicon processes and cell designs used to manufacture memory are highly optimized, putting processing into the memory cell structure is very challenging and costly. This is especially true as we continue scaling from one silicon process node to the next. Silicon fabrication companies such as TSMC and Samsung have typically pushed back on anything other than classic DRAM or SRAM memory cell designs and processes. Combining logic and memory on a single die in a memory-cell process adds a level of complexity to the silicon process. Adding large amounts of memory to a logic process may well be the next step.
An alternative approach for the industry is to move the processing to the edge of the memory array, which provides an acceptable trade-off. This maintains the separation of memory and logic cell designs while allowing designers to overcome the disadvantage of going off-die to access memory: wide, short buses (the most common example being HBM) regain some of the performance lost in the trade-off.
For example, at MoSys we have integrated RISC engine cores on a single die with a large block of memory. This puts processing right at the edge of the memory array (or in MoSys's case, in the center of the die), thereby minimizing latency and maximizing random-access throughput. In essence, MoSys is offering a way to move the processing of algorithms to the memory. For MoSys, the focus is on adding as much memory on board as possible, the inverse of multi-core CPUs, which seek to maximize logic. Of course, over time the caches have come to dominate those dies. But that just perpetuates moving data from DRAM to the processor, which is where we started this conversation. With the multi-core CPU and cache model, the system also maintains multiple copies of the data, which is another set of headaches to be avoided if possible, and another story for another blog.
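To make that access pattern concrete, here is a minimal C sketch (illustrative only, not MoSys code) of the kind of workload that defeats the cache model: a dependent random walk over a table far larger than any cache, where each load address depends on the previous load's value.

```c
#include <stdint.h>
#include <stdlib.h>

#define TABLE_ENTRIES (64u * 1024u * 1024u)  /* ~256 MB of uint32_t: far larger than any cache */

/* Dependent random walk: each load's address depends on the previous
 * load's value, so the CPU cannot prefetch and every hop costs
 * roughly one DRAM round trip. */
static uint32_t chase(const uint32_t *table, uint32_t start, int hops)
{
    uint32_t idx = start % TABLE_ENTRIES;
    for (int i = 0; i < hops; i++)
        idx = table[idx] % TABLE_ENTRIES;
    return idx;
}

int main(void)
{
    uint32_t *table = malloc(TABLE_ENTRIES * sizeof *table);
    if (!table)
        return 1;
    for (uint32_t i = 0; i < TABLE_ENTRIES; i++)
        table[i] = (uint32_t)rand();            /* arbitrary successor indices */
    uint32_t last = chase(table, 0, 1000000);
    free(table);
    return (int)(last & 1);                     /* keep the result live */
}
```

Because no amount of cache capacity helps here, the only levers left are memory latency and random-access rate, which is exactly where near-memory compute aims.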
The PHE (Programmable HyperSpeed Engine) is the latest semiconductor device offering from MoSys. The PHE comprises 1Gb of high-speed memory and 32 multi-threaded RISC cores, all on a single chip. The 1Gb of memory is intended to be the primary repository of data, not a cache. The RISC cores also have access to 2Mb of scratch-pad SRAM. The goal is to allow customers to "offload" a high-frequency, repetitive task to the PHE, which has the benefit of supporting upwards of 20B memory transactions per second. Its purpose is to perform the task faster than other solutions by virtue of its high random-access memory rate and highly optimized Instruction Set Architecture. It is designed to be embedded in the core of a system in order to provide a net boost to the system's overall performance. In this regard it is a pure accelerator engine: dedicated hardware designed to perform a repetitive task very efficiently by supporting very efficient random memory accesses.
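To illustrate the offload model, here is a minimal host-side sketch. Every name in it (phe_dev, phe_load_table, phe_lookup_batch) is hypothetical, invented for this example and stubbed in host memory so it compiles; it is not the actual MoSys API, only the division of labor it implies: the data set lives in the engine's memory, and the repetitive task runs next to it.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical offload interface: all names are invented for this
 * sketch and stubbed so it compiles; real hardware would place the
 * table in on-chip memory and run the loop on the RISC cores. */
typedef struct {
    const uint32_t *table;   /* stand-in for the 1Gb on-chip repository */
    size_t entries;
} phe_dev;

static int phe_load_table(phe_dev *d, const uint32_t *tbl, size_t entries)
{
    d->table = tbl;
    d->entries = entries;
    return 0;
}

/* The repetitive task the cores would run next to the memory:
 * one random access per key, no cache hierarchy in the way. */
static void phe_lookup_batch(const phe_dev *d, const uint64_t *keys,
                             uint32_t *results, size_t n)
{
    for (size_t i = 0; i < n; i++)
        results[i] = d->table[keys[i] % d->entries];
}

int main(void)
{
    static uint32_t table[1024];
    for (size_t i = 0; i < 1024; i++)
        table[i] = (uint32_t)(i * 7u);

    phe_dev dev;
    phe_load_table(&dev, table, 1024);

    uint64_t keys[] = { 3, 99, 512, 1023 };
    uint32_t out[4];
    phe_lookup_batch(&dev, keys, out, 4);
    printf("%u %u %u %u\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```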
Together with firmware and the appropriate API support software, the PHE forms the basis of high-level embeddable accelerator engine functions. To read more about MoSys Virtual Accelerator Engine technology and application-specific platforms using the PHE, check out this link.
At MoSys, we have been developing technology that we believe will address these challenges: supporting the (embedded) acceleration or deployment of a function our customer is performing (e.g., packet filtering, packet forwarding, or data analytics) while at the same time providing the customer or user with an alternate or accelerated way of implementing it. The fundamental issue that the MoSys Accelerator Engines address is the memory access rate of systems that require true random access to data.
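To see why functions like packet filtering are bound by the random-access rate, consider this minimal C sketch: every packet triggers at least one read at an effectively random index into a large rule table. The hash, key layout, and table size are illustrative choices, not MoSys's implementation.

```c
#include <stdint.h>
#include <string.h>

typedef struct {             /* IPv4 5-tuple key */
    uint32_t src, dst;
    uint16_t sport, dport;
    uint8_t  proto;
} five_tuple;

#define RULES (1u << 20)     /* 1M-entry verdict table: too large to cache well */

/* FNV-1a: an illustrative hash choice, not MoSys's */
static uint32_t fnv1a(const void *p, size_t len)
{
    const uint8_t *b = p;
    uint32_t h = 2166136261u;
    while (len--) {
        h ^= *b++;
        h *= 16777619u;
    }
    return h;
}

/* One effectively random read per packet: the index depends on the
 * hash, so consecutive packets hit unrelated cache lines. */
static uint8_t filter_packet(const uint8_t verdicts[RULES], const five_tuple *ft)
{
    return verdicts[fnv1a(ft, sizeof *ft) & (RULES - 1)];
}

int main(void)
{
    static uint8_t verdicts[RULES];   /* 0 = drop, 1 = accept */
    five_tuple ft;
    memset(&ft, 0, sizeof ft);        /* zero padding bytes before hashing */
    ft.src = 0x0a000001; ft.dst = 0x0a000002;
    ft.sport = 1234; ft.dport = 80; ft.proto = 6;
    return filter_packet(verdicts, &ft);
}
```

At tens of millions of packets per second, the lookup rate rather than the compute sets the ceiling, which is precisely the access-rate problem an accelerator with a high random-access memory rate is meant to solve.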