Hardware Architect Freedoms

  • The most basic way to improve performance is memory. Accelerator Engines have either
    • 512Mb of memory with a tRC of 2.67ns, or
    • 1Gb of memory with a tRC of 2.67ns
    • Either option easily replaces many QDR devices
  • Signal integrity on the board
    • The Accelerator Engine I/O is implemented using SerDes
    • The GCI protocol allows as few as 4 lanes to be used
    • Two independent 8-lane ports
    • The two GCI ports allow the device to be used as a dual-port memory
    • Board layout of our Gbps-speed serial I/O is considerably more reliable than routing the many QDR or DRAM signals that run at 350MHz or greater
  • Simplifying the RTL by moving algorithms or functions into the Accelerator Engine

Software Architect Freedoms

  • Add 512Mb or 1Gb of memory (QDR replacement)
    • To eliminate swapping data with smaller memory
    • To provide fast access to common tables
  • Base accelerator engines have
    • Fixed Burst READ or WRITE functions
      • These allow one function call to execute multiple READS and WRITES
    • Fixed RMW functions
      • These allow a single RMW function call to execute a READ, a specified MODIFY, then WRITE
      • Atomic operation can also be maintained
    • Using a PHE (Programmable HyperSpeed Engine), with its 32 onboard RISC core processors and additional memory, makes it possible to move algorithms or functions into the Accelerator Engine
    • Move algorithms or functions into the Accelerator Engine
      • Complex algorithms or functions that use considerable RTL/Resources
      • Time consuming tasks
      • Repetitive tasks
    • Parallel processing
      • The PHE Accelerator Engine has 32 RISC core processors
      • Up to 8 threads per processor (256 threads in total)
      • Install many copies of an algorithm or function for parallel processing and let the PHE assign each task to a processor
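The fixed Burst and RMW functions described above can be illustrated with a small host-side model. This is only a sketch: the class name, method names, and word-addressed layout are hypothetical and are not the vendor's API, and a software lock stands in for the atomicity the device is described as maintaining.

```python
import threading
from typing import Callable, List


class AcceleratorMemoryModel:
    """Toy model of burst and RMW commands (hypothetical API, not the vendor's)."""

    def __init__(self, words: int) -> None:
        self._mem = [0] * words
        # The lock stands in for the atomicity the device is described as providing.
        self._lock = threading.Lock()

    def _check(self, addr: int, count: int) -> None:
        if addr < 0 or count < 0 or addr + count > len(self._mem):
            raise IndexError("access outside modeled memory")

    def burst_write(self, addr: int, data: List[int]) -> None:
        """Write len(data) consecutive words with one call."""
        self._check(addr, len(data))
        with self._lock:
            self._mem[addr:addr + len(data)] = data

    def burst_read(self, addr: int, count: int) -> List[int]:
        """Read count consecutive words with one call."""
        self._check(addr, count)
        with self._lock:
            return self._mem[addr:addr + count]

    def rmw(self, addr: int, modify: Callable[[int], int]) -> int:
        """READ a word, apply the given MODIFY, WRITE it back; return the old value."""
        self._check(addr, 1)
        with self._lock:
            old = self._mem[addr]
            self._mem[addr] = modify(old)
            return old


def hammer_counter(mem: AcceleratorMemoryModel, addr: int, times: int) -> None:
    """Increment one counter word repeatedly using single RMW calls."""
    for _ in range(times):
        mem.rmw(addr, lambda v: v + 1)


mem = AcceleratorMemoryModel(words=16)
mem.burst_write(0, [10, 20, 30, 40])  # one call, four WRITES

# Four concurrent clients bump the same counter; no update is lost.
workers = [threading.Thread(target=hammer_counter, args=(mem, 8, 1000)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(mem.burst_read(0, 4), mem.burst_read(8, 1))
```

Because each RMW call performs its read, modify, and write as one indivisible step, the counter ends at exactly the number of increments issued, which is the property that makes RMW attractive for statistics and counters.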

Option 1: Simple QDR Replacement – Increase Memory AND Simplify Board Signal Routing and Integrity

1. High-Speed Serial Protocol I/O Interface

Our 16 SerDes lanes can each transmit data at up to 12.5Gbps, with an optional rate of 10Gbps. MoSys’ GigaChip Interface (GCI) delivers full-duplex, CRC-protected data throughput, enabling up to 10 billion memory transactions per second on as few as 16 signals.

Traditional memory design requires many interface pins (in some cases thousands), making signal routing and integrity a design challenge.

Each Accelerator Engine has two completely independent 8-lane I/O ports that allow simultaneous memory access operations.
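As a rough check on the lane rates quoted above, the aggregate raw line rate in one direction is simply lanes times per-lane rate. The sketch below assumes the quoted rates are per lane and ignores line encoding, CRC, and protocol overhead, so usable throughput will be lower; the function name is illustrative only.

```python
def raw_bandwidth_gbps(lanes: int, lane_rate_gbps: float) -> float:
    """Aggregate raw line rate in one direction, before any protocol overhead."""
    if lanes <= 0 or lane_rate_gbps <= 0:
        raise ValueError("lanes and lane rate must be positive")
    return lanes * lane_rate_gbps


# Full 16-lane configuration at the two quoted rates, and the 4-lane minimum.
for lanes, rate in [(16, 12.5), (16, 10.0), (4, 12.5)]:
    print(f"{lanes} lanes x {rate} Gbps = {raw_bandwidth_gbps(lanes, rate)} Gbps raw per direction")
```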

Device   Memory   tRC      Latency
BE2      512Mb    2.67ns   6ns
BE3      1Gb      2.67ns   ~25ns
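The tRC figure quoted for both devices can be turned into an access rate. The sketch below assumes tRC has its usual meaning, the minimum time between successive random accesses to the same bank, so its reciprocal is a per-bank ceiling rather than a device-wide throughput number.

```python
def accesses_per_second_millions(trc_ns: float) -> float:
    """Upper bound on back-to-back random accesses to one bank, in millions per second."""
    if trc_ns <= 0:
        raise ValueError("tRC must be positive")
    # 1 / (trc_ns * 1e-9) accesses per second, expressed in millions.
    return 1e3 / trc_ns


rate = accesses_per_second_millions(2.67)
print(f"tRC of 2.67ns allows about {rate:.1f} million same-bank accesses per second")
```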

KEEP IT SIMPLE 
BUT
MAKE IT RUN FAST!

  • Serial I/O
    • Two full-duplex ports of up to 8 SerDes lanes each
      • SerDes capable of running at 10Gbps to 25Gbps
    • Can operate with as few as 4 lanes
  • Base Acceleration Engines include
    • Fixed Burst READ and WRITE functions
    • Fixed RMW Function

Option 2: Dual Port Memory

  • Each of the two 8-lane I/O ports can operate independently
    • This allows two hosts to share the device's memory resources
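The dual-port idea can be pictured as two independent handles onto one shared memory array. The following is a minimal conceptual model only; the class and method names are hypothetical, and it says nothing about how the device arbitrates simultaneous accesses.

```python
class DualPortMemory:
    """One memory array reachable through two independent ports (conceptual model)."""

    def __init__(self, words: int) -> None:
        self._mem = [0] * words

    def port(self, port_id: int) -> "Port":
        if port_id not in (0, 1):
            raise ValueError("the modeled device has exactly two ports: 0 and 1")
        return Port(self, port_id)


class Port:
    """Handle used by one host; both handles see the same underlying words."""

    def __init__(self, memory: DualPortMemory, port_id: int) -> None:
        self._memory = memory
        self.port_id = port_id

    def write(self, addr: int, value: int) -> None:
        self._memory._mem[addr] = value

    def read(self, addr: int) -> int:
        return self._memory._mem[addr]


shared = DualPortMemory(words=8)
producer = shared.port(0)  # e.g. one FPGA writing table entries
consumer = shared.port(1)  # e.g. a second device reading them

producer.write(3, 0xBEEF)
print(hex(consumer.read(3)))
```

Data written through one port is immediately visible through the other, which is what lets two devices exchange tables or buffers without copying them across a separate link.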

Option 3: Pipelining Data


Option 4: Accelerating FPGA Performance Using BLAZAR Accelerator Engines

Step 1: Identify FPGA Functions to Offload to the Accelerator Engine

  • Simplify software using the included fixed BURST and RMW functions
    • BURST READ or WRITE of multiple locations in a single function call
    • RMW READ/MODIFY/WRITE in a single function call
      • Statistics/counters
      • Atomic operations can be assured
  • FPGA tasks that would execute faster on the PHE's 32 RISC cores
    • Simplify
      • RTL by moving functions into the PHE
      • Provides flexibility for the System Architect to sort tasks between hardware RTL and software tasks
    • Move complex algorithms/functions
      • TCAM
      • Prefix matching
      • Data analysis
      • Computational functions
      • Analytical functions
    • General tasks
      • Time consuming tasks
      • Repetitive tasks
      • High RTL usage tasks
    • Speed increase using
      • Parallel processing (32 cores)
      • 256 threads
      • Use the engine scheduler to optimize execution
        • Install multiple copies of the same algorithm or function and the scheduler will find available processors
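The scheduling idea in the last bullets can be sketched as a toy dispatcher over 32 cores with 8 thread slots each. The source does not describe the PHE scheduler's actual policy, so the least-loaded-core rule and all names below are assumptions made purely for illustration.

```python
from typing import List, Optional

CORES = 32            # RISC cores in the PHE, per the text
THREADS_PER_CORE = 8  # threads per core, per the text


class ToyScheduler:
    """Assigns tasks to the least-loaded core (assumed policy, not the device's)."""

    def __init__(self, cores: int = CORES, threads_per_core: int = THREADS_PER_CORE) -> None:
        self.threads_per_core = threads_per_core
        self.busy: List[int] = [0] * cores  # thread slots in use on each core

    def total_threads(self) -> int:
        return len(self.busy) * self.threads_per_core

    def dispatch(self) -> Optional[int]:
        """Place one task; return the chosen core index, or None if every slot is busy."""
        core = min(range(len(self.busy)), key=self.busy.__getitem__)
        if self.busy[core] >= self.threads_per_core:
            return None
        self.busy[core] += 1
        return core

    def complete(self, core: int) -> None:
        """Free one thread slot on the given core."""
        if self.busy[core] == 0:
            raise ValueError("no running task on that core")
        self.busy[core] -= 1


sched = ToyScheduler()
first_wave = [sched.dispatch() for _ in range(CORES)]  # one task lands on each core
rest = [sched.dispatch() for _ in range(sched.total_threads() - CORES)]
overflow = sched.dispatch()                            # all 256 slots are now taken
print(sched.total_threads(), overflow)
```

With many copies of the same function installed, the caller does not pick a processor; it submits work and the scheduler places it wherever a thread slot is free.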

Step 2: Identified Functions for Offloading

There are Multiple Reasons to Move Functions out of the FPGA
  • Functions Run Faster
    • Multiple copies can be installed and executed in parallel
    • Execution priority can be set
    • System flexibility
    • Saves the cost of an ASIC
  • Simplify RTL
    • Combine multiple functions into one user-defined higher-level function
    • Define user functions that cannot be implemented in RTL or that would not execute fast enough there
  • Save FPGA Space
    • Free up resources to “Do More”
    • Simplify RTL by moving common or frequently called functions into PHE

Step 3: Do More…Achieve HyperSpeed!
