General-purpose processors play important roles in systems based on field-programmable gate arrays (FPGAs). As systems grow larger and more complex, processors run many control tasks, such as an operating system and a network software stack. Moreover, some systems require processors to run main compute kernels that are impractical to deploy on dedicated hardware. For instance, Microsoft uses a combination of hard processors in an application-specific integrated circuit (ASIC) and soft processors in an FPGA in a large FPGA-based system for datacenter services.
As FPGAs and soft processors are used in an increasingly wide variety of applications, the performance of soft processors is becoming more important. Even though soft processors are never more efficient than hard processors, they remain useful because of their high flexibility, reconfigurability, and low cost, as they can be integrated without an additional chip. In the Microsoft system, for example, one main compute kernel, which is too complex to deploy on dedicated hardware, is run by specialized soft processors.
To improve the performance of soft processors, several recent studies have focused on out-of-order (OoO) superscalar approaches, following OoO hard processor studies that showed significant performance improvement over in-order approaches. While early research on FPGA-synthesizable OoO processors was mainly for ASIC prototyping, these recent studies have targeted high-performance, resource-efficient OoO soft processors on an FPGA. In particular, they have explored or proposed FPGA-friendly microarchitectures for the reorder buffer (ROB), the rename unit, the issue queue (IQ), and the memory system. A key insight derived from these studies is that the performance and resource efficiency of OoO soft processors improve substantially with microarchitectures that leverage FPGA characteristics.
Applying this insight, we propose the RSD: a new open-source RISC-V OoO soft processor. For high performance, the RSD supports several advanced microarchitectural features, such as speculative OoO load and store execution, a memory dependence predictor (MDP), speculative scheduling, and a non-blocking cache. In our evaluations, the RSD achieved up to 2.5 times higher Dhrystone million instructions per second (MIPS) with 60% fewer registers and 64% fewer lookup tables (LUTs) than two state-of-the-art, open-source OoO processors, as summarized in Table I.
This high performance and efficiency were achieved through two novel techniques that leverage FPGA characteristics. The first technique is FPGA-friendly speculative scheduling. Speculative scheduling minimizes execute-to-use latency by issuing instructions speculatively, before the validity of their operands is determined. We observed that speculative scheduling could achieve a gain of up to 26.8% in instructions per cycle (IPC) for SPECint 2006/2017 on a software simulator. Even though this technique has generally been used for OoO hard processors, it is not well studied for OoO soft processors. Hence, this paper explores an FPGA-friendly speculative scheduling implementation that achieves a better tradeoff between performance and hardware resource overhead on an FPGA.
The second technique is an optimization for multiport-RAM-based components (e.g., the physical register file (PRF)) that significantly improves resource efficiency by using an FPGA-optimized multiport RAM. In today’s open-source OoO processors, these components are built naively from flip-flops (FFs) and logic circuits and thus constitute the dominant resource overhead. Several previous works have pointed out this issue for components such as the ROB.
TABLE I. EVALUATED OPEN-SOURCE OOO PROCESSORS.
| | RSD | OPA | BOOM |
|---|---|---|---|
| ISA | RV32IM | RV32IM w/o DIV, CSR | RV64GC / RV32IMAC |
| Ld/St Exec. | OoO Ld/St Exec. with Forwarding | OoO Load Exec. & InO Store Exec. w/o Forwarding | OoO Ld/St Exec. with Forwarding |
| Mem. Dep. Predictor | Support | N/A | N/A |
| Speculative Scheduling | Support with IQ or Replay Queue | Support with IQ | N/A |
| Memory | BRAM or DRAM | BRAM only | BRAM or DRAM |
| Interconnect | AXI4 or AHB | N/A | AXI4 or AHB |
There are open questions, however, as to (1) which components we can apply an FPGA-optimized multiport RAM to and (2) how much this technique reduces the consumption of FPGA resources in the entire OoO processor design. We thus explain which components in the RSD had this optimization applied, and we show that it saved almost half the FPGA resources.
The remainder of this paper is structured as follows. Section II discusses related work on OoO soft processors. Section III presents the RSD microarchitecture. Section IV describes the speculative scheduling mechanisms that we explored for the RSD. Section V explains the RSD components for which we applied FPGA-optimized multiport RAM. Finally, Section VI presents our evaluation results, before a brief conclusion.
RELATED WORK
There are several open-source, FPGA-synthesizable OoO processors, but most of them use an FPGA solely for ASIC prototyping or as a research/education environment, rather than being designed as an OoO processor targeting an FPGA. For example, the RISC-V BOOM processor runs on an FPGA, but it is aimed at an ASIC implementation and is thus not optimized for an FPGA.
We are currently aware of only one open-source OoO processor targeting an FPGA: the open processor architecture (OPA). Its main FPGA-specific optimization concerns the store queue (STQ). As a conventional STQ is a content-addressable memory (CAM), which is very expensive on an FPGA, the OPA eliminates the STQ completely. Although this optimization may reduce FPGA resource usage and improve operating frequency, it hurts performance because a store instruction can be executed only when it becomes the oldest instruction in the processor, and a load instruction cannot forward data from a preceding store instruction.
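To illustrate why a conventional STQ behaves like a CAM, the following Python sketch models store-to-load forwarding in software. It is a minimal illustration, not code from the RSD or the OPA: the names `StoreEntry` and `forward_from_stq` and the `seq` field are hypothetical, and the sketch assumes that a load and a store match only when their addresses are exactly equal (partial overlaps and access sizes are ignored).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    seq: int   # program-order sequence number (hypothetical field)
    addr: int  # store address
    data: int  # store data


def forward_from_stq(stq: List[StoreEntry], load_seq: int, load_addr: int) -> Optional[int]:
    """Return the data of the youngest store that is older than the load
    and writes the same address, or None if no such store exists."""
    best = None
    # In hardware, every entry is compared against the load address in
    # parallel; this per-entry comparison is what makes the STQ a CAM.
    for st in stq:
        if st.seq < load_seq and st.addr == load_addr:
            if best is None or st.seq > best.seq:
                best = st
    return None if best is None else best.data


stq = [StoreEntry(1, 0x10, 111), StoreEntry(3, 0x10, 333), StoreEntry(5, 0x10, 555)]
hit = forward_from_stq(stq, load_seq=4, load_addr=0x10)   # store 3 is the youngest older match
miss = forward_from_stq(stq, load_seq=4, load_addr=0x20)  # no matching store
```

The loop makes the cost visible: each STQ entry needs its own address comparator plus age-priority logic, which is the kind of structure that maps poorly onto FPGA LUTs.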
We compare the RSD with the BOOM and the OPA, the two open-source OoO processors mentioned above, which are optimized for an ASIC and an FPGA, respectively. Table I summarizes the supported features of these processors.
THE RSD MICROARCHITECTURE
This section introduces the RSD microarchitecture and provides background on the proposed techniques described in sections IV and V. Fig. 1 shows a block diagram of the RSD. The microarchitecture consists of three blocks: a front-end block, a scheduling block, and an execution block.
The front-end block fetches and decodes instructions from the L1 instruction cache (L1IC) in program order. The current implementation uses the gshare branch predictor. The following subsections describe the scheduling block and the execution block.
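For readers unfamiliar with gshare, the following Python sketch shows the general scheme: the branch address is XORed with a global history register to index a table of 2-bit saturating counters. This is a generic model, not the RSD implementation; the class name, the table size (`index_bits`), and the initial counter value are assumptions chosen for illustration.

```python
class GsharePredictor:
    """Generic gshare: (PC bits XOR global history) indexes 2-bit counters."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1         # assumed table size: 2**index_bits entries
        self.history = 0                          # global branch history register
        self.counters = [1] * (1 << index_bits)   # 2-bit counters, initially weakly not-taken

    def _index(self, pc):
        # Drop the two low PC bits, assuming 4-byte-aligned instructions.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not-taken.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the branch outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = GsharePredictor()
cold_prediction = predictor.predict(0x100)  # untrained counters predict not-taken
for _ in range(20):
    predictor.update(0x100, True)           # train on an always-taken branch
warm_prediction = predictor.predict(0x100)
```

Because the history is part of the index, the same branch uses different counters until the history pattern stabilizes, which is why the example trains for more iterations than the history length.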
The scheduling block extracts instruction-level parallelism (ILP) for instructions sent from the front-end block, and it issues instructions to the execution block out of program order. The scheduling block mainly consists of the rename unit, the dispatch unit, the issue queue (IQ), and the reorder buffer.
- Rename Unit: To remove false dependencies (write after write and write after read) between instructions, the rename unit renames the logical registers that an instruction uses as operands. Specifically, it renames the logical register of each destination operand to a physical register obtained from the PRF free list, and it then records that mapping in a register map table (RMT). The logical register of each source operand is renamed to the physical register most recently mapped to it, which is looked up in the RMT.
- Dispatch Unit: The dispatch unit allocates an entry for a renamed instruction in several components, such as the ROB, the IQ, the load queue (LDQ), and the store queue (STQ), depending on the instruction type. The dispatch unit stalls when any of these structures is full. The following subsections describe the details of these components and the submodules of the dispatch unit.
- Issue Queue: The IQ is responsible for issuing instructions to the execution block when all of an instruction’s source operands are available. It consists of three submodules: the wakeup logic, the select logic, and the instruction payload RAM. The wakeup logic keeps track of the readiness of each uncompleted instruction, and the select logic selects ready instructions and issues them to the execution block.
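The renaming step described under the rename unit in the list above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the RSD design: the class name `RenameUnit`, the register counts, and the identity mapping at reset are hypothetical, and the sketch omits the special handling of RISC-V register x0, the release of physical registers at commit, and recovery from misprediction.

```python
from collections import deque


class RenameUnit:
    def __init__(self, num_logical=32, num_physical=64):
        # Assumed reset state: logical register i maps to physical register i.
        self.rmt = list(range(num_logical))
        # The remaining physical registers start on the free list.
        self.free_list = deque(range(num_logical, num_physical))

    def rename(self, dst, srcs):
        """Rename one instruction; returns (physical dst, physical srcs)."""
        # Sources read the RMT before the destination is remapped, so an
        # instruction such as add x1, x1, x2 sees the previous mapping of x1.
        phys_srcs = [self.rmt[s] for s in srcs]
        if not self.free_list:
            raise RuntimeError("no free physical register; renaming would stall")
        phys_dst = self.free_list.popleft()
        self.rmt[dst] = phys_dst
        return phys_dst, phys_srcs


ru = RenameUnit()
d1, s1 = ru.rename(dst=1, srcs=[2, 3])  # x1 <- x2 op x3
d2, s2 = ru.rename(dst=1, srcs=[1, 4])  # x1 <- x1 op x4 (reads the first x1)
d3, s3 = ru.rename(dst=5, srcs=[1])     # x5 <- x1 (reads the second x1)
```

The two writes to x1 receive different physical registers, so the write-after-write dependency disappears, while each reader still sees the value produced by the most recent preceding writer.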
The instruction payload RAM holds a payload (e.g., the instruction type) required for executing an instruction. When an instruction is stored in the IQ, its payload is thus stored in the RAM.
The RSD supports a speculative scheduling mechanism. The instruction replay logic (IRL) is a unit to support this mechanism, as described in detail in section IV.
We use a matrix-based wakeup logic with a random select logic in the current implementation. Fig. 2 shows a block diagram of the matrix-based wakeup logic in the four-entry IQ. In this example, the rename unit renames one instruction, and the IQ issues one instruction per cycle.
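To make the matrix-based wakeup concrete, the following Python sketch models a four-entry IQ that issues at most one instruction per call. It is a behavioral illustration only, not the RSD circuit: the class and method names are hypothetical, producers are assumed to have single-cycle latency (so consumers are woken as soon as the producer issues), and the select logic is approximated by a pseudo-random choice among ready entries, which is an assumption about what "random select" means.

```python
import random


class MatrixIssueQueue:
    def __init__(self, num_entries=4, rng=None):
        self.n = num_entries
        self.valid = [False] * num_entries
        # dep[i][j] is True while the instruction in entry i still waits
        # for the instruction in entry j to issue.
        self.dep = [[False] * num_entries for _ in range(num_entries)]
        self.rng = rng if rng is not None else random.Random(0)

    def dispatch(self, producers):
        """Allocate a free entry for an instruction that depends on the
        instructions currently held in the given producer entries."""
        entry = self.valid.index(False)  # raises ValueError if the IQ is full
        self.valid[entry] = True
        # Only producers still waiting in the IQ create a dependency bit.
        self.dep[entry] = [(j in producers) and self.valid[j] for j in range(self.n)]
        return entry

    def ready_entries(self):
        # Wakeup: an entry is ready when its row has no dependency bit set.
        return [i for i in range(self.n) if self.valid[i] and not any(self.dep[i])]

    def issue(self):
        """Select one ready entry (if any), free it, and clear its column
        so that dependent entries are woken up."""
        ready = self.ready_entries()
        if not ready:
            return None
        chosen = self.rng.choice(ready)
        self.valid[chosen] = False
        for row in self.dep:
            row[chosen] = False
        return chosen


iq = MatrixIssueQueue()
a = iq.dispatch([])       # no dependencies
b = iq.dispatch([a])      # waits for a
c = iq.dispatch([a, b])   # waits for a and b
issue_order = [iq.issue() for _ in range(4)]  # the last call finds an empty IQ
```

In this model, wakeup reduces to clearing one column and checking each row for zero, with no tag comparison; the payload RAM and the replay logic are outside the scope of the sketch.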
REFERENCES
D. B. A. Merchant and D. Sager, “Processor with a replay system that includes a replay queue for improved throughput,” Patent, US Patent 7,200,737, 2007.
D. B. A. Merchant, D. Sager and M. Upton, “Computer processor with a replay system having a plurality of checkers,” Patent, US Patent 6,094,717, 2000.
——, “Design space exploration of instruction schedulers for out-of-order soft processors,” in International Conference on Field-Programmable Technology, 2010.
K. Aasaraai and A. Moshovos, “Towards a viable out-of-order soft core: Copy-free, checkpointed register renaming,” in International Conference on Field Programmable Logic and Applications, 2009.
C. Celio, D. A. Patterson, and K. Asanović, “The berkeley out-of-order machine (boom): An industry-competitive, synthesizable, parameterized risc-v processor,” EECS Department, University of California, Berkeley, Tech. Rep., 2015.
A. M. Abdelhadi and G. G. Lemieux, “Modular multi-ported sram-based memories,” in Proceedings of the 2014 ACM/SIGDA International Symposium on Field-programmable Gate Arrays, 2014.
N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. H. Dwiel, S. Navada, H. H. Najaf-abadi, and E. Rotenberg, “Fabscalar: Composing synthesizable rtl designs of arbitrary cores within a canonical superscalar template,” in 38th Annual International Symposium on Computer Architecture (ISCA), 2011.
C. Computer Corporation, Alpha 21264 Microprocessor Hardware Reference Manual, 1999.
J. Glueck and A. Wisner, “The lizard core,” https://github.com/cornell-brg/lizard, 2019.
M. Goshima, K. Nishino, T. Kitamura, Y. Nakashima, S. Tomita, and S.-i. Mori, “A high-speed dynamic instruction scheduling scheme for superscalar processors,” in Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, 2001.
A. Johri, “Implementation of instruction scheduler on fpga,” 2011.
I. Kim and M. H. Lipasti, “Understanding scheduling replay schemes,” in 10th International Symposium on High Performance Computer Architecture, 2004.
G. Z. Chrysos and J. S. Emer, “Memory dependence prediction using store sets,” in 25th Annual International Symposium on Computer Architecture, 1998.
C. E. LaForest and J. G. Steffan, “Efficient multi-ported memories for fpgas,” in Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2010.
B. C. Lai and J. Lin, “Efficient designs of multiported memory on fpga,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2017.
C. E. Laforest, M. G. Liu, E. R. Rapati, and J. G. Steffan, “Multi-ported memories for fpgas via xor,” in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2012.
E. Morancho, J. M. Llaberia, and A. Olive, “Recovery mechanism for latency misprediction,” in Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques, 2001.
A. Moshovos, S. E. Breach, T. N. Vijaykumar, and G. S. Sohi, “Dynamic speculation and synchronization of data dependences,” in 24th Annual International Symposium on Computer Architecture, 1997.
onikiri, “Onikiri2: a cycle-accurate processor simulator,” https://github.com/onikiri/onikiri2, 2019.
A. Perais, A. Seznec, P. Michaud, A. Sembrant, and E. Hagersten, “Cost-effective speculative scheduling in high performance processors,” in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), 2015.
S. McFarling, “Combining branch predictors,” Technical Report TN-36, Digital Western Research Laboratory, Tech. Rep., 1993.
M. Rosiere, J. Desbarbieux, N. Drach, and F. Wajsburt, “An out-of-order superscalar processor on fpga: The reorder buffer design,” in Design, Automation Test in Europe Conference Exhibition (DATE), 2012.
M. Rosiere, J.-L. Desbarbieux, N. Drach, and F. Wajsburt, “Morpheo: A high-performance processor generator for a fpga implementation,” in Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP), 2011.
A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka, J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, and D. Burger, “A reconfigurable fabric for accelerating large-scale datacenter services,” in Proceeding of the 41st Annual International Symposium on Computer Architecture, 2014.
G. Schelle, J. Collins, E. Schuchman, P. Wang, X. Zou, G. Chinya, R. Plate, T. Mattner, F. Olbrich, P. Hammarlund, R. Singhal, J. Brayton, S. Steibl, and H. Wang, “Intel nehalem processor core made fpga synthesizable,” in Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2010.
P. G. Sassone, J. Rupley, II, E. Brekelbaum, G. H. Loh, and B. Black, “Matrix scheduler reloaded,” in Proceedings of the 34th Annual International Symposium on Computer Architecture, 2007.
Standard Performance Evaluation Corporation, “SPEC CPU 2006,” https://www.spec.org/cpu2006/, 2019.
W. W. Terpstra, “Opa: Out-of-order superscalar soft cpu,” in An open source digital design conference (ORCONF), 2015.
——, “SPEC CPU 2017,” https://www.spec.org/cpu2017/, 2019.
H. Wong, V. Betz, and J. Rose, “Microarchitecture and circuits for a 200 mhz out-of-order soft processor memory system,” ACM Trans. Reconfigurable Technol. Syst., 2016.
H. T.-H. Wong, “A superscalar out-of-order x86 soft processor for fpga,” Ph.D. dissertation, University of Toronto (Canada), 2017.
H. Wong, V. Betz, and J. Rose, “High performance instruction scheduling circuits for out-of-order soft processors,” in IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016.
S. Zhang, A. Wright, T. Bourgeat, and A. Arvind, “Composable building blocks to open up processor design,” in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018.
Xilinx, “7 series fpgas configurable logic block,” https://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf, 2019.