Design and Comparative Analysis of High Speed and Low Power ALU Using RCA and Sklansky Adders for High-Performance Systems

This study examines how different initial design decisions affect the area, timing, and power of technology-mapped designs. The ASIC design flow, the tools used during the flow, and the factors to consider in order to maximize the performance-to-power ratio are discussed. The ALU (Arithmetic Logic Unit) is a fundamental part of all processors. In this study, two ALUs were implemented using two different types of adder circuits: a Ripple Carry Adder (RCA) and a Sklansky adder. The Cadence EDA tools were used for the implementation. A comparative analysis of the two ALUs was conducted in terms of area, power, and timing. The ALU design was also used as an example to examine the whole workflow: the front end, by constructing a block schematic, and the back end, by floorplanning, placing, and routing the physical design.


I. INTRODUCTION
The purpose of this study was to examine the impact of initial design decisions on the area, timing, and power of ASIC designs and to introduce the ASIC design flow from beginning to end, i.e. from RTL design to floorplanning and layout. The Arithmetic Logic Unit (ALU) is a key element of a processor that performs arithmetic and logical operations on binary numbers [1][2][3][4][5][6][7]. Several 32-bit ALUs have been designed and coded in VHDL [8][9][10][11][12][13]. An ALU consists of three major parts: an adder, a logic unit, and a shifter. The adder is responsible for the addition and subtraction of signed and unsigned numbers. The logic unit is responsible for bitwise logical operations, and the shifter is responsible for the arithmetic and logical shift operations, as shown in Figure 1 [14][15][16][17][18][19][20]. In the first ALU design, a Ripple Carry Adder (RCA) was used, while in the second one, a Sklansky-type prefix-tree adder (SKL) was used. The adder is a very important component in digital systems, and much research has been conducted on various adder types to improve their speed and area requirements [21][22][23][24][25]. Apart from the adder, both ALU types contained the same logic and shifter blocks.
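The difference between the two adder types can be illustrated with a bit-level behavioral model. The following is only a sketch in Python, not the VHDL used in this study; the function names and the returned depth figure (number of carry stages or prefix levels, not a delay in picoseconds) are assumptions made for illustration.

```python
def ripple_carry_add(a, b, cin=0, width=32):
    """Add bit by bit; the carry passes through `width` full-adder stages."""
    total, carry = 0, cin
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        total |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (carry & (ai ^ bi))
    # (sum, carry-out, number of carry stages on the worst-case path)
    return total, carry, width


def sklansky_add(a, b, cin=0, width=32):
    """Add using a Sklansky (divide-and-conquer) parallel-prefix carry tree."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # bit propagate
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]  # bit generate
    G, P = g[:], p[:]                   # become group signals for bits [i:0]
    levels = (width - 1).bit_length()   # ceil(log2(width)) prefix levels
    for k in range(levels):
        for i in range(width):
            if (i >> k) & 1:            # upper half of a block of size 2^(k+1)
                j = ((i >> k) << k) - 1  # top bit of the lower half
                G[i] = G[i] | (P[i] & G[j])
                P[i] = P[i] & P[j]
    # carry into bit i: c[0] = cin, c[i+1] = G[i:0] | (P[i:0] & cin)
    c = [cin] + [G[i] | (P[i] & cin) for i in range(width)]
    total = 0
    for i in range(width):
        total |= (p[i] ^ c[i]) << i
    # (sum, carry-out, number of prefix levels)
    return total, c[width], levels
```

For 32-bit operands the ripple model reports 32 carry stages and the prefix model 5 levels. This count ignores the higher fan-out of the Sklansky tree, so it indicates a trend rather than an actual delay ratio.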
The design in Figures 1 and 2 has three main blocks: adder/subtractor, bitwise operations, and shifter. The adder block has two different implementations: a ripple-carry type and a Sklansky type. The ALU registers for the data inputs (A/B), the opcode (Op), and the output were positive-edge-triggered. Two design approaches were considered. In the first approach, shown in Figure 2, all the functional blocks compute their operations and a mux at the output decides which functional block is connected to the output. The other approach was to guard the individual functional blocks using a mux after the input registers. With this input mux, a functional block receives all zeroes as data input whenever the opcode is not meant for it, which reduces the combinational switching and hence the power consumption. The first approach was chosen, however, because including six 32-bit muxes would take up too much area and would probably defeat the purpose of saving power [25][26][27][28][29][30].

III. SYNTHESIS
After the successful simulation and verification of the design, the next step was to synthesize it. Logic synthesis is the process of converting the RTL description written in HDL into a gate-level netlist of standard cells. A 130 nm process technology was used for the standard cell library. At the end of synthesis, the RTL Compiler tool generated the netlist in the form of Verilog code. Opening this Verilog code revealed that it was entirely structural, describing the hardware in terms of connections between standard cells. After making sure that the ALU design was synthesizable, Static Timing Analysis was used to analyze the timing performance of the design without simulating it.
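The idea behind Static Timing Analysis can be sketched as a longest-path computation over the netlist graph. The Python sketch below is an illustration only, not the algorithm of the tool used in this study: the cell names and delays are hypothetical, and setup time, clock skew, and wire delay are ignored.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def arrival_times(delays, fanin):
    """Latest signal arrival time at the output of every cell.

    delays: cell name -> cell delay in ps (hypothetical values)
    fanin:  cell name -> list of cells driving its inputs
    """
    arrival = {}
    # static_order() yields every cell after all of its drivers
    for cell in TopologicalSorter(fanin).static_order():
        latest_input = max((arrival[d] for d in fanin.get(cell, [])), default=0)
        arrival[cell] = latest_input + delays[cell]
    return arrival


def worst_slack(delays, fanin, endpoints, clock_period_ps):
    """Slack = required time - arrival time; negative means a violation."""
    arrival = arrival_times(delays, fanin)
    return min(clock_period_ps - arrival[e] for e in endpoints)


# Hypothetical 4-stage carry chain, 100 ps per full adder
delays = {"fa0": 100, "fa1": 100, "fa2": 100, "fa3": 100}
fanin = {"fa1": ["fa0"], "fa2": ["fa1"], "fa3": ["fa2"]}
print(worst_slack(delays, fanin, ["fa3"], clock_period_ps=1250))  # 850
```

No input vectors are needed: the analysis only propagates worst-case delays through the graph, which is why timing can be checked without simulation.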
The RCA-based ALU was synthesized under various timing constraints, and Tables I-II show the results of those attempts. The RCA-based ALU did not meet the timing constraint for an 800 MHz processor. The best timing it achieved was around 2250 ps (444 MHz). The critical path of the RCA-based ALU lay in the RCA block for all the above implementations, though the exact path changed as stricter timing constraints were introduced. For the most part, the path consisted of the input carry rippling through the circuit from the LSB to the MSB. The SKL-based ALU was able to meet the timing constraint for an 800 MHz processor and could even be clocked at up to 1 GHz, although it pays in terms of area to reach this frequency. One major difference between the RCA and SKL designs was the critical path. For stricter timing constraints, the RCA design had its critical path in the adder block, whereas the SKL design had it in the shifter block. This is understandable, since the Sklansky adder is much faster than the RCA. Examining the 10 worst paths showed that the critical paths of the SKL design were evenly distributed between the shifter and adder blocks [31][32][33][34][35].

IV. COMPARISON
For both the RCA and SKL adders, the power at a toggling probability of 0.1 is greater than at 0.02, i.e. Power(0.1) > Power(0.02). The leakage power is almost the same at different toggling probabilities, but the dynamic power increases with the toggling probability, as Table III shows. This makes sense, as increasing the toggling probability increases the switching activity of internal nodes and nets, which results in increased power. SKL dissipates less power than RCA. The area of implementation depends on the effort used by the RTL Compiler and on the timing constraints. The SKL has a more regular, improved structure compared to the RCA. The clock capacitance of the RCA is higher than that of the SKL.
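The dependence of power on toggling probability follows from the usual first-order CMOS model, in which dynamic power is proportional to the switching activity while leakage is independent of it. The sketch below assumes this textbook model; the capacitance, supply voltage, and leakage values are hypothetical placeholders and not results from this study.

```python
def freq_mhz_from_period_ps(period_ps):
    """Clock frequency in MHz for a timing constraint given in ps."""
    return 1e6 / period_ps


def dynamic_power_w(alpha, c_total_f, vdd_v, freq_hz):
    """First-order dynamic power: alpha * C * Vdd^2 * f."""
    return alpha * c_total_f * vdd_v ** 2 * freq_hz


def total_power_w(alpha, c_total_f, vdd_v, freq_hz, leakage_w):
    """Leakage does not depend on the toggling probability alpha."""
    return leakage_w + dynamic_power_w(alpha, c_total_f, vdd_v, freq_hz)


# Hypothetical design parameters, for illustration only
C_TOTAL_F = 50e-12   # total switched capacitance (assumed)
VDD_V = 1.2          # supply voltage (assumed)
LEAKAGE_W = 10e-6    # leakage power (assumed)
FREQ_HZ = freq_mhz_from_period_ps(2250) * 1e6  # about 444 MHz

for alpha in (0.02, 0.1):
    print(alpha, total_power_w(alpha, C_TOTAL_F, VDD_V, FREQ_HZ, LEAKAGE_W))
```

Under this model, raising the toggling probability from 0.02 to 0.1 multiplies the dynamic component by five and leaves the leakage term unchanged, which matches the qualitative trend described above.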
Figures 3 and 4 show the area of implementation of the RCA and SKL ALUs, respectively. Figure 5 shows the area of implementation for both ALUs, while Figure 6 shows their power dissipation plots. Figure 7 compares the efficiency of both ALUs in terms of power consumed under different timing constraints. RCA is more power-efficient than SKL. SKL increases its power efficiency under lower timing constraints. The choice of the input sequence is very important for power analysis. The input sequence can be chosen from real logic simulations. Three Value Change Dump (VCD) files were created for the RCA design, with random, regular, and real trace vectors as sets of input patterns. The VCD file captures real switching information and is read into the RTL Compiler for the power analysis.

(Caption: Power comparison between RCA and SKL ALUs.)

Table IV indicates that random vectors consume more power, while the regular and real trace vectors consume less. The real trace vectors consume a little more power than the regular vectors. The flow for creating a use-case power analysis was to synthesize the design with a specified timing constraint and then compile the cell library files along with the synthesized Verilog netlist and the testbench. A simulation was then run on a test vector set, and the signal switching activity was written into a VCD file. This file was read by the RTL Compiler to obtain "real" switching information for the power analysis [28][29][30].
Comparing the power dissipation from the analyses made with random, regular, and real trace test vectors showed that the random set is the most power-consuming, while the regular and real trace sets are close, with the latter consuming a little more power. Random vectors were expected to consume more power. The real trace was expected to be the least power-consuming set, but it seems that the regular set has more regularity, and thus less switching, than the real trace. Power consumption depends on how the design is used. Assuming that a design meets its power constraints when it has only been analyzed with a limited and perhaps irrelevant test vector set can lead to faulty conclusions. Real use cases should be tried whenever possible, although this is not always feasible [35][36][37][38][39][40].
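Why random vectors lead to more switching than regular ones can be seen by counting bit flips between consecutive input vectors. The sketch below is a simplified stand-in for the VCD-based flow: it counts toggles on a 32-bit input bus only, not on the internal nets a VCD file would record, and the vector sets are generated here as hypothetical examples (a seeded random set and a simple counting pattern).

```python
import random


def toggles_per_cycle(vectors, width=32):
    """Average number of input bits that flip between consecutive vectors."""
    mask = (1 << width) - 1
    flips = [bin((x ^ y) & mask).count("1")
             for x, y in zip(vectors, vectors[1:])]
    return sum(flips) / len(flips)


rng = random.Random(0)                       # fixed seed for repeatability
random_vectors = [rng.getrandbits(32) for _ in range(1000)]
regular_vectors = list(range(1000))          # counting pattern (assumed)

print("random :", toggles_per_cycle(random_vectors))
print("regular:", toggles_per_cycle(regular_vectors))
```

Independent random 32-bit words flip about half of their bits on average, whereas a counting pattern flips only about two bits per step, so the random set drives far more switching activity into the design.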
V. PLACE AND ROUTE
Finally, the design was taken through the back-end flow using Cadence SoC Encounter. Floorplanning is the first step of the place and route flow and consists of specifying where to place the different blocks of the design. In this step, it is important to make a good modular partitioning of the design. Also, since the critical path is in the RCA block, that block should be allotted the largest area in terms of utilization percentage. Once the floorplanning was complete, the necessary stages of physical design followed: pin placement, power grid routing, standard cell placement, clock tree synthesis, and routing. Clock Tree Synthesis (CTS) was optimized in three steps: pre-CTS, CTS, and post-CTS. It is useful to perform a timing analysis between the optimization steps to see whether the timing constraints are met.
The actual CTS step is like mapping the design to actual cells. The positions of the clock buffers and the clock tree were checked, and it was found that the design had one level of buffers. The timing was checked again after this step, but the constraint was still not met. The last step, post-CTS optimization, was then carried out to optimize the design based on the existing clock tree. The timing was checked again and was found to have improved considerably: the slack was -0.466, compared to -2.259 before. Routing and post-route optimization were then performed. The clock and reset signals should have the highest routing priority because they have to be provided to every block in the design and are therefore critical. Filler cells were used to fill the gaps between cells and were connected to the power rails. Layout verification was then performed. Four MinCut violations were found, which were later removed using the fixMinCutVia command. The final timing analysis showed a slack of -0.395.

VI. CONCLUSION
An ALU design was taken through the EDA flow, from the basic idea and block schematic through RTL, verification, and back-end design with place and route. The synthesis results showed that the area of the RCA ALU changes more rapidly than that of the SKL ALU, because the former requires more optimization effort to meet the stricter timing constraint, at the expense of more area. The SKL ALU uses a fast adder and easily meets the stricter timing constraint without increasing area and power consumption. It was also observed that the RCA ALU used less area and power than the SKL ALU, so it is the better choice when the timing constraint is not strict, as it can be a more efficient design in terms of area and power consumption.
Furthermore, it was very instructive to see how different initial decisions affected the area, timing, and power of the technology-mapped design. A well-structured design and knowledge of the circuit help to make the most of the back-end design steps. This makes it easier to perform good floorplanning, which helps the tools perform better on complex tasks.