Speculation Techniques for High Level Synthesis of Control Intensive Designs<sup>\*</sup> Sumit Gupta Nick Savoiu Sunwoo Kim Nikil Dutt Rajesh Gupta Alex Nicolau

# ICS Technical Report

Technical Report #00-40 December 2000

Center for Embedded Computer Systems Department of Information and Computer Science University of California, Irvine. http://www.cecs.uci.edu/~spark {sumitg,savoiu,sunwk,dutt,rgupta,nicolau}@cecs.uci.edu

Information and Computer Science University of California, Irvine

<sup>\*</sup>A version of this paper was published in the Design Automation Conference, June 2001

## Contents

| 6        | Conclusions and Future Work                                                                                               | 9                  |
|----------|---------------------------------------------------------------------------------------------------------------------------|--------------------|
| 5        | Experiments and Results                                                                                                   | 7                  |
| 4        | Priority-based Global List Scheduling Heuristic4.1 Reverse Speculation4.2 Code Restructuring by Early Condition Execution | <b>5</b><br>5<br>6 |
| 3        | Speculation in High Level Synthesis                                                                                       | 3                  |
| <b>2</b> | A Model for Control Intensive Designs                                                                                     | 3                  |
| 1        | Introduction                                                                                                              | <b>2</b>           |

## List of Figures

| 1 | The Hierarchical Task Graph Intermediate Representation of a For Loop                        | 3 |
|---|----------------------------------------------------------------------------------------------|---|
| 2 | Hierarchical Task Graph representation of the Benchmark "waka" [5]. The numbers indicate     |   |
|   | the priorities of the operations                                                             | 4 |
| 3 | (a) An example design along with the control-data flow graph (b) speculative execution of    |   |
|   | operations from within the branches                                                          | 5 |
| 4 | Reverse Speculation for the waka benchmark (a) original design (b) after reverse speculation | 6 |
| 5 | Code Restructuring by Early Condition Execution (a) original design (b) design after early   |   |
|   | condition execution                                                                          | 7 |

#### Abstract

The quality of synthesis results for most high level synthesis approaches is strongly affected by the choice of control flow (through conditions and loops) in the input description. In this paper, we explore the effectiveness of various types of code motions, such as moving operations across conditionals, out of conditionals (speculation) and into conditionals (reverse speculation), and how they can be effectively directed by heuristics so as to lead to improved synthesis results in terms of fewer execution cycles and fewer number of states in the finite state machine controller. We also study the effects of the code motions on the area and latency of the final synthesized netlist. Based on speculative code motions, we present a novel way to perform early condition execution that leads to significant improvements in highly control-intensive designs. Overall, reductions of up to 38 % in execution cycles are obtained with all the code motions enabled.

#### 1 Introduction

High-level synthesis of digital systems from a behavioral description has received significant attention in the last 15 years [1, 2]. However, commercial synthesis tools have gained limited acceptance among designers, primarily due to poor synthesis results in the presence of conditionals and especially loops, and lack of controllability of quality of results.

The quality of results for control intensive designs is significantly affected by the control flow in typical applications. The control flow in a design is also affected by the programming style. Some work has focussed on reducing the sensitivity of synthesis to programming style [3, 4]. For effectiveness, a high-level synthesis (HLS) system has to make the right tradeoffs among available time, performance, and area costs. Furthermore, the presence of complex control flow significantly effects the quality of synthesis results. In this paper, we propose techniques to move operations across control structures (conditionals and loops) that enable HLS algorithms to make these tradeoffs effectively. Scheduling algorithms can use these beyond-basic-block code motion techniques like speculative execution to extract the inherent parallelism in the design and increase resource utilization.

Speculation is not a new concept; indeed, code motions and speculation are supported either partially or fully in several previously presented synthesis systems [5, 6, 7, 8]. CVLS [5] uses condition vectors to improve resource sharing among mutually exclusive operations. Radivojevic et al [6] present a symbolic formulation which generates an ensemble schedule of valid and scheduled traces. The Wavescheduling approach [8] incorporates speculative execution into high level synthesis to achieve its objective of minimizing the expected number of cycles. Santos et al [7] support generalized code motions in a synthesis system where operations can be moved and scheduled irrespective of their position in the input description. However, several of these approaches either do not handle loops or place restrictions on the nesting of loops within conditionals or vice versa. Furthermore, these approaches do not maintain information about hierarchical structuring of the code, which leads to expensive and inefficient code motion techniques (see Section 2).

Most previous works present and compare results in terms of the schedule lengths. This has prevented a clear analysis of the effects of scheduling and code motions on the area and latency of the hardware generated. Because of this, the control logic overheads are usually ignored despite the fact that industry experience has often shown that critical paths in a high performance design lie in the control unit.

We are developing a modular and extensible high-level synthesis research system called *Spark*. We have used parallelizing compiler technology developed previously in our group [9, 10] and instrumented and modified it for high-level synthesis. Since one of the outputs of the system is synthesizable register-transfer level(RTL) VHDL, the system enables evaluation of the effects of several coarse and fine-grain optimizations on logic synthesis results. The input language for Spark is ANSI-C, currently with the restrictions of no pointers and no function recursion. Spark provides an integrated flow from architectural design to logic synthesis.

This paper presents some basic code motion transformations and a simple heuristic that judiciously chooses which code motion should be applied, while making performance, resource and area cost trade-offs. These judicious choices of the various code motions to control the quality of synthesis results is the chief contribution of this paper. The results section shows the effects of the code motions on the quality of synthesis results and demonstrates which code motions are most effective for synthesis.

### 2 A Model for Control Intensive Designs

The Spark system stores the behavioral description in an intermediate representation (IR) which retains all the information given in the input description. This is critical for enabling coarse-level transformations and making global decisions about code motion. The IR consists of basic blocks encapsulated in *Hierarchical Task Graphs* (HTGs) [11, 9]. In addition to operation level and basic block level information, HTGs also maintain coarse, high level information about compound nodes. That is, the basic blocks comprising an if-then-else conditional block or a for loop are aggregated into a compound HTG node.

Figure 1 illustrates the representation of a For loop HTG. The member basic blocks of this HTG are the initialization basic block, the conditional check block, the body of the loop and the loop index increment. Similarly, in Figure 2 the benchmark waka introduced by Wakabayashi et al [5] is shown as represented by HTGs. In this figure, the dashed arrows indicate control flow and the solid lines indicate data flow. Operations are denoted by circular nodes with the operator sign within and the triangle indicates a conditional check. The number next to each operation node indicates the priority of the operation (see Section 4).



Figure 1: The Hierarchical Task Graph Intermediate Representation of a For Loop

The structural nature of the HTGs retains information about the mutual exclusiveness of operations, and can be used for making global decisions about code motion and speculation. HTGs are key to fast and efficient code motion techniques such as *Trailblazing* [9] and *Resource-Directed Loop Pipelining* [10], both of which are implemented in Spark.

## 3 Speculation in High Level Synthesis

In the presence of control structures, maximal parallelism can be extracted with the use of code motion, i.e., code restructuring by the movement of operations within and beyond basic blocks,



Figure 2: Hierarchical Task Graph representation of the Benchmark "waka" [5]. The numbers indicate the priorities of the operations

and across and out of control structures such as conditionals and loops. Code motion exposes concurrency, hence, increasing opportunities for resource utilization without an increase in latency. Also, code motions restructure the control flow in the design and lead to reduced control costs. These reduced control costs can be quantified in terms of the number of states in the finite state machine controller.

One of the key enabling transformations for efficient code motion is speculation. Speculative execution refers to the unconditional execution of instructions that were originally supposed to have executed conditionally. In the compiler context, if the condition evaluates to false, compensation code has to be executed. However, in the hardware synthesis context, we can simply choose to either commit the results or discard them based on the evaluation of the conditions.

The example in Figure 3 demonstrates speculation. In Figure 3(a), variables d and g are calculated based on the result of the calculation of the conditional c. Since d and g are executed on different branches of a conditional, these two operations are *mutually exclusive*. They can, hence, be scheduled on the same hardware resource as shown in the corresponding circuit in Figure 3(a).

Now, consider that enough resources (an additional adder) are available; then the operations within the conditionals can be calculated *speculatively* and concurrently with the calculation of the conditional c as shown in Figure 3(b). Based on the evaluation of the conditional, one of the results will be discarded and the other committed. It is evident from the corresponding hardware circuits that the longest path gets shortened from being a sequential chain of a comparison followed by an addition to being a parallel computation of the comparison and the additions followed by a multiplexing of the results.

However, this example also demonstrates the additional costs of speculaton. Speculation leads to requires more resources and more storage for the intermediate results. So, uncontrolled aggressive speculation can lead to worse results due to extra resources required. On the other hand, resources that are idle can be used to execute operations speculatively. Hence, speculation along with other code motions needs to be directed by a global scheduling heuristic.



Figure 3: (a) An example design along with the control-data flow graph (b) speculative execution of operations from within the branches

#### 4 Priority-based Global List Scheduling Heuristic

Scheduling is the task of assignment of operations to control steps or time intervals so that the allocated resources can compute the operations assigned to each step [1, 2, 12].

For the purpose of evaluating the various code motion transformations, we have chosen a *Priority-based* global list scheduling algorithm, although the transformations presented here can be applied to other scheduling heuristics as well. Since our objective is to minimize the *longest delay* through the design, we assign a priority to each operation which is one more than the maximum of the priorities of all the operations that use its result. Hence, operations that produce outputs have priority zero, and those which depend on them have priority one and so on, as shown in Figure 2 for the waka benchmark [5]. The priority assignment can also be changed to minimize a different cost function, such as average delay.

The Priority algorithm determines the operation with the highest priority which is ready to be scheduled (data dependencies are satisfied) and then employs techniques such as speculation and dynamic renaming [13] to move the operation so as to schedule it on the available resource. To further aid in improving the resource utilization, we have implemented two more code motion techniques, namely, *reverse speculation* and *early condition execution*. These techniques are discussed in the next two sections.

#### 4.1 Reverse Speculation

Synthesis results can vary widely due to syntactic variance of the input description. The Priority scheduling algorithm can determine if it is more "profitable" to speculatively execute operations which are within conditionals. Conversely, a situation may arise that an operation outside a conditional has lower priority than another operation inside the conditional. The operation outside the conditional can then be *reverse speculated* into the conditional, so that the resources freed by this reverse speculation can be better utilized by higher priority operations.

This is demonstrated in Figure 4. The operations g and e lie on the longest path of the design and the operation c does not. Hence, g and e have higher priority than operation c. Operation cis reverse speculated into the conditional as shown in Figure 4(b). Also, the reverse speculation algorithm detects that operation c is required only in one of the branches of the conditionals and



Figure 4: Reverse Speculation for the waka benchmark (a) original design (b) after reverse speculation

hence, moves it only into that conditional. This is determined by checking which basic block(s) have operations that use the result of the operation c and moving c only to those basic block(s).

#### 4.2 Code Restructuring by Early Condition Execution

Reverse speculation can be coupled with another novel transformation that we introduce, namely, *early condition execution*. This transformation involves restructuring the original code, so as to execute conditional checks as soon as possible. This in effect means that the conditional check is "moved up" in the design, and hence, all operations before the conditional are reverse speculated into the conditional. The motivation for this technique comes from the fact that a conditional has high priority since all the operations in its conditional branches depend on it.

Early condition execution is demonstrated for the *waka* benchmark in Figure 5. In this figure, the operations p and q which calculate the conditions are executed as soon as possible and hence the conditionals based on them can be checked early. Unscheduled operations from basic blocks preceding the conditional (d, k and c) are *reverse speculated* into the conditional branches as shown in Figure 5(b). Note that operations d and c are reverse speculated into only those branches which use their results.

This also leads to more efficient resource sharing since the operations on either side of the conditional are mutually exclusive. By this technique and the use of HTGs, we are able to implicitly extract and use information about mutual exclusivity of operations without using computationally expensive Binary decision diagram (BDD) packages [6, 7].

These transformations are implemented in the Spark system and the Priority list scheduling algorithm determines when to apply the transformations based on the priorities of the operations. Hence, at each cycle, the highest priority "available" operation is scheduled on the resource by employing the necessary code motions. Operations left unscheduled at the end of a basic block are



Figure 5: Code Restructuring by Early Condition Execution (a) original design (b) design after early condition execution

moved down into the next basic block or reverse speculated into the conditional branches, as the case may be.

#### 5 Experiments and Results

This section studies the effects of the various code motions directed by the Priority list scheduling algorithm on the synthesis results. The results are presented in terms of the number of states in the finite state machine in the controller of the generated RTL VHDL and the cycles on the longest path in the design. For loops, the longest path length of the loop body is multiplied by the number of loop iterations. Longest path length is equivalent to the execution cycles of the design. The results of logic synthesis are also presented to evaluate the effects of the transformations on area and latency of the synthesized design. Spark is being developed in C++ on both the Sun Solaris and Microsoft Windows platforms. It uses *Graphviz* [14] as its graphical user interface and for graph layout and visualization.

Tables 1 and 2 demonstrate the effectiveness of the various types of code motion described in this paper. The benchmarks used are the *Encoder* from the ADPCM benchmark and the *Prediction* block from the MPEG-1 algorithm (available for download from Spark's web page). The ADPCM benchmark has 37 Basic Blocks and MPEG Prediction has four functions, *calc\_forward\_motion*, *calcid* and *pred* with 36, 36, 1 and 80 basic blocks respectively (only non-empty basic blocks are counted). The *pred* function has a 3-case switch statement which we partitioned into two functions *pred0\_1* and *pred2* since logic synthesis tools cannot handle the complete function. Since the first two functions are extremely similar and the *calcid* function has only 1 basic block, we only present results for the functions *calc\_forward\_motion*, *pred0\_1* and *pred2*. The resources used are indicated in the tables; *ALU* does add and subtract, = is a comparator, [] is an array address decoder and << is a shifter. All components used have single

|                 | ADPCM Encode (37 BBs) |                |  |  |  |
|-----------------|-----------------------|----------------|--|--|--|
| Type of         | 1ALU, 2 ==, 2[], 1 << |                |  |  |  |
| Code Motion     | # States Long. Path   |                |  |  |  |
| Within BBs      | 32                    | 313            |  |  |  |
| + across HTGs   | 27(-15.6%)            | 262(-16.3%)    |  |  |  |
| +speculation    | 23(-14.8%)            | 222(-15.3%)    |  |  |  |
| +reverse spec   | $23(\ 0\ \%)$         | $222(\ 0\ \%)$ |  |  |  |
| + early cond    | 20(-13.0%)            | 192(-13.5%)    |  |  |  |
| Total Reduction | 37.5~%                | 38.7~%         |  |  |  |

Table 1: Comparison of various types of code motion for the ADPCM Encode benchmark

|                 | MPEG Prediction Block $3ALU, 1*, 2[], 3 <<, 2 ==$ |            |                     |              |                 |              |  |  |
|-----------------|---------------------------------------------------|------------|---------------------|--------------|-----------------|--------------|--|--|
|                 | calc_forw                                         | (36  BBs)  | pred0_1             | (30  BBs)    | pred2 ~(52 BBs) |              |  |  |
|                 | # States Long. Path                               |            | # States Long. Path |              | # States        | Long. Path   |  |  |
| Within BBs      | 35 35                                             |            | 44 2588             |              | 48              | 5391         |  |  |
| +across HTGs    | 25(-28.6%)                                        | 25(-28.6%) | 41(-6.8%)           | 2396(-7.4%)  | 44(-8.3%)       | 5006(-7.1%)  |  |  |
| +speculation    | 24(-4%)                                           | 24(-4%)    | 28(-31.7%)          | 1564(-34.7%) | 31(-29.5%)      | 3278(-34.5%) |  |  |
| +reverse spec   | 24(0%)                                            | 24(0%)     | 28(0%)              | 1564(0%)     | 31(0%)          | 3278(0%)     |  |  |
| +early cond     | 23(-4.2%)                                         | 23(-4.2%)  | 27(-3.6%)           | 1563(-0%)    | 31(0%)          | 3278(0%)     |  |  |
| Total Reduction | l Reduction <b>34.3</b> % <b>34.3</b> %           |            | 32.6~%              | $32.3\ \%$   | 37.5 %          | 36.8 %       |  |  |

Table 2: Comparison of various types of code motion for the MPEG Pred benchmarks

cycle execution time. The number of resources has been chosen so as to get shortest schedule length possible.

Tables 1 and 2 compares results for code motions only within basic blocks (first row), across Hierarchical Task Graphs (HTGs), i.e., across entire if-then-else structures and loops but without speculation (second row), then with speculation enabled (third row), with reverse speculation enabled as well (fourth row) and the fifth row also performs the early condition execution code transformation<sup>1</sup>. The number of states in the FSM and the longest path length (execution cycles) are presented along with the percentage reductions of each row over the previous row in parentheses.

Significant reductions are achieved when code motions across HTGs are enabled and then again when speculation is enabled. The tables demonstrate that reverse speculation on its own does not lead to improvements, but when performed as a part of early condition execution, additional reductions of up to 13 % can be achieved. The maximum benefits by this transformation are seen in the ADPCM benchmark since this benchmark is highly control intensive with nearly as many conditional checks as operations. The total reduction in execution cycles achieved with all the transformations enabled over code motion within basic blocks is up to 38 %.

We have synthesized the RTL VHDL generated by Spark using Synopsys's *Design Compiler* and present the results for two functions from the MPEG Pred benchmark in Tables 3 and 4. The columns in these tables present the results the critical path length (c nanoseconds), the execution cycles of the longest path (l), the total delay through the design (c.l nanoseconds) and the unit area (total of the combinational and non-combinational areas). The unit area is based on the technology

<sup>&</sup>lt;sup>1</sup>Although both benchmarks have loops, we have not applied any loop transformations, across loop boundary code motion or loop unrolling for these experiments

| T                    | calculate_forward_motion func(36 BBs) |                     |               |                       |  |  |
|----------------------|---------------------------------------|---------------------|---------------|-----------------------|--|--|
| Type of              | Crit. P.                              | rit. P. Long. Delay |               | Unit                  |  |  |
| Code Motion          | (c ns)                                | P.(l)               | (c.l nanosec) | $\operatorname{Area}$ |  |  |
| Within BBs           | 22.30                                 | 35                  | 780.5         | 10752                 |  |  |
| +across HTGs+spec    | 22.81                                 | 24                  | 547.4(-29.8%) | 14813(+27.4%)         |  |  |
| +rev spec+early cond | 22.49                                 | 23                  | 517.3(-5.5%)  | 14261(-3.7%)          |  |  |

Table 3: Effect of code motion on the logic synthesis results for the *calculate\_forward\_motion* function of the MPEG Pred benchmark

| Tuna of              | pred_case_0_1 function(30 BBs) |       |               |               |  |  |
|----------------------|--------------------------------|-------|---------------|---------------|--|--|
|                      | Crit. P.                       | Long. | Delay         | Unit          |  |  |
| Code Motion          | (c ns)                         | P.(l) | (c.l nanosec) | Area          |  |  |
| Within BBs           | 22.31                          | 2588  | 57738         | 276077        |  |  |
| +across HTGs+spec    | 22.46                          | 1564  | 35127(-39.2%) | 286019(+3.5%) |  |  |
| +rev spec+early cond | 21.86                          | 1563  | 34167(-2.7%)  | 290882(+1.7%) |  |  |

Table 4: Effect of code motion on the logic synthesis results for the *pred\_case\_0\_1* function of the MPEG Pred benchmark

library being used; we have used the LSI-10K library that is distributed with Synopsys tools. The percentage reductions in the delay and area over those in the previous row are given in parentheses.

The first row in these tables give results for only within basic block code motions, the second row with across HTGs and speculation also enabled and the third row with all code motions enabled. These tables shows that total delay of the design can be reduced significantly by speculation and early condition execution. Speculation comes with a higher area cost since we have not done any binding and rely on the logic synthesis tool for this. This leads to higher storage, interconnect and control costs. However, the critical path length of the design remains almost constant and hence, the clock cycle length does not increase due to these techniques.

Table 5 compares the results of our system, Spark with Brewer [6] and Jess [7] for the benchmarks waka [5], rotor and s2r [6] for the indicated resources. The columns present the number of basic blocks, the longest path length/cycles and the CPU run time in seconds of Brewer's system on a SUN Sparc Station 10 (probably running at 33 or 66 Mhz) and of Spark on a 170 Mhz SUN Sparc Station 5. The results show that the Spark system produces similar results when compared to the other systems for these benchmarks. Although we do not have the run times for the system "Jess", this system uses combinatorial approaches such as genetic algorithms or simulated annealing for search space exploration which usually have high run times. This table demonstrates that by making judicious choices of code motions, using even a simple priority list scheduling heuristic, the Spark system is able to produce good results without resorting to more complex scheduling heuristics.

### 6 Conclusions and Future Work

In this paper, we have presented a comparative study of the effects of various kinds of code motions on the quality of results of high-level synthesis and studied which are most effective. The results demonstrate that code motions across entire conditional blocks and speculative code mo-

|       |     |                 | Jess                 | Brewer               |       | Spark                |      |
|-------|-----|-----------------|----------------------|----------------------|-------|----------------------|------|
| Bench | #   | Resources       | $\operatorname{Sch}$ | $\operatorname{Sch}$ | Run   | $\operatorname{Sch}$ | Run  |
| mark  | BBs |                 | Len                  | Len                  | Time  | Len                  | Time |
| waka  | 9   | 1+,1-,2=        | 7                    | 7                    | -     | 7                    | 0.1  |
| rotor | 11  | $2+-,2^{*},1[]$ | 7                    | 7                    | 13.7  | 7                    | 0.16 |
| s2r   | 21  | $3+-,2^*,1[]$   | 8                    | 9                    | 177.9 | 9                    | 0.38 |

Table 5: Comparison with previous work

tions lead to most improvements. Early condition execution can provide an additional reduction up to 13 % in execution cycles for highly control-intensive designs. We have tried to compare our approach to similar works using commonly used benchmarks. However, this is not always possible mainly because of our focus on control intensive and moderately complex designs. We are currently implementing resource binding to reduce the area increase due to code motions.

### References

- [1] D. D. Gajski, N. D. Dutt, Allen C-H. Wu, and Steve Y-L. Lin. *High-Level Synthesis: Introduction to Chip and System Design.* Kluwer Academic, 1992.
- [2] R. Camposano and W. Wolf. *High Level VLSI Synthesis*. Kluwer Academic, 1991.
- [3] V. Chaiyakul, D.D. Gajski, and L. Ramachandran. Minimizing syntactic variance with assignment decision diagrams. Technical Report ICS-TR-92-34, UC Irvine, 1992.
- [4] M.Janssen, F.Catthoor, and H.De Man. A specification invariant technique for operation cost minimisation in flow-graphs. In *International Symposium on High-level Synthesis*, 1994.
- [5] K. Wakabayashi and H. Tanak. Global scheduling independent of control dependencies based on condition vectors. In *Proceedings of the Design Automation Conference*, 1996.
- [6] I. Radivojevic and F. Brewer. A new symbolic technique for control-dependent scheduling. IEEE Transactions on CAD, January 1996.
- [7] L.C.V. dos Santos and J.A.G. Jess. A reordering technique for efficient code motion. In Design Automation Conference, 1999.
- [8] G. Lakshminarayana, A. Raghunathan, and N.K. Jha. Incorporating speculative execution into scheduling of control-flow intensive behavioral descriptions. In *Proceedings of the Design Automation Conference*, 1998.
- [9] A. Nicolau and S. Novack. Trailblazing: A hierarchical approach to percolation scheduling. In *International Conference on Parallel Processing*, 1993.
- [10] S. Novack and A. Nicolau. An efficient, global resource-directed approach to exploiting instruction-level parallelism. In Conference on Parallel Architectures and Compilation Techniques, 1996.
- [11] M. Girkar and C.D. Polychronopoulos. Automatic extraction of functional parallelism from ordinary programs. *IEEE Transactions on Parallel and Distributed Systems*, March 1992.
- [12] P. G. Paulin and J. P. Knight. Force-directed scheduling in automated data path synthesis. In Design Automation Conference, 1987.
- [13] S.-M. Moon and K. Ebcioglu. An efficient resource-constrained global scheduling technique for superscalar and vliw processors. In *International Symposium on Microarchitecture*, 1992.
- [14] AT&T Research Labs. Graphviz open source graph drawing software. http://www.research.att.com/sw/tools/graphviz/.