Feature-rich, application-pervasive modern electronics increasingly require some form of digital signal processing. Today, mobile phones, hearing aids, Internet telephones, medical imaging systems, voice-activated appliances and toys, televisions, video recording and audio systems are heavily DSP-based.

Simultaneously, the heat is on to make these systems ever smaller, frequently integrating the entire system functionality on a single piece of silicon, or system-on-a-chip (SoC). After all, nobody wants a hearing aid the size of a Palm Pilot.

Let's talk digital signal processing

DSPs take analog data, convert it to a digital format, process it, filter it, packetize it and/or transmit it, and then turn it back into an analog format for human consumption. They do so using complex algorithms. Moving these complex DSP algorithms into an SoC requires new design methodologies that are just beginning to emerge.

Most SoCs are implemented in custom application-specific integrated circuits (ASICs). However, with the advent of large system-level field-programmable gate arrays (FPGAs), SoC designs are also making their way into FPGAs. Putting a system on an ASIC is a challenging proposition. Putting one on an FPGA is more challenging because of the structure of FPGA architectures.

Form factors and functions

A design optimized for an FPGA is generally different from that which would be used in an ASIC. This is because the two types of devices are structured differently. In contrast to gate arrays that have a single gate as the basic logic unit, FPGAs have the equivalent of 20 to 40 ASIC gates per logic unit, each typically containing a four-input look-up table (LUT), carry logic, muxes and a flip-flop (register). Newer FPGA architectures also have blocks of random access memory (RAM) distributed throughout the array or located on the periphery of the device. Although FPGA logic units can be used to implement memory, doing so is usually silicon-inefficient and expensive. As a result, it is important to achieve the most efficient use of the on-FPGA memory. Sadly, most FPGA design and logic synthesis tools provide little or no support for the embedded static RAM (SRAM) blocks. Until now, they have had to be designed-in manually, taking a lot of extra time and effort.

From algorithm to RT level

Virtually all DSP-based system designs are developed purely as software in some variation of the C-language, using floating-point math and commercially available design tools. These programs are initially developed, executed and debugged using fast PCs or DSP processors. However, DSP processors are large, expensive and power-consuming. Eventually, these designs must be implemented in hardware, frequently with the objective of shrinking them into a single high-performance IC that consumes as little power as possible.

Unfortunately, while software is written in C, hardware designs are written in hardware description languages such as Verilog and VHDL. Historically, the design methodology for implementing a C-language DSP algorithm in hardware has been to manually re-render the algorithm in a fixed-point format and then manually rewrite it again in a register-transfer (RT) level hardware description language. This is a daunting task that can take weeks or months. What's more, because the architecture of the hardware design is explicit in any RT-level description, this methodology does not allow the designer to explore alternative hardware architectures to ensure that the hardware design is optimal for the particular application (e.g. low-power, high-performance).

When FPGAs are the silicon medium, the design challenge becomes greater because the logic structures are fixed and the available on-chip memory is limited.

Enter architectural synthesis

A new class of EDA tools has been introduced in the past year that helps designers define hardware architectures for FPGAs and ASICs from SystemC or ANSI C software. Such software gives the designer complete control of the process, including specifying which resources will be used to execute the DSP programs. These can include arithmetic logic units (ALUs), multipliers, adders, RAM, ROM, registers and special custom resources that execute entire algorithms (e.g. fast Fourier transforms (FFTs) or discrete cosine transforms (DCTs)) in a single clock cycle. When the optimum architecture is found, the tool automatically generates the RT-level Verilog or VHDL. However, the original design remains in the C-language so it may be re-used and its hardware architecture re-optimized for other applications.

For example, a high-performance hardware architecture can be created from a C-language Viterbi algorithm for a high-speed satellite application. Then, without any modification, the same C-language Viterbi algorithm can be re-rendered in a low-power architecture for a mobile phone. The C-language algorithm never changes; only the hardware architecture does.

A technique called dataflow analysis is used during architectural synthesis to automatically identify and exploit data dependencies in the code. Operations are assigned to the datapath resources to maximize parallelism in the hardware design. The architectural synthesis tool then generates a hardware architecture that includes a very long instruction word (VLIW) controller and datapath resources, and schedules the algorithmic operations on the hardware resources. The designer may modify the architecture, scheduling or resource allocation at will. Once the designer is satisfied, the tool automatically generates the RT-level Verilog or VHDL hardware description.

In contrast to HDL-based designs, which can take weeks or months for a single iteration, C-language-based design can be iterated in as little as a few minutes using architectural synthesis. Because design iterations are so fast, designers have the freedom to explore and analyze many architectures to achieve an optimized design.

Putting DSP algorithms in FPGAs

DSP algorithms have several characteristics that affect their implementation in hardware, and especially in FPGAs. First, they are computationally intensive and require lots of registers. Second, they tend to use non-standard word widths (e.g. 13 or 33 bits) that frequently do not fit in the fixed memory architectures of on-FPGA SRAM (e.g. 8, 16, 32 bits). Third, because they must perform repetitive calculations on samples or frames in real-time, they demand high processing throughput. If the processing of the first sample or frame is not completed in time, the next one will be lost. Finally, DSP algorithms must often store the results of one frame or sample for comparison to the next frame or sample. Because FPGAs have limited memory, these storage requirements can force the use of external, and slower, off-chip memory. As a result, on-FPGA memory is at a premium and needs to be conserved whenever possible. These characteristics mandate special design techniques and tools for FPGA implementations.

The nature of DSPs

The computationally intensive nature of DSP algorithms means they rely heavily on the use of registers to store intermediate values. In fact, every datapath resource (ALU, multiplier) in a DSP-like architecture is likely to have at least one register file associated with it. Implementing a one-bit register in a gate array takes about eight or nine gates. FPGAs, however, have pre-defined flip-flops included in their high-level logic units. Many devices have one flip-flop per logic cell (LC) or logic element (LE). Registers that are wider than one bit can be built by chaining together several logic units.

If the registers are specified in the HDL description as registers, the logic synthesis tool will synthesize them directly into the registers in the logic units on the FPGA. For example, a 16 × 16 multiplication needs two 16-bit registers to store the left and right operands. Each 16-bit register can be created by chaining together 16 LCs or LEs and using the flip-flop in each logic unit to store one bit of the 16-bit word. This solution requires 32 logic units just to store the two operands. It can be an expensive one because, in most cases, those logic resources are no longer available for logic operations. In other words, using logic elements just for their registers can waste valuable logic resources.

A more efficient means of implementing the 16-bit registers is to use a single LUT for each one. Because each LUT can be configured as a 16 × 1-bit RAM, a single LUT can be used to store each 16-bit operand. Software with this capability can evaluate the number and size of the required registers in the design and optimize their implementation between built-in registers and LUT RAM. In this example, the software reduces the resources required for the two 16-bit registers from 32 logic units to only two, a 93% improvement.

Implementing registers with the flip-flops in the logic units, by contrast, wastes all the other resources in each logic element (e.g. the LUT, carry logic and muxes). Because an FPGA logic element contains the equivalent of between 25 and 40 gates, at least 15 equivalent gates of logic resources can be wasted for every logic unit whose flip-flop is used to implement a register bit in the design. Again, depending on the software, this use of LUT RAM for registers, where appropriate, can be automated.

For the utmost in efficient design, the latest generation of software includes tools that examine all the register files in the hardware architecture and optimize their implementation across LUT RAMs and/or logic unit registers. For example, A|RT's 3GPP turbocoder IP core has 1,504 one-bit registers. If these one-bit registers were specified as such in the HDL description of the design, it would require 94 Virtex-II slices just for the implementation of the registers, plus about 4,700 additional slices to implement the datapath logic and microcode ROM. As implemented, the coder requires only 3,071 slices and 16 block RAMs in an XCV400E.

Non-standard word widths

DSP algorithms are typically developed using floating-point math because of its precision and wide dynamic range. However, when the algorithms are implemented in hardware, they are converted to a fixed-point representation to conserve silicon. Because every bit in every word takes up silicon, the fixed-point words are kept as small as possible.

The words must also be wide enough, with enough precision, to reproduce the behavior of the floating-point algorithm within the specified dynamic range and signal-to-noise ratio. The smallest fixed-point word width that provides an accurate result may not be a standard eight, 16 or 32 bits. The fixed-point words could be any width: 23, 17 or 51 bits, for example. Depending on what is required at any point in the execution of the algorithm, the word width may even change within the same algorithm. Arithmetic performed at a non-standard word width produces values that must, in turn, be stored at a non-standard word width.

Storing variables with unusual word widths is not a problem in an ASIC because the designer has complete flexibility to configure the silicon any way he likes. In an FPGA, however, the on-chip memory is configured in fixed standard widths of one, two, four, eight, 16 or, in some cases, 32 bits, implemented in discrete blocks of 4,096 or 2,048 bits, depending on the device family. Storing words that do not fall within these standard configurations can result in a less-than-optimal use of the FPGA memory. In fact, it can be wasteful.

Example A

Suppose a designer specifies a 33-bit-wide word in the HDL description. Because the widest configuration of the embedded memory blocks in these FPGAs is only 32 bits, a 33-bit word won't fit in a single ESB or block RAM. The default solution would use two 32-bit-wide SRAM blocks, with the first 32 bits of each word stored in the first RAM block and the 33rd bit of each word stored in the other. For every 33-bit word stored this way, 31 bits of memory are wasted: an expensive proposition.

A more efficient option is to use a combination of LUT RAM and Block RAM or ESBs to store the words. Because FPGA LUTs can be configured as SRAM with 16 one-bit words each, a LUT RAM can be used to store the 33rd bit of each word.

Although designers can and do configure FPGA memory manually to achieve a better result, the process can be time-consuming, and the result can be less than perfect. Again, the most efficient approach is software that automates the process, searching the code for the various words to be stored and computing an optimized combination of LUT RAM and larger RAM blocks that minimizes the amount of memory used.

Example B

In this example, the tool would configure the memory to store the first 32 bits of the 33-bit word in a 32-bit wide block of RAM and store the 33rd bit in a LUT RAM. In this case, such software would reduce the required amount of RAM by nearly 50% from the worst-case implementation. This feature is particularly important in algorithms that must store massive amounts of data, such as adaptive differential pulse code modulation (ADPCM) or speech recognition.

Repetitive calculations

DSP designs are packed with multiple algorithms, such as FFT butterflies or DCTs, that must be executed thousands of times a second. Parallelizing these operations is mandatory to achieve the required throughput in any implementation. With a maximum realistic clock of 80 MHz, FPGAs are much slower than 200 MHz ASICs or 400 MHz DSP processors, so throughput becomes even more critical. A speech recognition algorithm that consumes 90% of a 500 MHz Pentium III will require massive parallelization to execute in even the fastest FPGA.

Architectural synthesis allows designers to achieve the highest level of parallelism by letting them create special resources, based on complex C-code algorithms or super instructions (e.g. FFTs or DCTs), that execute in one or a few clock cycles. For example, an FFT butterfly that requires two multiplications, an add and an accumulation would take at least four clock cycles to complete using conventional ALUs, multipliers and registers. By creating a special resource from a super instruction that includes the two multiplications, the addition and the accumulation, the FFT butterfly can be executed in a single cycle instead of four. Software is now available that automates the creation of these single-cycle, multiple-instruction resources at the designer's request. It then automatically searches the C-code for any instance of the super instruction and instantiates the special resource to execute that code segment in just one clock cycle. Much larger super instructions can also be implemented as single-cycle special resources.

Freeing up FPGA memory usage

The implementation of any algorithm in hardware requires that a controller and microcode be created to schedule the execution of the datapath operations on the hardware resources. The microcode can become long in complex designs, taking up a substantial amount of the FPGA's limited memory resources. At the same time, certain algorithms that must compare the value of the previous sample to the value of the current sample can require substantial amounts of memory to store the data. Because on-FPGA memory is limited, conserving memory can be paramount. If there is not sufficient on-chip RAM, an external memory will be required that will degrade performance and increase board size, power consumption and system costs. Minimizing the storage requirement for the microcode can make the difference between a single-chip FPGA solution or a solution with external memory.

Fortunately, the LUT architecture of FPGAs can be exploited to reduce the total memory requirements for microcode storage. Frequently, there is enough overlap within the instructions that not all the bits are required to identify them. For example, in a design with 13-bit instructions, it may be possible to identify each unique instruction using just four of the bits. In this situation, the code can be stored as four-bit words and then decoded prior to execution using the FPGA LUTs. One LUT is used to decode each of the bits in the original instruction word.

The memory savings can be significant. In a complex design, the microcode can be as long as 5,000 lines. In the hypothetical example with the 13-bit instructions, 5,000 lines of code would require 65,000 bits of SRAM for storage. That is the equivalent of 16 of the block RAMs in a Virtex, or 32 of the ESBs in an Altera APEX. By using LUTs to create small decoders and storing the instructions in four bits of memory each, the required memory can be reduced to only 20,000 bits, plus 13 LUTs: a 70% reduction in the memory required for the code.

Conclusion

Implementing DSP software in hardware can be a daunting task. The predefined logic structures, fixed memory architecture, limited on-chip memory and slow throughput of FPGAs make this task even more difficult when an FPGA is the target medium.

Architectural synthesis tools can substantially improve the silicon efficiency of SoC implementations by automating and optimizing the configuration of register files and on-FPGA memory. These tools can also achieve high levels of throughput required of DSP applications by helping the designer introduce an extraordinary degree of parallelism in the hardware implementation — a critical requirement in slow FPGAs.

About the author

Johannes Steensma is Adelante Technologies' vice president of engineering in the United States, where he manages the development and deployment of complex DSP algorithms for 3G, VoIP, DVB and wireless LAN in high-performance, low-power intellectual property (IP) cores for ASIC and FPGA implementation. He received his MSEE degree from the University of Twente, The Netherlands, in 1989 and his Ph.D. (summa cum laude) in Applied Sciences from the Katholieke Universiteit Leuven, Belgium, in 1994. He was a researcher at the Interuniversity Microelectronics Center (IMEC) from 1990 to 1994, where he worked on test techniques for high-speed customized datapath architectures. From 1995 to 1996, he served as a design consultant in the field of digital signal processing for the Center of Microelectronics (CME) in the Netherlands. Steensma co-founded Dedris Embedded Algorithms, which was acquired by Adelante Technologies in 1999. He can be reached at:
Johannes.Steensma@adelentetech.com