Function Unit Specialization through Code Analysis
This paper was actually accidental: I was enrolled in a graduate class and had to come up with a project for the quarter. I had a very out-there idea of taking what we had learned about VLIW machines and applying it to configurable computing architectures.
A basic tenant of configurable computing, and the one that got me hooked as a senior in undergraduate engineering, as the idea that the lines between hardware and software could be blurred. Software had always been tailored for the hardware that it ran on, whether it was done that way by hand or if left up to the compiler. Well, what if you could design hardware that better fit the needs of a particular application, automatically? My idea was to use the well-known tricks of scheduling for VLIW machines and use those as a set of “hints” for the what the ideal hardware could look like.
The class professor wasn’t too hot on this idea, but let me do it anyway ‘cuz I’m tenacious. So for the next few weeks I poked around an open-source compiler and messed with making a mythical hardware design out of the compiler’s scheduler and ended up with some interesting results.
For example, most DSP’s have a single instruction multiply-and-add. This is because the most common thing they do is add and multiply (you can find a bunch of these in the my previous paper). Well it turns out that you get only modest savings from doing that. My research showed that for most implementations, making faster individual multiply and add units performed better than making a monolithic one.
The results were interesting enought that the prof convinced me to submit my work to a handful of conferences. Ha! This was literally a stretched out homework assignment and he expected it to be accepted at a research conference? Well, it did, and to one of the most prestigious ones — ICCAD. Here’s a link to my paper, the abstract and embedded copy can be found below.
Many previous attempts at ASIP synthesis have employed template matching techniques to target function units to application code, or directly design new units to extract maximum performance. This paper presents an entirely new approach to specializing hardware for application specific needs. In our framework of a parametrized VLIW processor, we use a post-modulo scheduling analysis to reduce the allocated hardware resources while increasing the code’s performance. Initial results indicate significant savings in area, as well as optimizations to increase FIR filter code performance 200% to 300%.
Optimizing FPGA-based vector product designs
This post discusses one of my favorite academic projects. This work was largely done while I was a visiting researcher at the wonderful Imperial College of London.
I remember literally staring at the math behind my lab’s latest research work. I was a student at UCLA, and our lab just was on a high from having made the cover story on Scientific American. The lab worked on the (then hot hot hot) topic of reconfigurable computing. Specifically, we worked on machines that would perform image correlation very quickly; my specific project performed at 12FPS running at just a 16MHz clock!
What struck me was that what the machine was doing could represent a more generic way of how machines performed mathematics — vector math, to be precise. So over the course of about a summer, I wrote the following paper (I also wrote the actual C++ code that did the work, but can’t track it down!). Here’s what that code could do:
Re-write the following equation, using simple arithmetic, so that all multiplies are done using powers of two, and with the fewest number of adds or subtracts:
There are 7 plus signs already — how many more will you need so that there are only powers of two? (Answer at the end of this post!)
This is exactly the sort of optimization that is needed to minimize hardware design complexity (a multiply by a power of two is just a shift, after all). I loved this project because it transformed this algebraic problem into a graph coloring problem, and by doing so, allows computers to do all the dirty work. For example, if you gave the code a fixed point representation of the DCT, it would optimize and transform the 8×8 matrix multiply to something that approximates the widely used version by Feig & Winograd. Except it took the computer just about a day to do it.
This paper presents a method, called multiple constant multiplier trees (MCMTs), for producing optimized reconfigurable hardware implementations of vector products. An algorithm for generating MCMTs has been developed and implemented, which is based on a novel representation of common subexpressions in constant data patterns. Our optimization framework covers a wider solution space than previous approaches; it also supports exploitation of full and partial run-time reconfiguration as well as technology-specific constraints, such as fanout limits and routing. We demonstrate that while distributed arithmetic techniques require storage size exponential in the number of coefficients, the resource utilization of MCMTs usually grows linearly with problem size. MCMTs have been implemented in Xilinx 4000 and Virtex FPGAs, and their size and speed efficiency are confirmed in comparisons with Xilinx LogiCore and ASIC implementations of FIR filter designs. Preliminary results show that the size of MCMT circuits is less than half of that of comparable distributed arithmetic cores.
Download the full paper (PDF), or browse through the paper below (it looks blurry cuz the font wasn’t properly embedded, crap.)