The new method, named HLSTransform, uses high-level synthesis (HLS) to enable rapid prototyping of FPGA designs without writing code at the register-transfer level (RTL). The synthesized FPGA designs achieve up to a 12.75x and an 8.25x reduction in energy used per token compared to an Intel Xeon Broadwell E5-2686 v4 CPU and an NVIDIA RTX 3090 GPU, respectively. The designs also increase inference speed by up to 2.46x over the CPU and sustain 0.53x the speed of the RTX 3090 despite the GPU's roughly 4x higher base clock rate. The authors have open-sourced their code and documented the synthesis process, hoping to democratize the use of FPGAs in transformer inference and inspire research into energy-efficient inference methods.
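To make the contrast with RTL concrete, here is a minimal sketch of the HLS programming style, assuming a C++ flow such as AMD/Xilinx Vitis HLS. It shows a matrix-vector multiply, the dominant operation in transformer inference, written as ordinary loops with pragmas directing the synthesizer; the kernel, names, and sizes below are illustrative and not taken from the authors' code.

```cpp
// Illustrative HLS kernel (not the authors' actual code): a matrix-vector
// multiply written in plain C++. Rather than describing registers and
// wires as in RTL, the developer writes ordinary loops and uses pragmas
// to request hardware parallelism from the synthesizer.
#define DIM 256 // hypothetical model dimension

extern "C" void matvec(const float w[DIM][DIM], const float x[DIM],
                       float out[DIM]) {
    // Split x and the columns of w across on-chip memory banks so the
    // unrolled inner loop can read several operands in the same cycle.
#pragma HLS ARRAY_PARTITION variable=x cyclic factor=16
#pragma HLS ARRAY_PARTITION variable=w cyclic factor=16 dim=2
rows:
    for (int i = 0; i < DIM; i++) {
        float acc = 0.0f;
    cols:
        for (int j = 0; j < DIM; j++) {
            // Request overlapping iterations (pipelining) and replicated
            // multiply-accumulate hardware (unrolling) from the tool.
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=16
            acc += w[i][j] * x[j];
        }
        out[i] = acc;
    }
}
```

Because the same C++ compiles and runs on an ordinary CPU, the kernel can be unit-tested in software before synthesis, which is what makes this prototyping loop fast compared to writing RTL directly.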
Key takeaways:
- The researchers developed an accelerator for transformer inference, specifically Llama 2, using high-level synthesis (HLS) on Field Programmable Gate Arrays (FPGAs) to address the high energy demands of GPUs.
- The FPGA designs synthesized with HLS achieved up to a 12.75x reduction in energy used per token compared to an Intel Xeon CPU and up to an 8.25x reduction compared to an NVIDIA RTX 3090 GPU (see the energy-per-token sketch after this list).
- The FPGA designs also increased inference speeds by up to 2.46x compared to the CPU and sustained 0.53x the speed of the RTX 3090 despite the GPU's roughly 4x higher base clock rate.
- The researchers have open-sourced their code and documented their steps for synthesis, hoping to democratize the use of FPGAs in transformer inference and inspire research into energy-efficient inference methods.
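For context on the headline metric, energy per token is commonly computed as average power draw multiplied by wall-clock inference time, divided by the number of tokens generated; the paper's exact measurement setup may differ. A minimal sketch with hypothetical numbers:

```cpp
#include <cstdio>

// Illustrative energy-per-token calculation (hypothetical numbers, not
// measurements from the paper): joules per token =
// average power (W) * inference time (s) / tokens generated.
int main() {
    const double avg_power_watts = 30.0;  // hypothetical device power draw
    const double inference_secs  = 10.0;  // hypothetical wall-clock time
    const double tokens          = 600.0; // hypothetical tokens generated
    const double joules_per_token =
        avg_power_watts * inference_secs / tokens;
    std::printf("energy per token: %.2f J\n", joules_per_token); // 0.50 J
    return 0;
}
```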