
HLSTransform: Energy-Efficient Llama 2 Inference on FPGAs Via High Level Synthesis

May 10, 2024 - news.bensbites.com
The article discusses the development of an accelerator for Llama 2, an open-source state-of-the-art large language model (LLM) based on the transformer architecture, using high-level synthesis (HLS) on Field Programmable Gate Arrays (FPGAs). This work responds to the high energy demands of Graphics Processing Units (GPUs), currently the leading hardware accelerators for deep learning. The high energy use of GPUs not only raises environmental concerns but also increases operational costs and makes them unsuitable for edge computing.

The new method, named HLSTransform, allows for rapid prototyping of FPGA designs without writing code at the register-transfer level (RTL). The FPGA designs synthesized with HLS achieve up to a 12.75x and 8.25x reduction in energy used per token compared to an Intel Xeon Broadwell E5-2686 v4 CPU and an NVIDIA RTX 3090 GPU, respectively. HLSTransform also increases inference speeds by up to 2.46x compared to the CPU and maintains 0.53x the speed of the RTX 3090 despite the GPU's 4x higher base clock rate. The authors have open-sourced their code and documented the synthesis process, hoping to democratize the use of FPGAs for transformer inference and inspire research into energy-efficient inference methods.

Key takeaways:

  • The researchers developed an accelerator for the transformer model Llama 2 using high-level synthesis (HLS) on Field Programmable Gate Arrays (FPGAs) to address the high energy demands of GPUs.
  • The FPGA designs synthesized with HLS achieved up to a 12.75x reduction in energy used per token compared to an Intel Xeon CPU and up to 8.25x reduction compared to an NVIDIA RTX 3090 GPU.
  • The FPGA designs also increased inference speeds by up to 2.46x compared to the CPU and maintained 0.53x the speed of an RTX 3090 GPU despite the GPU's 4x higher base clock rate.
  • The researchers have open-sourced their code and documented their steps for synthesis, hoping to democratize the use of FPGAs in transformer inference and inspire research into energy-efficient inference methods.
