Apple collaborates with NVIDIA to research faster LLM performance - 9to5Mac
Dec 18, 2024 - 9to5mac.com
Apple engineers have collaborated with NVIDIA to speed up text generation in large language models (LLMs) by integrating Apple's Recurrent Drafter (ReDrafter) technique into NVIDIA's TensorRT-LLM framework. ReDrafter, a speculative decoding approach that combines beam search and dynamic tree attention, was open-sourced by Apple earlier this year and has shown significant improvements in text generation speed. With the integration, ML developers using NVIDIA GPUs can see a reported 2.7x speed-up in generated tokens per second for greedy decoding, which reduces both latency and computational cost.
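ReDrafter belongs to the family of speculative decoding methods, in which a cheap draft model proposes several tokens and the full LLM verifies them in a single pass. The Python sketch below illustrates only that general draft-and-verify idea for greedy decoding. It is not Apple's implementation and does not use the TensorRT-LLM API: the function names, the toy `target_next` and `draft_next` models, and the `draft_len` parameter are all hypothetical, and the beam search and dynamic tree attention that ReDrafter adds are left out.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to its greedy next token


def greedy_decode(target_next: NextToken, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: one target-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_greedy_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    draft_len: int = 4,
) -> Tuple[List[Token], int]:
    """Draft-and-verify decoding. Returns (tokens, number of target passes)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1. The cheap draft model proposes draft_len tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(draft_len):
            proposal.append(draft_next(tokens + proposal))

        # 2. The target model checks every proposed position. A real system
        #    scores all positions in one batched forward pass; this toy loop
        #    simulates that and counts it as a single pass.
        target_passes += 1
        accepted: List[Token] = []
        for guess in proposal:
            correct = target_next(tokens + accepted)
            accepted.append(correct)  # the target's token is always kept
            if guess != correct:
                break  # discard the rest of the draft after the first mismatch
        else:
            # Every guess matched, so the same pass also yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


# Toy stand-ins for real models (assumptions for illustration only).
def target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    # Agrees with the target except when the last token is 2.
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    baseline = greedy_decode(target_next, [0], 20)
    fast, passes = speculative_greedy_decode(target_next, draft_next, [0], 20)
    print("same output as greedy:", fast == baseline)
    print("target passes:", passes, "vs", 20, "for plain greedy decoding")
```

Because the target model's own token is kept at every position, the output matches plain greedy decoding exactly. Any speed-up comes from needing fewer target-model passes, and it depends on how often the draft model guesses correctly.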
To enable the integration, NVIDIA added new operators or exposed existing ones in TensorRT-LLM, improving its support for sophisticated models and decoding methods. The work is particularly relevant to production applications that rely on LLMs, where more efficient inference lowers latency for end users. More details can be found on Apple's and NVIDIA's websites.
Key takeaways:
Apple and NVIDIA collaborated to enhance text generation performance with large language models using Apple's Recurrent Drafter (ReDrafter) technique.
ReDrafter combines beam search and dynamic tree attention to deliver faster text generation with state-of-the-art performance.
Integrating ReDrafter into NVIDIA TensorRT-LLM resulted in a 2.7x speed-up in tokens generated per second for greedy decoding on NVIDIA GPUs.
This advancement can significantly reduce latency and computational costs for LLM applications, benefiting developers using NVIDIA GPUs.