Nvidia’s new software doubles inference speed on H100 GPUs

Sep 17, 2023 - aibeat.co
Nvidia has announced TensorRT-LLM, a new open-source software package designed to significantly improve the performance of large language model inference on its H100 GPU accelerators. According to Nvidia, the software, set for release in the coming weeks, can double inference speed on the H100. It ships with fully optimized versions of several widely used large language models and applies techniques such as in-flight batching and half-precision numerical formats to maximize GPU utilization.
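For context, this is roughly what serving an optimized model looks like through TensorRT-LLM's high-level Python API, as documented in the project's quick-start guides for later releases; the checkpoint name and sampling settings below are illustrative, not from the article:

```python
# Minimal sketch of TensorRT-LLM's high-level Python API (per the
# project's published quick start; the checkpoint name is illustrative).
from tensorrt_llm import LLM, SamplingParams

prompts = ["What does in-flight batching do?"]
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Constructing the LLM builds an optimized engine for the target GPU,
# applying the library's fused kernels, reduced-precision formats, and
# in-flight batching automatically.
llm = LLM(model="meta-llama/Llama-2-7b-hf")

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```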

Benchmark results show TensorRT-LLM delivering up to 8 times the throughput on an H100 compared with an A100 running without the software. Nvidia estimates these gains translate into a 3x to 5.6x reduction in total cost of ownership when deploying language models. The software also makes popular models deployable without manual tuning. Early access to the package is available now, with the full release expected in the coming weeks.
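As a back-of-the-envelope illustration of how a throughput gain turns into a total-cost-of-ownership reduction (the hourly GPU prices below are placeholder assumptions, not figures from Nvidia or the article):

```python
# Toy TCO arithmetic. Hourly GPU prices are placeholder assumptions,
# NOT figures from Nvidia or the article; only the 8x throughput
# ratio comes from the reported benchmark.
a100_throughput = 1.0      # normalized tokens/sec, A100 baseline
h100_throughput = 8.0      # "up to 8x" with TensorRT-LLM on H100
a100_price = 2.0           # hypothetical $/GPU-hour
h100_price = 4.0           # hypothetical $/GPU-hour

# Cost per generated token scales with price / throughput.
a100_cost_per_token = a100_price / a100_throughput   # 2.0
h100_cost_per_token = h100_price / h100_throughput   # 0.5
print(f"TCO reduction: {a100_cost_per_token / h100_cost_per_token:.1f}x")
# -> 4.0x, which falls inside the 3x-5.6x range Nvidia cites
```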

Key takeaways:

  • Nvidia has announced TensorRT-LLM, a new open-source software package aimed at improving the performance of large language model inference on its H100 GPU accelerators. The software can double inference speed on the H100.
  • TensorRT-LLM ships with fully optimized versions of numerous large language models commonly used in production. It applies techniques such as in-flight batching and half-precision numerical formats to maximize utilization of Nvidia's H100 GPUs (a toy sketch of in-flight batching follows this list).
  • Benchmark results show TensorRT-LLM delivering up to 8 times the throughput on an H100 compared with an A100 running without the software. Nvidia estimates these gains translate into a 3x to 5.6x reduction in total cost of ownership when deploying language models.
  • TensorRT-LLM makes popular models deployable out of the box, with pre-optimized versions of GPT-J, Llama, and other models included. The software is integrated with Nvidia's Triton Inference Server and NeMo toolkit to further simplify deployment workflows.
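The in-flight batching mentioned above is the key scheduling idea: finished sequences leave the batch immediately and queued requests take their slots, rather than the whole batch waiting on its longest sequence. Here is a toy Python sketch of the scheduling logic (conceptual only, not TensorRT-LLM code):

```python
# Conceptual sketch of in-flight (continuous) batching. This models
# only the scheduling idea; it is not TensorRT-LLM code.
from collections import deque

MAX_BATCH = 4  # hypothetical number of concurrent batch slots

def decode_steps(requests):
    """requests: list of (request_id, tokens_to_generate) pairs."""
    waiting = deque(requests)
    active = {}  # request_id -> tokens still to generate
    steps = 0
    while waiting or active:
        # Admit queued requests into any free slots before each step.
        while waiting and len(active) < MAX_BATCH:
            rid, length = waiting.popleft()
            active[rid] = length
        # One decoding step: every active sequence emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]  # slot frees up immediately
        steps += 1
    return steps

# Short requests no longer wait for the longest one in their batch.
reqs = [("a", 2), ("b", 8), ("c", 3), ("d", 8), ("e", 2), ("f", 2)]
print("decoding steps:", decode_steps(reqs))
```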