Benchmark results show TensorRT-LLM delivering up to 8 times higher throughput on H100 GPUs than an Nvidia A100 running without the software. Nvidia estimates these performance gains translate into a 3 to 5.6 times lower total cost of ownership when deploying language models. The software also makes popular models easily deployable, removing the need for manual tuning. Early access to the package is available now, with a full release expected in the coming weeks.
Key takeaways:
- Nvidia has announced new open-source software, TensorRT-LLM, aimed at improving the performance of large language model inference on its latest H100 GPU accelerators. The company says the software can double inference speed on the H100.
- TensorRT-LLM ships with fully optimized versions of numerous large language models commonly used in production. It incorporates techniques such as in-flight batching and reduced-precision numerical formats to maximize utilization of Nvidia’s H100 GPUs (a conceptual sketch of in-flight batching follows this list).
- Benchmarks show up to 8 times higher throughput from TensorRT-LLM on H100s than from an Nvidia A100 running without the software; Nvidia estimates this translates into a 3 to 5.6 times lower total cost of ownership when deploying language models.
- TensorRT-LLM makes popular models easily deployable out of the box, with pre-optimized versions of GPT-J, Llama, and other models included. The software is integrated with Nvidia’s Triton Inference Server and NeMo toolkit to further simplify deployment workflows (a hedged client sketch appears below).
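To make the in-flight batching idea mentioned above concrete, here is a minimal, framework-agnostic Python sketch. It is not TensorRT-LLM code; the names (`Request`, `serve`, the token counts) are illustrative assumptions. The point it shows is that finished sequences leave the batch immediately and waiting requests take their slots, instead of the whole batch stalling until its longest request completes.

```python
# Minimal sketch of in-flight (continuous) batching. Illustrative only,
# not TensorRT-LLM code.
from collections import deque
from dataclasses import dataclass, field
import random

@dataclass
class Request:
    rid: int
    tokens_left: int                     # tokens this request still needs
    output: list = field(default_factory=list)

def serve(requests, max_batch_size=4):
    waiting = deque(requests)
    active, step = [], 0
    while waiting or active:
        # Admit new requests into any free batch slots (the "in-flight" part).
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        # One decoding step: every active request produces one token.
        for req in active:
            req.output.append(f"tok{step}")
            req.tokens_left -= 1
        # Retire finished requests right away, freeing slots for the next step.
        for req in [r for r in active if r.tokens_left == 0]:
            print(f"request {req.rid} finished at step {step} "
                  f"with {len(req.output)} tokens")
        active = [r for r in active if r.tokens_left > 0]
        step += 1

if __name__ == "__main__":
    random.seed(0)
    serve([Request(rid=i, tokens_left=random.randint(2, 8)) for i in range(8)])
```

A production scheduler also has to manage per-slot KV-cache memory, but the admit-decode-retire loop above is the core of the technique.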
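Because the deployment path runs through Triton, a client typically just sends an inference request over HTTP using Triton's standard Python client. The sketch below assumes such a setup; the model name `ensemble` and the tensor names `text_input`, `max_tokens`, and `text_output` are assumptions that depend on how the TensorRT-LLM model repository is configured, not details taken from the announcement.

```python
# Hypothetical client call to a Triton Inference Server hosting a TensorRT-LLM
# engine. Model and tensor names are assumptions; check the repository's
# config.pbtxt for the names your deployment actually exposes.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Triton string tensors are numpy object arrays.
prompt = np.array([["Summarize in-flight batching in one sentence."]], dtype=object)
max_tokens = np.array([[64]], dtype=np.int32)

inputs = [
    httpclient.InferInput("text_input", list(prompt.shape), "BYTES"),
    httpclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
]
inputs[0].set_data_from_numpy(prompt)
inputs[1].set_data_from_numpy(max_tokens)

result = client.infer(model_name="ensemble", inputs=inputs)
print(result.as_numpy("text_output"))
```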