TensorRT-LLM optimizes LLM inference on Nvidia GPUs in four ways: it includes ready-to-run, inference-optimized versions of the latest LLMs; it provides a software library that lets those inference-ready models run across multiple GPUs and servers simultaneously; it introduces in-flight batching, a new scheduler that improves GPU utilization; and it is tuned to take advantage of the H100’s Transformer Engine. The software is expected to significantly speed up live applications that run LLMs on Nvidia GPUs, including the flagship H100 accelerator.
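In-flight batching (often called continuous batching) means the scheduler retires a sequence from the batch as soon as it finishes and immediately admits a waiting request into its slot, rather than letting the whole batch drain while one long generation completes. The snippet below is a minimal, hypothetical Python sketch of that scheduling idea only; it is not TensorRT-LLM code, and every name in it (`Request`, `step_fn`, `in_flight_batching`) is illustrative.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    """One generation request tracked by the toy scheduler (illustrative)."""
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

    def is_done(self) -> bool:
        return len(self.generated) >= self.max_new_tokens

def in_flight_batching(pending: deque, step_fn, max_batch: int = 8):
    """Toy continuous-batching loop.

    `step_fn(active)` stands in for one decode step of the engine and
    returns one new token per active request. Finished requests are
    retired immediately and their slots refilled from `pending`, so the
    batch stays full instead of draining down to a single straggler.
    """
    active: list[Request] = []
    finished: list[Request] = []
    while pending or active:
        # Refill free slots with waiting requests (the "in-flight" part).
        while pending and len(active) < max_batch:
            active.append(pending.popleft())
        # One decode step over the current batch.
        for req, token in zip(active, step_fn(active)):
            req.generated.append(token)
        # Retire completed requests without waiting for the rest of the batch.
        finished.extend(r for r in active if r.is_done())
        active = [r for r in active if not r.is_done()]
    return finished
```

A fixed-batch scheduler would instead wait until every sequence in the batch reached its stopping condition; with highly variable output lengths, that leaves slots idle, which is the utilization gap in-flight batching is meant to close.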
Key takeaways:
- Nvidia is planning to release a new open-source software library, TensorRT-LLM, which is expected to double the performance of the H100 for running inference on large language models (LLMs).
- The software will be integrated into the Nvidia NeMo LLM framework as part of the Nvidia AI Enterprise software suite and will support several Nvidia GPUs beyond the H100.
- TensorRT-LLM optimizes LLM inference performance on Nvidia GPUs in four ways: inclusion of ready-to-run, inference-optimized versions of the latest LLMs; a software library that allows LLMs to run automatically across multiple GPUs and servers; in-flight batching for improved GPU utilization; and optimization for the H100’s Transformer Engine (see the usage sketch after this list).
- With TensorRT-LLM, an H100 performs inference twice as fast as an H100 running without it and eight times faster than the previous-generation A100, which also improves power efficiency.
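For context, the kind of workflow the library targets might look like the sketch below. It uses the high-level Python API that appears in later TensorRT-LLM releases; the class names, the `tensor_parallel_size` argument, and the model identifier are assumptions for illustration and may not match the release described in this announcement.

```python
# Hedged sketch of TensorRT-LLM's high-level Python API as found in later
# releases; class names and arguments here are assumptions, not a spec.
from tensorrt_llm import LLM, SamplingParams

# Load an inference-optimized engine for a Llama checkpoint, sharding it
# across two GPUs with tensor parallelism (hypothetical example model).
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=2,
)

params = SamplingParams(max_tokens=64, temperature=0.8)

outputs = llm.generate(
    ["Summarize what in-flight batching does in one sentence."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```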