The API was created out of frustration with vLLM, which caused problems when running large models across multiple cards. EricLLM is multi-threaded and rivals state-of-the-art solutions for throughput and model loading time. It uses the ExllamaV2 engine and can be run inside the text-generation-webui environment. It is still in development and has some rough edges to polish. Although it currently runs at roughly one third the speed of vLLM, it remains one of the fastest batching APIs available and supports the exl2 format with variable bitrate.
Key takeaways:
- EricLLM is a fast batching API designed to serve LLM models, offering a feature-compatible alternative to vLLM with better performance than text-generation-webui.
- The API can be launched with multiple workers, and in a dual-GPU setup throughput can be increased further with the --gpu_balance switch.
- EricLLM supports options for setting the model directory, host, port, maximum sequence length, maximum input length, GPU split, and the number of worker processes to use (see the launch sketch after this list).
- Despite being slower than vLLM, EricLLM is one of the fastest batching APIs currently available, supporting the exl2 format with variable bitrate and loading models faster than vLLM.
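
As a rough illustration of the launch options above, here is a minimal sketch of starting the server. Apart from --gpu_balance, which is named explicitly, the exact flag spellings (--model_directory, --host, --port, --max_seq_len, --max_input_len, --gpu_split, --num_workers) and the script name ericLLM.py are assumptions inferred from the options described, so check the project's README for the actual interface.

```bash
# Hypothetical launch sketch: flag names other than --gpu_balance are assumed,
# not confirmed here; consult the EricLLM README for the real options.
# --gpu_split gives a per-GPU memory allocation for a dual-GPU setup, and
# --gpu_balance spreads the worker processes across both cards.
python ericLLM.py \
    --model_directory /models/my-model-exl2 \
    --host 0.0.0.0 \
    --port 8000 \
    --max_seq_len 4096 \
    --max_input_len 2048 \
    --gpu_split 20,24 \
    --num_workers 4 \
    --gpu_balance
```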