The RAG method splits the large prompt into smaller sections and uses embeddings to select only the parts the language model actually needs to answer a query. The author built a RAG API and wired it into his existing system. The API pre-computes embeddings for data that rarely changes and, at query time, scores the similarity between the user prompt and those pre-computed embeddings. This lets the system quickly identify the most relevant information, shrinking the context length and improving the speed of the voice assistant.
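To make the retrieval step concrete, here is a minimal sketch of that pipeline. It is illustrative rather than the author's actual API: the embedding model (sentence-transformers) and the example smart-home documents are assumptions, and the core idea is simply pre-computed document embeddings plus cosine similarity against the query.

```python
# Minimal retrieval sketch, assuming sentence-transformers for embeddings
# (the author's actual embedding model/API is not specified in the post).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Pre-compute embeddings once for data that rarely changes,
# e.g. descriptions of smart-home entities (hypothetical examples).
documents = [
    "light.living_room: dimmable ceiling light in the living room",
    "climate.bedroom: thermostat in the bedroom",
    "lock.front_door: smart lock on the front door",
]
doc_embeddings = model.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    # With normalized vectors, the dot product equals cosine similarity.
    scores = doc_embeddings @ q
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

print(retrieve("turn down the heat in the bedroom"))
```

Because the document embeddings are computed ahead of time, the only per-request work is embedding the short user query and a cheap dot product, which is what keeps the retrieval step fast.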
Key takeaways:
- The author runs his smart home on open-source technology and has been working to speed up his voice assistant, which currently responds slowly.
- Language model inference has two phases, 'prefill' (processing the input prompt) and 'decode' (generating output tokens); prefill takes up the majority of inference time for long contexts, since its cost grows with prompt length.
- The author introduces RAG (Retrieval Augmented Generation), which uses 'embeddings' to map text into a high-dimensional vector space where semantically similar texts land close together, in order to reduce the context length and thereby speed up prefill (see the sketch after this list).
- By using RAG, the author made his system more scalable and improved the speed and efficiency of his voice assistant.
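The context-length reduction itself can be sketched as a prompt-assembly step: instead of inlining every entity the smart home knows about, only the retrieved snippets go into the prompt. The template below is hypothetical and reuses the `retrieve` helper from the sketch above.

```python
# Hedged sketch: build a short prompt from only the retrieved context.
# Depends on retrieve() from the earlier snippet; the template is invented.
def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query, k=2))
    return (
        "You are a smart-home voice assistant.\n"
        "Relevant devices:\n"
        f"{context}\n\n"
        f"User: {query}\nAssistant:"
    )

prompt = build_prompt("turn down the heat in the bedroom")
# A prompt containing only the relevant devices keeps the prefill phase
# short, which is where the latency win comes from.
print(prompt)
```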