The RAG method splits the large prompt into smaller sections and uses embeddings to select only the parts the language model actually needs to answer a query. The author built a RAG API and wired it into his existing system. The API pre-computes embeddings for data that rarely changes and, at query time, scores the similarity between the user prompt and those pre-computed embeddings. This lets the system quickly identify the most relevant information, shrinking the context length and improving the speed of the voice assistant.
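To make the retrieval step concrete, here is a minimal sketch of that pipeline. It is illustrative rather than the author's actual API: the embedding model (sentence-transformers) and the example smart-home documents are assumptions, and the core idea is simply pre-computed document embeddings plus cosine similarity against the query.

```python
# Minimal retrieval sketch, assuming sentence-transformers for embeddings
# (the author's actual embedding model/API is not specified in the post).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Pre-compute embeddings once for data that rarely changes,
# e.g. descriptions of smart-home entities (hypothetical examples).
documents = [
    "light.living_room: dimmable ceiling light in the living room",
    "climate.bedroom: thermostat in the bedroom",
    "lock.front_door: smart lock on the front door",
]
doc_embeddings = model.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    # With normalized vectors, the dot product equals cosine similarity.
    scores = doc_embeddings @ q
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

print(retrieve("turn down the heat in the bedroom"))
```

Because the document embeddings are computed ahead of time, the only per-request work is embedding the short user query and a cheap dot product, which is what keeps the retrieval step fast.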
Key takeaways:
- The author runs his smart home on open-source technology and has been working to speed up his voice assistant, which currently responds slowly.
- Language model inference has two phases, 'prefill' (processing the input prompt) and 'decode' (generating output tokens); prefill takes up the majority of inference time for long contexts, since its cost grows with prompt length.
- The author introduces RAG (Retrieval Augmented Generation), which uses 'embeddings' to map text into a high-dimensional vector space where semantically similar texts land close together, in order to reduce the context length and thereby speed up prefill (see the sketch after this list).
- By using RAG, the author made his system more scalable and improved the speed and efficiency of his voice assistant.
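The context-length reduction itself can be sketched as a prompt-assembly step: instead of inlining every entity the smart home knows about, only the retrieved snippets go into the prompt. The template below is hypothetical and reuses the `retrieve` helper from the sketch above.

```python
# Hedged sketch: build a short prompt from only the retrieved context.
# Depends on retrieve() from the earlier snippet; the template is invented.
def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query, k=2))
    return (
        "You are a smart-home voice assistant.\n"
        "Relevant devices:\n"
        f"{context}\n\n"
        f"User: {query}\nAssistant:"
    )

prompt = build_prompt("turn down the heat in the bedroom")
# A prompt containing only the relevant devices keeps the prefill phase
# short, which is where the latency win comes from.
print(prompt)
```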